I'm working on a project that involves streaming data from Twitter and sorting similar tweets into buckets. I want to be able to determine which buckets are growing fastest within a sliding window of the last 100,000 tweets. Put differently: out of the last 100,000 tweets, which buckets received the most?
My first intuition was to do a count++ on a bucket whenever it gets a new tweet, but then the bucket that received the tweet from 100,001 tweets ago needs a count-- (since I want to measure growth over the last 100,000 tweets only). That would mean keeping track of which bucket each of the last 100,000 tweets went to, which sounds inefficient. Basically, I want there to be a fixed number of points in circulation that can be assigned to buckets. If a bucket gets a new tweet, it also gets a point, and the bucket holding the oldest point still in circulation loses that point.
For what it's worth, I'm working in Java, so bonus points for any answers oriented towards that.
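To make the idea concrete, here is a rough sketch of the count++/count-- scheme described above. It assumes each bucket can be identified by an int ID; the class and method names (SlidingBucketCounter, record, count) are made up for illustration. The "which bucket got the tweet 100,001 tweets ago" bookkeeping is a fixed-size circular array of bucket IDs, so each new tweet costs constant time, and a 100,000-entry int array is about 400 KB.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-bucket counts over the last N tweets.
final class SlidingBucketCounter {
    private final int[] window;  // bucket ID of each of the last N tweets (circular)
    private int next = 0;        // slot that will be overwritten by the next tweet
    private long seen = 0;       // total number of tweets recorded so far
    private final Map<Integer, Integer> counts = new HashMap<>();

    SlidingBucketCounter(int windowSize) {
        window = new int[windowSize];
    }

    // Record that a new tweet was assigned to bucketId.
    void record(int bucketId) {
        if (seen >= window.length) {
            // The window is full: the slot we are about to overwrite holds
            // the oldest tweet, so its bucket loses one point.
            int evicted = window[next];
            // Returning null removes the entry once its count would hit zero.
            counts.computeIfPresent(evicted, (k, v) -> v == 1 ? null : v - 1);
        }
        window[next] = bucketId;
        next = (next + 1) % window.length;
        seen++;
        counts.merge(bucketId, 1, Integer::sum);  // the new bucket gains a point
    }

    // Number of tweets this bucket received within the current window.
    int count(int bucketId) {
        return counts.getOrDefault(bucketId, 0);
    }
}
```

With new SlidingBucketCounter(100_000), calling record(bucketId) for every incoming tweet keeps count(bucketId) equal to that bucket's share of the last 100,000 tweets. Finding the fastest-growing buckets would still require scanning or sorting the counts map, which this sketch does not attempt to optimize.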