Bursty and Heirachical Structure in Streams
by Jon Kleinberg
Streams can be modeled as a stated automaton (infinite or finite) where the different states correspond to higher or lower rates of appearance of data (word frequency) with state transitions having cost. This method appears to work extremely well. I am unhappy only with the fact hat I currently do not know exactly how one learns the model and finds the ideal state sequence. I clearly need to read the reference on HMMs. Examples from the paper appear to very well mark the burstiness of topics. This paper sadly does not make any consideration on topics that form in the corpus. Perhaps multiple words would share extremely similar state sequences and could be identified as combining for a topic given a text similarity analysis of some sort. The author also brings up the important point that information can be gathered from the first message that marks a state transition to a higher state. This message might be the first point on a specific sum-topic, or simply somehow increases interest in the topic. I would like to learn how to implement this algorihtm and analyze the data it produces. It appears to be well motivated and strongly defined in terms of finding topics that are becoming more important in a corpus.
As of 1/23/06 I have implemented the bursty stream code for the most part. I have not yet implemented the code to weight the different bursts, but I don't think that aspect should be difficult.
Problem solved: Detecting increased frequency of events that happen in streams. (Word frequency in document streams...)