BlogPulse: Automated Trend Discovery for Weblogs
by Natalie S. Glance, Matthew Hurst and Takashi Tomokiyo
BlogPulse maintains a list of blogs to crawl. At the time of the paper the list held 100,000 blogs; I think it is now over 1 million. They naively crawl every blog each day rather than trying to predict how often a blogger might update. Using daily and total Lucene indexes of the content, they run several processing algorithms over the corpus to determine key phrases, people and topics.
Most algorithms are unmotivated in this paper. They reference 'Analyst Workbench', and I don't know what that entails: perhaps an earlier paper, an open toolkit, or their own software?
They use key bi-grams, bursty phrases [Kleinberg] and suffix trees. They also use the measures of 'phraseness' and 'informativeness' [Language Model to Keyword Extraction].
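To make the two measures concrete for myself, here is a minimal sketch, assuming they are computed as pointwise KL divergences as in the language-model keyphrase work cited above: phraseness compares a bigram's observed probability with what its two words would give if independent, and informativeness compares its probability in today's (foreground) text with a background corpus. The function names, the add-one smoothing and the toy data are my own assumptions, not taken from the paper.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def pointwise_kl(p, q):
    """Contribution of one event to KL(p || q)."""
    return p * math.log(p / q)


def score_bigrams(fg_tokens, bg_tokens):
    """Map each foreground bigram to (phraseness, informativeness)."""
    fg_uni = Counter(fg_tokens)
    fg_bi = Counter(bigrams(fg_tokens))
    bg_bi = Counter(bigrams(bg_tokens))
    n_uni = sum(fg_uni.values())
    n_fg_bi = sum(fg_bi.values())
    n_bg_bi = sum(bg_bi.values())
    # Add-one smoothing over all bigram types seen (an assumption of this sketch).
    vocab_bi = len(set(fg_bi) | set(bg_bi))

    scores = {}
    for (w1, w2), count in fg_bi.items():
        p_fg = count / n_fg_bi
        # Probability of the pair if the two words were independent.
        p_indep = (fg_uni[w1] / n_uni) * (fg_uni[w2] / n_uni)
        # Smoothed probability of the pair in the background corpus.
        p_bg = (bg_bi[(w1, w2)] + 1) / (n_bg_bi + vocab_bi)
        scores[(w1, w2)] = (pointwise_kl(p_fg, p_indep),
                            pointwise_kl(p_fg, p_bg))
    return scores


if __name__ == "__main__":
    today = "hurricane katrina hits new orleans hurricane katrina relief hurricane katrina".split()
    archive = "the new movie hits the new year the weather".split()
    ranked = sorted(score_bigrams(today, archive).items(),
                    key=lambda kv: sum(kv[1]), reverse=True)
    for phrase, (phr, inf) in ranked[:3]:
        print(" ".join(phrase), round(phr, 3), round(inf, 3))
```

Summing the two scores is just one simple way to combine them into a single ranking; I don't know how BlogPulse actually weights them.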
Combining these measures, they are able to find phrases and names that are deemed important across the corpus, and the rates at which those change. They also form topics by clustering phrases, using text similarity as the clustering criterion.
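The paper does not spell out the clustering, so the following is only a sketch of how phrases could be grouped into topics by text similarity. It assumes each phrase is represented by a bag of words drawn from the posts it appears in, compares phrases with cosine similarity, and uses a greedy one-pass grouping with a threshold. The representation, the threshold of 0.5 and all names are my own assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_phrases(contexts, threshold=0.5):
    """Greedily group phrases whose context vectors are similar.

    contexts maps a phrase to the list of tokens it co-occurs with.
    A phrase joins the first cluster holding a member at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    vectors = {phrase: Counter(tokens) for phrase, tokens in contexts.items()}
    clusters = []
    for phrase, vec in vectors.items():
        for cluster in clusters:
            if any(cosine(vec, vectors[other]) >= threshold for other in cluster):
                cluster.add(phrase)
                break
        else:
            clusters.append({phrase})
    return clusters


if __name__ == "__main__":
    contexts = {
        "hurricane katrina": "storm flood new orleans relief".split(),
        "fema": "storm flood relief agency".split(),
        "ipod nano": "apple music player launch".split(),
    }
    for topic in cluster_phrases(contexts):
        print(sorted(topic))
```

A one-pass scheme like this depends on the order in which phrases arrive and never merges two existing clusters, so it is a cheap stand-in rather than a proper hierarchical clustering.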
'Blog bites' are created by finding paragraphs that contain a large number of the phrases that constitute a topic.
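As I read it, selecting a blog bite amounts to scoring each paragraph by how many of the topic's phrases it contains and keeping the best one. A minimal sketch under that reading; the case-insensitive substring matching and the function name are my assumptions, not details from the paper.

```python
def blog_bite(paragraphs, topic_phrases):
    """Return the paragraph containing the most distinct topic phrases.

    Returns None when no paragraph mentions any phrase. Ties go to the
    earliest paragraph, because max() keeps the first maximum it sees.
    """
    def hits(paragraph):
        text = paragraph.lower()
        return sum(1 for phrase in topic_phrases if phrase.lower() in text)

    best = max(paragraphs, key=hits, default=None)
    if best is None or hits(best) == 0:
        return None
    return best


if __name__ == "__main__":
    post = [
        "We drove down to the coast last weekend.",
        "FEMA was slow to respond after Hurricane Katrina hit New Orleans.",
        "Hurricane Katrina coverage has been everywhere.",
    ]
    topic = ["hurricane katrina", "fema", "new orleans"]
    print(blog_bite(post, topic))
```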
This paper only investigates finding important phrases for ranking and searching the blogs. Nothing is done to track the topic flow. Sadly, the algorithms they use to find key phrases are barely motivated. The methods they use to find key topics, however, are very useful.
Problem solved: Trend discovery and search.