Difference between revisions of "BlogPulse Review"

Revision as of 04:54, 24 January 2006

BlogPulse: Automated Trend Discovery for Weblogs

by Natalie S. Glance, Matthew Hurst and Takashi Tomokiyo
Reviewed 1/13/2006



BlogPulse maintains a list of blogs to crawl. At the time of the paper the list held 100,000 blogs, and I think it is currently over 1 million. They naively crawl every blog each day rather than trying to predict how often a blogger might update. Using daily and total Lucene indexes of the content, they run several processing algorithms on the corpus to determine key phrases, people and topics.
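To make the daily versus total index idea concrete, here is a minimal in-memory sketch. It is not Lucene and not the authors' code: the class `BlogIndex` and its method names are my own, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import defaultdict


class BlogIndex:
    """Toy stand-in for a pair of indexes: one per day, one cumulative."""

    def __init__(self):
        # day -> term -> set of post ids seen on that day
        self.daily = defaultdict(lambda: defaultdict(set))
        # term -> set of post ids over the whole corpus
        self.total = defaultdict(set)

    def add_post(self, day, post_id, text):
        # Assumption: lowercase whitespace tokens are good enough here.
        for term in set(text.lower().split()):
            self.daily[day][term].add(post_id)
            self.total[term].add(post_id)

    def daily_count(self, day, term):
        # Number of posts on the given day that mention the term.
        return len(self.daily.get(day, {}).get(term.lower(), ()))

    def total_count(self, term):
        # Number of posts in the whole corpus that mention the term.
        return len(self.total.get(term.lower(), ()))
```

Comparing `daily_count` against `total_count` for a term is the kind of lookup the later phrase-ranking steps would need.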

Most algorithms are unmotivated in this paper. They reference 'Analyst Workbench' and I don't know what that entails. Perhaps an earlier paper or an open toolkit? Or their own software?

They find key bi-grams and bursty phrases [Kleinberg], and use suffix trees. They also use the measures of 'phraseness' and 'informativeness' [A Language Model Approach to Keyphrase Extraction].
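As I understand the cited keyphrase paper, both measures are pointwise KL divergence terms: phraseness compares a bi-gram's probability with what its words would give if independent, and informativeness compares its probability in the foreground corpus (say, today's posts) with a background corpus. The sketch below is my own reading, not the authors' implementation. The function names, the add-one smoothing of the background model, and the whitespace-token input are all assumptions.

```python
import math
from collections import Counter


def bigrams(tokens):
    # Adjacent token pairs, in order.
    return list(zip(tokens, tokens[1:]))


def pointwise_kl(p, q):
    # Contribution of a single event to KL(p || q).
    return p * math.log(p / q)


def score_bigrams(foreground, background, smoothing=1.0):
    """Return {bigram: (phraseness, informativeness)} for foreground bigrams."""
    fg_uni = Counter(foreground)
    fg_bi = Counter(bigrams(foreground))
    bg_bi = Counter(bigrams(background))
    n_uni = sum(fg_uni.values())
    n_fg_bi = sum(fg_bi.values())
    n_bg_bi = sum(bg_bi.values())
    vocab_bi = len(set(fg_bi) | set(bg_bi))

    scores = {}
    for (a, b), count in fg_bi.items():
        p_fg = count / n_fg_bi
        # Probability of the pair if the two words were independent.
        p_indep = (fg_uni[a] / n_uni) * (fg_uni[b] / n_uni)
        # Assumption: add-one style smoothing so unseen background
        # bigrams do not produce a division by zero.
        p_bg = (bg_bi[(a, b)] + smoothing) / (n_bg_bi + smoothing * vocab_bi)
        scores[(a, b)] = (pointwise_kl(p_fg, p_indep), pointwise_kl(p_fg, p_bg))
    return scores
```

A bi-gram that scores high on both terms is both a real collocation and specific to the foreground, which seems to be what makes it a 'key phrase' here.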

Combining these measures, they are able to find phrases and names that are deemed important over the corpus, and the rates at which these change. They also form topics by clustering phrases. Text similarity is used to cluster.
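The paper does not spell out the clustering, so the following is only one plausible sketch: each phrase is represented by the words of the posts it appears in, and phrases are grouped greedily by cosine similarity. The single-pass grouping, the 0.5 threshold, and all function names are my assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_vector(phrase, posts):
    # Bag of words over every post that contains the phrase.
    vec = Counter()
    for post in posts:
        text = post.lower()
        if phrase.lower() in text:
            vec.update(text.split())
    return vec


def cluster_phrases(phrases, posts, threshold=0.5):
    """Greedy single-pass clustering: join the first cluster whose
    seed phrase is similar enough, otherwise start a new cluster."""
    vectors = {p: context_vector(p, posts) for p in phrases}
    clusters = []
    for p in phrases:
        for cluster in clusters:
            if cosine(vectors[p], vectors[cluster[0]]) >= threshold:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters
```

Each resulting cluster of phrases would then play the role of a 'topic'.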

'Blog bites' are created by finding paragraphs that contain a large number of the phrases that constitute a topic.
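A rough sketch of that selection step, assuming a topic is just a list of phrases and that a simple substring match counts as a hit. The function name `blog_bites` and the `min_hits` cutoff are hypothetical, not taken from the paper.

```python
def blog_bites(paragraphs, topic_phrases, top_n=1, min_hits=2):
    """Return up to top_n paragraphs that mention the most topic phrases."""
    scored = []
    for index, paragraph in enumerate(paragraphs):
        text = paragraph.lower()
        # Count how many distinct topic phrases occur in this paragraph.
        hits = sum(1 for phrase in topic_phrases if phrase.lower() in text)
        if hits >= min_hits:
            scored.append((hits, index, paragraph))
    # Most hits first; ties keep the original paragraph order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [paragraph for _, _, paragraph in scored[:top_n]]
```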

This paper only investigates finding important phrases for ranking and searching the blogs. Nothing is done to track the topic flow. Sadly, their choice of algorithms for finding key phrases is largely unmotivated. The methods they use to find key topics, however, are very useful.

Problem solved: Trend discovery and search.