BlogResearch current status

From PublicWiki

2/15/06

Realizing that I haven't been doing much concrete work on this, today I implemented a neural network that attempts to predict links [from the older data set]. It seems to be about 90% accurate at predicting correct link-infections when I limit the number of negative examples, and about 84-85% accurate when I include all my data and try to predict a held-out 10% of it. This is of course with separate test and train data sets. I can correctly predict a bit over 95% of the training data, if that means anything (shouldn't it suggest something about the linearity of the dataset?!?).

This data set has about 489,383 positive examples (links infecting) and 3,338,129 negative examples (links not infecting). I learn a neural network for each user, where the inputs are whether each friend has posted the link. There are of course no hidden nodes or other structure, so this basic neural net is the model known as the 'Linear Threshold Model'.
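The per-user setup above can be sketched as a single-layer net (no hidden nodes) trained by gradient descent. The data here is synthetic, and the logistic training rule is an assumption; the original could use any perceptron-style update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one user with 20 friends. Each example is a binary
# vector saying which friends have posted a given link; the label is whether
# this user then posted ("was infected by") the link.
n_friends, n_examples = 20, 1000
X = rng.integers(0, 2, size=(n_examples, n_friends)).astype(float)
true_w = rng.normal(size=n_friends)
y = ((X @ true_w) > 1.0).astype(float)  # synthetic infection labels

# A single-layer net: predict infection iff w.x + b > 0. Trained with plain
# logistic-regression gradient descent on the first 90% of examples.
w = np.zeros(n_friends)
b = 0.0
lr = 0.1
split = int(0.9 * n_examples)
for _ in range(200):
    z = X[:split] @ w + b
    p = 1.0 / (1.0 + np.exp(-z))      # sigmoid
    grad = p - y[:split]
    w -= lr * (X[:split].T @ grad) / split
    b -= lr * grad.mean()

# Evaluate on the held-out 10%.
test_pred = (X[split:] @ w + b) > 0
accuracy = (test_pred == y[split:].astype(bool)).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

Because there are no hidden nodes, the learned decision rule is exactly a weighted threshold over friends' posts, which is why it coincides with the Linear Threshold Model.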

It is lacking in that no relational data about topics/links or about people/users is kept, which should be holding it back. It is also somewhat weak in being a linear model, though this does make it easier to learn, correct? (ask Dan).

I next hope to implement a Bayesian model, which should mimic the 'Independent/General Cascade Model'. See if it does better. See if I have enough data. There are many examples, but over 100,000 users. The word data that I am currently collecting should not have this problem.
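For reference, a minimal simulation of the Independent Cascade Model that the Bayesian model should mimic. The graph and edge probabilities here are made up; in the real task they would be the parameters to learn from the link data.

```python
import random

random.seed(0)

# Hypothetical graph: user -> list of friends, with an independent infection
# probability p[(u, v)] for each edge.
friends = {"a": ["b", "c"], "b": ["c", "d"], "c": ["d"], "d": []}
p = {("a", "b"): 0.5, ("a", "c"): 0.3, ("b", "c"): 0.4,
     ("b", "d"): 0.6, ("c", "d"): 0.2}

def cascade(seeds):
    """One simulated run of the Independent Cascade Model: each newly
    infected user gets a single chance to infect each uninfected friend."""
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in friends[u]:
                if v not in infected and random.random() < p[(u, v)]:
                    infected.add(v)
                    nxt.append(v)
        frontier = nxt
    return infected

# Monte Carlo estimate of how often "d" catches a link seeded at "a".
runs = 10000
hits = sum("d" in cascade({"a"}) for _ in range(runs))
print(f"P(d infected | seed a) ~ {hits / runs:.2f}")
```

The key difference from the linear threshold net is that each edge fires independently, so the per-edge probabilities can be estimated in a Bayesian way from observed infections.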

The third step will be to attempt a statistical relational model. This should incorporate more information and hence better predict user behaviour. It might also be significantly harder to predict users+topics than users+links. We shall see.



1/24/06

Information diffusion through blogspace seems to be closest to what I would like to do, but the authors appear to stop short of attempting to predict information flow. They create a model of influence, but stop there. They find influential users by tracking occurrences of topics via proper nouns, and do so very convincingly.

I also wonder if I should somehow account for general variation in posting frequencies in the bursty code of Kleinberg. Currently there are bursts that are partly caused by the general increase of posts during the day compared to the night. Perhaps I could compress the time intervals to be the inverse of the total word frequency?
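The interval-compression idea might look like this. The hourly post counts are invented; the warp maps wall-clock hours onto a clock that advances in proportion to total posting activity, so a quiet night hour becomes short and a busy afternoon hour becomes long, removing the day/night cycle before burst detection.

```python
# Made-up total posts per hour of day (0:00 through 23:00), showing a
# typical quiet-night / busy-day pattern.
hourly_posts = [120, 80, 60, 50, 70, 150, 400, 700, 900, 850, 800, 750,
                780, 820, 860, 900, 950, 980, 940, 860, 700, 500, 300, 180]

total = sum(hourly_posts)
# Cumulative warped time at the end of each hour: each real hour advances
# the clock by its share of total posting activity.
warped = []
t = 0.0
for c in hourly_posts:
    t += c / total
    warped.append(t)

def warp(hour):
    """Map a real-valued hour-of-day (0 <= hour < 24) onto the
    activity-warped clock, interpolating linearly within the hour."""
    lo = int(hour)
    base = warped[lo - 1] if lo > 0 else 0.0
    return base + (hour - lo) * (warped[lo] - base)

print(warp(3.0), warp(9.0))  # night hours compress, peak hours stretch
```

Measuring inter-arrival gaps in warped time is equivalent to scaling each interval by the inverse of the local overall posting rate, which is one reading of the idea above.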


1/23/06

I am currently collecting data: downloading all the current posts from LiveJournal and saving them in their original form. I have implemented the bursty stream code and have written some code to extract the streams from the post data. Next I plan to automate the process of extracting a word's post-frequency data and determining how bursty it is. Hopefully I will then automatically be able to determine which words are spreading quickly through the blog network.
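As a sketch of what the bursty stream code computes, here is a toy two-state version of Kleinberg's burst automaton (the real model uses an infinite hierarchy of states). It takes a word's inter-arrival gaps, models each state as an exponential arrival rate, and finds the cheapest state sequence by dynamic programming; the parameters s and gamma are illustrative.

```python
import math

def bursts(gaps, s=2.0, gamma=1.0):
    """Two-state burst automaton: state 0 emits gaps at the base rate,
    state 1 at rate s times higher; switching up (0 -> 1) costs gamma,
    switching down is free. Returns the cheapest (Viterbi) state sequence,
    with 1 marking bursty stretches."""
    n = len(gaps)
    base = n / sum(gaps)              # overall arrival rate
    rates = [base, s * base]

    def cost(x, a):
        # negative log-likelihood of gap x under exponential rate a
        return -math.log(a) + a * x

    best = [0.0, gamma]               # cost of starting in each state
    back = []
    for x in gaps:
        # trans[j][i] = cost of having been in state i, now entering j
        trans = [[best[0], best[1]],
                 [best[0] + gamma, best[1]]]
        new, ptr = [], []
        for j in (0, 1):
            i = 0 if trans[j][0] <= trans[j][1] else 1
            new.append(trans[j][i] + cost(x, rates[j]))
            ptr.append(i)
        best = new
        back.append(ptr)

    # trace back the cheapest path
    j = 0 if best[0] <= best[1] else 1
    states = []
    for ptr in reversed(back):
        states.append(j)
        j = ptr[j]
    return states[::-1]

# Ten slow gaps, ten fast gaps, ten slow again: the fast middle run
# should be flagged as a burst.
gaps = [5.0] * 10 + [0.5] * 10 + [5.0] * 10
print(bursts(gaps))
```

Extending this to a hierarchy of states with geometrically increasing rates recovers the full automaton; the burstiness of a word then falls out of how much time its stream spends in the high states.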

Blog_Results_1