Generative Model To Construct Blog and Post Networks In Blogosphere


Page 1: Generative Model To Construct Blog and Post Networks  In Blogosphere

Master's Thesis Defense
Amit Karandikar

Advisor: Dr. Anupam Joshi
Committee: Dr. Finin, Dr. Yesha, Dr. Oates

Date: 1st May 2007
Time: 9:30 am
Place: ITE 325B

http://prefuse.org/gallery/

Page 2

Outline

• Introduction

• Motivation

• Thesis Contribution

• Interactions in Blogosphere

• Proposed Model

• Experiments and Results

• Conclusion

Page 3

Introduction

Generative model: A model that generates the observed data, randomly or systematically, from a set of parameters. The parameters may be latent or supplied as input to the model.

Blogosphere: The collective term for all blogs, linked together to form a community or social network.

Blog network: The network formed by treating each blog as a single node.

Post network: The network formed by treating each post as a node, ignoring its parent blog.

[Figure: example blog nodes finin.livejournal.com, joshi.blogspot.com, oates.myspace.com, yesha.blogspot.com]
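The two network levels defined on this slide can be related by a small sketch. It assumes that a blog-level link exists whenever a post of one blog links to a post of another, and that links between posts of the same blog are dropped; neither rule is stated on the slide. The function name and the post identifiers are hypothetical.

```python
def blog_network(post_links, post_to_blog):
    """Collapse post-to-post links into weighted blog-to-blog links.

    post_links:   iterable of (source_post, destination_post) pairs
    post_to_blog: dict mapping each post id to its parent blog
    Returns a dict {(source_blog, destination_blog): number_of_post_links}.
    """
    blog_links = {}
    for src_post, dst_post in post_links:
        src_blog = post_to_blog[src_post]
        dst_blog = post_to_blog[dst_post]
        if src_blog == dst_blog:
            # Assumption: links within a single blog are ignored.
            continue
        key = (src_blog, dst_blog)
        blog_links[key] = blog_links.get(key, 0) + 1
    return blog_links


# Hypothetical posts on two of the blogs named in the figure.
post_to_blog = {
    "p1": "finin.livejournal.com",
    "p2": "joshi.blogspot.com",
    "p3": "joshi.blogspot.com",
}
post_links = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
print(blog_network(post_links, post_to_blog))
```

Keeping a count per blog pair preserves how many post-level links stand behind each blog-level link, which is lost if the blog network is stored as a plain edge set.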

Page 4

Basics ..

Graphs are everywhere .. and so are Power laws!!

Internet Mapping Project [lumeta.com]

Friendship Network [Moody ‘01]

In simple words, a power law can be explained by the “rich get richer” phenomenon, or “20% of the population holds 80% of the wealth”.

Considering the web as a graph:

Scale-free network: structure and properties are independent of network size

A few high-connectivity nodes (hubs)

http://www.prefuse.org/gallery/

Properties of interest (graph theory)

Average degree of node, degree distribution, degree correlation, distribution of strongly/weakly connected components, clustering coefficient and reciprocity
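Two of the properties listed above are simple enough to sketch on a directed edge list. The definitions used here are common ones and are assumptions, since the slide does not define them: average degree is taken as links per node, and reciprocity as the fraction of links whose reverse link also exists.

```python
def average_degree(edges):
    """Links per node in a directed graph given as (source, target) pairs."""
    edge_set = set(edges)
    nodes = {node for edge in edge_set for node in edge}
    return len(edge_set) / len(nodes) if nodes else 0.0


def reciprocity(edges):
    """Fraction of directed links (u, v) for which (v, u) is also present."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if u != v and (v, u) in edge_set)
    return mutual / len(edge_set)


edges = [("a", "b"), ("b", "a"), ("a", "c")]
print(average_degree(edges))  # 3 links over 3 nodes
print(reciprocity(edges))     # 2 of the 3 links are reciprocated
```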

Page 5

Motivation

Why simulate blog graphs?

• Reduce time to generate data
  - crawling the blogosphere over a few weeks
  - sampling the right blogs to get a representative sample

• Reduce time in preprocessing and data cleaning
  - removing links pointing outside the dataset or outside the time frame
  - splog removal [1]

• Generate graphs of different properties/sizes
  - average degree of node, degree distributions

• Testing of new algorithms for blog graphs
  - e.g. spread of influence in blogosphere [2], community detection [3]

• Extrapolation
  - how will fast growth affect the blogosphere properties?
  - how does this affect the connected components?

[1] Kolari et al., “SVMs for the blogosphere: Blog identification and splog detection,” in AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.

[2] Java et al., “Modeling the spread of influence on the blogosphere,” tech. rep., University of Maryland, Baltimore County, March 2006.

[3] Lin et al “Discovery of Blog Communities based on Mutual Awareness

Page 6

Thesis Contribution

1. Propose a generative model for the blog-blog network that uses preferential attachment and uniform random attachment to model the interactions among bloggers.

2. Generate the post-post network as part of the generative model for blog graphs.

3. Compare the properties of the simulated blog and post networks with the properties observed in the available real blog datasets.

Datasets

Workshop on the Weblogging Ecosystem (WWE 2006)
http://weblogging2006.blogspot.com/

International Conference on Weblogs and Social Media (ICWSM 2007)
http://ebiquity.umbc.edu/blogger/icwsm-2007-blogs-dataset/

Page 7

Why are existing models not enough?

Erdős–Rényi random model

Barabási–Albert preferential attachment

Web model

Preferential attachment: the likelihood of linking to a popular website is higher

[1] M. Newman, “The structure and function of complex networks,” 2003

[3] R. Albert, Statistical mechanics of complex networks. PhD thesis, 2001.

[7] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst, “Cascading behavior in large blog graphs”, ICWSM, 2007

[32] X. Shi, B. Tseng, and L. Adamic, “Looking at the blogosphere topology through different lenses” ICWSM, 2007

• Two-level network: blog and post level

• Inlinks and outlinks to and from posts

• NEED to model blogger interactions
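Preferential attachment, as described on this slide, can be contrasted with uniform random attachment in a few lines. This is a minimal sketch, not the thesis implementation: the function name, the `uniform_prob` parameter, and the "+1" smoothing (so that nodes with no inlinks can still be chosen) are all assumptions.

```python
import random


def pick_target(indegree, rng, uniform_prob=0.0):
    """Choose a node to link to.

    indegree:     dict mapping node -> current number of inlinks
    rng:          a random.Random instance
    uniform_prob: probability of ignoring popularity and picking uniformly
    """
    nodes = list(indegree)
    if rng.random() < uniform_prob:
        # Uniform random attachment: every node is equally likely.
        return rng.choice(nodes)
    # Preferential attachment: probability proportional to indegree + 1.
    weights = [indegree[node] + 1 for node in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


rng = random.Random(42)
indegree = {"popular": 98, "obscure": 0}
picks = [pick_target(indegree, rng) for _ in range(2000)]
print(picks.count("popular") / len(picks))  # close to 99/100 under these weights
```

Setting `uniform_prob=1.0` turns the same function into a purely random chooser, which makes it easy to compare the two attachment rules on the same graph.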

Page 8

Interactions in blogosphere

• Interesting findings from the PEW Internet survey [1]
  - Blog writers are enthusiastic blog readers
  - Most bloggers post infrequently
  - Linking in the neighborhood (friends' blogs, blogroll): preferential or random?

• Bloggers tend to link to some (how many?) of the posts that they read recently (often preferentially, sometimes randomly)

• Is popularity (inlinks) proportional to blogger activity (outlinks)? [NO] [2]

[1] A. Lenhart and S. Fox, “Bloggers: A portrait of the internet’s new storytellers.”

[2] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst, “Cascading behavior in large blog graphs”, ICWSM 2007

Model parameters

Page 9

Model Parameters

1. Probability of random reads (rR)

2. Probability of randomly selecting a writer (rW)

3. Probability that a new node does not link to the existing network (pD)

4. Growth exponent (g): how many links should be added every step?
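The four parameters can be grouped in a small configuration object. This is only a sketch: the class name is hypothetical, the range checks are assumptions (the three probabilities must lie in [0, 1]; requiring g > 0 is a guess, since the slide does not give its range), and all example values except rW = 0.35 are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParams:
    rR: float  # probability of random reads
    rW: float  # probability of randomly selecting a writer
    pD: float  # probability that a new node stays disconnected
    g: float   # growth exponent

    def __post_init__(self):
        for name in ("rR", "rW", "pD"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be a probability, got {value}")
        if self.g <= 0:
            # Assumption: a non-positive growth exponent is not meaningful.
            raise ValueError(f"g must be positive, got {self.g}")


params = ModelParams(rR=0.2, rW=0.35, pD=0.1, g=1.1)
print(params)
```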

Page 10

Proposed Model: Blog view

1. Add new blog node
   - Should I link to someone? If yes, who? >> Preferentially, based on indegree of node
   - I will not link to anyone!

2. Select writer
   - Writer selection: randomly? OR >> preferentially, based on outdegree?

3. Writers read blog posts, write posts
   - Should I read randomly or preferentially?

Random writer, random destination

Reciprocal links

Strongly connected components: subsets of nodes having a directed path from every node to every other node

Weakly connected components

Information flow

[Figure: network at Step=1 and Step=2; example nodes michellemalkin, dailykos]
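The three steps of the blog view can be sketched as a loop. This is one possible reading of the slide, not the thesis code: each step adds one blog and lets one writer create one link, the "+1" weight smoothing is an assumption, and the growth exponent g is left out because the slides do not say how it sets the number of links per step. All function and variable names are hypothetical.

```python
import random


def simulate_blog_network(steps, rR, rW, pD, seed=0):
    """Grow a directed blog graph; returns (edges, indegree, outdegree)."""
    rng = random.Random(seed)
    indeg, outdeg = {0: 0}, {0: 0}  # start from a single blog, node 0
    edges = set()

    def pick(weight_by_node, random_prob, exclude=None):
        nodes = [n for n in weight_by_node if n != exclude]
        if not nodes:
            return None
        if rng.random() < random_prob:
            return rng.choice(nodes)  # random choice
        weights = [weight_by_node[n] + 1 for n in nodes]
        return rng.choices(nodes, weights=weights, k=1)[0]  # preferential

    def add_link(src, dst):
        if dst is None or src == dst or (src, dst) in edges:
            return
        edges.add((src, dst))
        outdeg[src] += 1
        indeg[dst] += 1

    for new in range(1, steps + 1):
        # Step 1: add a new blog; with probability pD it links to no one.
        target = pick(indeg, rR) if rng.random() >= pD else None
        indeg[new], outdeg[new] = 0, 0
        add_link(new, target)
        # Step 2: select a writer, randomly (rW) or by outdegree.
        writer = pick(outdeg, rW)
        # Step 3: the writer reads a blog, randomly (rR) or by indegree,
        # and writes a post that links to it.
        add_link(writer, pick(indeg, rR, exclude=writer))
    return edges, indeg, outdeg


edges, indeg, outdeg = simulate_blog_network(1000, rR=0.2, rW=0.35, pD=0.1)
print(len(indeg), "blogs,", len(edges), "links")
```

Only rW = 0.35 in the usage line comes from the slides; the other values are placeholders for experimentation.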

Page 11

Proposed Model: Post view

[Figure: Blogger A and Blogger B, each with numbered posts (Post 1, Post 2, Post 3), linked at the post level. Callout: Number of links?]

Page 12

Growth of blog graphs: Densification

Densification [1] has been observed in various real networks, including the blogosphere

Number of edges grows faster than number of nodes: superlinear growth function

Reciprocity and clustering coefficient increase with growth exponent

Average degree increases with growth (evolution time)

[1] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst, “Cascading behavior in large blog graphs”, ICWSM 2007
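A short derivation shows why superlinear edge growth implies a rising average degree. It assumes the growth takes a power-law form with an exponent written here as a; the symbol is hypothetical, and the slide does not say whether it coincides with the model's growth exponent g. E(t) and N(t) denote the number of edges and nodes at time t.

```latex
E(t) \propto N(t)^{a}, \quad a > 1
\;\Longrightarrow\;
\bar{d}(t) = \frac{E(t)}{N(t)} \propto N(t)^{a-1}
```

Because a - 1 > 0, the average degree grows as nodes are added, which matches the observation that average degree increases with evolution time.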

Page 13

Properties of simulated blog network

Page 14

Properties of simulated post network

Page 15

Blogosphere: Blog Inlinks distribution

The blogosphere follows a power law distribution for blog inlinks and outlinks, post inlinks and outlinks, component sizes, posts per blog, size of cascades …

Power law distribution

Slope = -2.07

Very few blog nodes have very high inlinks

Large number of blog nodes have very few inlinks
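A slope like the one quoted on this slide can be roughly estimated from an edge list by fitting a straight line to the inlink distribution on log-log axes. The sketch below uses ordinary least squares, which is a simple estimate and not necessarily the method behind the reported slope; the function names are hypothetical.

```python
import math
from collections import Counter


def indegree_distribution(edges):
    """Map each inlink count k >= 1 to the number of nodes with k inlinks."""
    indeg = Counter(dst for _, dst in edges)
    return Counter(indeg.values())


def loglog_slope(distribution):
    """Least-squares slope of log10(count) against log10(degree)."""
    points = [(math.log10(k), math.log10(c))
              for k, c in distribution.items() if k > 0 and c > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic distribution: the count falls by 100x for every 10x in degree.
print(loglog_slope({1: 10000, 10: 100, 100: 1}))
```

Nodes with zero inlinks are left out of the distribution because they cannot be placed on a logarithmic axis.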

Page 16

Simulation: Blog Inlinks distribution

Similar curves are observed for properties of the simulated blog and post networks

Power law distribution

Slope = -1.72

Page 17

Power law distributions for various network sizes

Similar shape of curves for degree distributions as observed by Shi et al [1] in the “real” blogosphere.

[1] X. Shi, B. Tseng, and L. Adamic, “Looking at the blogosphere topology through different lenses,” in ICWSM, 2007

Page 18

Hop plot

Average neighborhood size vs. hop count

Hop plot shows the reachability of nodes in the network

After N hops, the hop plot becomes constant

Comparison of hop plots for ICWSM, WWE and Blogosphere (650K blog nodes, 1.4 million links)

Reachability?

pD = probability that new node remains disconnected
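A hop plot of the kind described above can be computed with a breadth-first search from every node. The sketch assumes that links are followed in their own direction and that a node's neighborhood excludes the node itself; the slide states neither choice. The function name is hypothetical.

```python
from collections import deque


def hop_plot(edges, max_hops):
    """Average number of nodes reachable within h hops, for h = 1..max_hops."""
    adjacency, nodes = {}, set()
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        nodes.update((u, v))
    if not nodes:
        return [0.0] * max_hops
    totals = [0] * (max_hops + 1)
    for start in nodes:
        # Breadth-first search, truncated at max_hops.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue
            for v in adjacency.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        reached_at = [0] * (max_hops + 1)
        for d in dist.values():
            if d > 0:
                reached_at[d] += 1
        running = 0
        for h in range(1, max_hops + 1):
            running += reached_at[h]  # cumulative neighborhood size
            totals[h] += running
    return [totals[h] / len(nodes) for h in range(1, max_hops + 1)]


# On a 4-node chain the curve flattens once every reachable node is counted.
print(hop_plot([("a", "b"), ("b", "c"), ("c", "d")], max_hops=4))
```

Running one search per node is quadratic in the worst case, so for graphs of the size mentioned on the slide a sampled set of start nodes would be a more practical variant.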

Page 19

Simulation: Scatter plot and degree correlations

Correlation Coefficients

ICWSM: 0.056

WWE: 0.02

Simulation: 0.1

A correlation coefficient close to zero means there is NO definite relation between the indegree and outdegree of blog nodes

Random writers (rW) help to model the low correlation coefficient

Scatter plot regions: popular blogs (high inlinks); popular avid writers (high inlinks and outlinks); avid writers (high outlinks)

BA model: correlation coefficient = 1
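The degree correlation reported here can be sketched as a Pearson correlation between each node's indegree and outdegree. Pearson is an assumption, since the slide does not name the coefficient, and returning 0.0 when either degree has no variance is a convention chosen for this sketch.

```python
import math


def degree_correlation(edges):
    """Pearson correlation between indegree and outdegree over all nodes."""
    indeg, outdeg, nodes = {}, {}, set()
    for u, v in edges:
        outdeg[u] = outdeg.get(u, 0) + 1
        indeg[v] = indeg.get(v, 0) + 1
        nodes.update((u, v))
    if not nodes:
        return 0.0
    xs = [indeg.get(n, 0) for n in nodes]
    ys = [outdeg.get(n, 0) for n in nodes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation undefined; reported as 0 by convention here
    return cov / math.sqrt(var_x * var_y)


# Every link reciprocated: indegree equals outdegree for every node.
star = [("hub", "a"), ("a", "hub"), ("hub", "b"), ("b", "hub"),
        ("hub", "c"), ("c", "hub")]
print(degree_correlation(star))
```

When every link is reciprocated, indegree equals outdegree at each node, so the coefficient reaches its maximum of 1; a hub that only links outward pushes it toward -1.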

Page 20

Distribution of SCC in blog and post network (WWE and Simulation)

Community detection and influence modeling use connected components

Page 21

Distribution of WCC in post network (WWE and Simulation)

Power law distribution in WCC for post network
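The sizes of the weakly connected components, whose distribution is plotted on this slide, can be obtained by ignoring link direction and merging linked nodes with a union-find structure. This is a generic sketch of that technique, not the code used for the thesis; the function name is hypothetical.

```python
from collections import Counter


def wcc_sizes(edges):
    """Sizes of weakly connected components, largest first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # direction is ignored: just merge
    sizes = Counter(find(node) for node in list(parent))
    return sorted(sizes.values(), reverse=True)


sizes = wcc_sizes([(1, 2), (2, 3), (4, 5)])
print(sizes)           # component sizes
print(Counter(sizes))  # size -> number of components of that size
```

The final `Counter(sizes)` is the size distribution itself, which is what would be drawn on log-log axes to check for a power law.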

Page 22

Simulation: Posts per blog distribution

Posts per blog also follows a power law distribution [1]

[1] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst, “Cascading behavior in large blog graphs”, ICWSM 2007

Power law distribution

Slope = -1.71

Page 23

Effect of increase in blogs

Degree distributions almost the same

Reciprocity increases

Average degree increases

Clustering coefficient and reciprocity of the post network are much lower than those of the blog network

Page 24

Effect of parameters

Random reads (rR), random writers (rW), disconnected nodes (pD)

Increasing rR (random reads) decreases reciprocity because it reduces the likelihood of getting a reverse link

Empirically, rW = 0.35 (random writers) gives a low degree correlation and values for other parameters similar to those of the blogosphere

Increasing pD reduces the size of the largest WCC

Page 25

Conclusion

1. The simulation resembles the blogosphere in degree distributions, degree correlations, reciprocity, average degree, clustering coefficient, and component distribution for blog and post networks.

2. The simulated post network is sparse compared to the blog network, and posts per blog follows a power law distribution, as observed in the blogosphere.

3. Useful tool for analysis of the blogosphere, testing new algorithms, and extrapolation (how will an increase in X affect some Y?)

Page 26

Future work

• Can we model buzz and popularity in the post network?

• What is the effect of buzz on the properties of the network?

• In-depth temporal analysis of evolving blog graphs

• Can we enrich the model with topical information?

• How can we model the blogroll?

Page 27

Questions?

Thank you!

Acknowledgements: Advisor, committee members, coauthors, friends at UMBC

Data: BlogPulse, ICWSM, WWE