Twitter LDA

Entity and Link annotation in Online Social Networks Karan Kurani & Akshay Bhat CS 6740 Fall 2010 Project at Cornell University


http://www.akshaybhat.com/LDA

Transcript of Twitter LDA

Page 1: Twitter LDA

Entity and Link annotation in Online Social Networks

Karan Kurani & Akshay Bhat

CS 6740 Fall 2010 Project at Cornell University

Page 2: Twitter LDA

Overview

• Introduction
• Prior work
• Methodology
• Datasets and Implementation
• Results
• Discussion
• Future Work

Page 4: Twitter LDA

Introduction

Motivation: We are interested in how a social network, together with the textual information associated with its entities, can be modeled to gain insight.

Goal: To create a model for annotating entities and links using:

• The social network
• Textual content

Dataset: Social network of 36 million Twitter users and 450 million tweets.

Applications: Targeted advertising, friend suggestions, etc.

Page 6: Twitter LDA

Prior Work

Large-scale analysis of user behavior on Twitter: "What is Twitter, a Social Network or a News Media?" by Kwak et al.

Studies the propagation of information through the network via retweets in order to determine user influence.

"Automatic generation of personalized annotation tags for Twitter users" by Wu et al.

Uses TF-IDF weights to assign tags to each user, using textual information alone.

Page 7: Twitter LDA

Prior Work: Motivation

"Connections between the lines: Augmenting social networks with text" by Chang et al.

Using Wikipedia and the Bible, both annotated with entities, the authors construct a network between entities together with a topic model.

Page 8: Twitter LDA

Prior Work: Other models

"Block-LDA: Jointly modeling entity-annotated text and entity-entity links" by Cohen et al. (protein-protein interaction dataset)

Predefined undirected network, with text associated with each node.

Page 9: Twitter LDA

Prior Work: Other models

"Topic-link LDA: joint models of topic and author communities" by Liu et al.

A corpus of academic publications is modeled with a Bayesian hierarchical topic model, to find the topics within those papers as well as the communities of authors.

Page 11: Twitter LDA

Methodology: Overview

Community Detection
• Detect communities of users in the social network
• Use the label propagation algorithm
• Only the network information (who follows whom) is used
• Communities are detected at various levels
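The community-detection step can be illustrated with a small single-machine sketch of label propagation. This is not the Hadoop implementation used in the project: it is a synchronous variant on an undirected toy graph, and the function name `label_propagation`, the rule of breaking ties by the smallest label, and the choice to treat follow edges as undirected are all assumptions made for illustration.

```python
from collections import Counter


def label_propagation(edges, iterations=15):
    """Synchronous label propagation on an undirected graph.

    edges: iterable of (u, v) pairs. Returns {node: community_label}.
    Ties are broken by taking the smallest label (an assumption,
    chosen only to keep this sketch deterministic).
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Every node starts in its own community.
    labels = {node: node for node in neighbors}

    for _ in range(iterations):
        new_labels = {}
        for node, nbrs in neighbors.items():
            counts = Counter(labels[n] for n in nbrs)
            # The node also votes for its own label, which helps damp
            # oscillation in synchronous updates.
            counts[labels[node]] += 1
            best = max(counts.values())
            new_labels[node] = min(l for l, c in counts.items() if c == best)
        if new_labels == labels:
            break  # converged
        labels = new_labels
    return labels


# Two triangles joined by a single edge.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
communities = label_propagation(toy_edges)
```

On the toy graph the two triangles end up with different labels. Stopping after fewer iterations generally leaves more, smaller groups, which is one way to read "communities at various levels".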

LDA
• Each community is considered as a corpus
• All tweets by a single entity are considered as a single document
• Only the textual information is used
• Stemming, stop word removal and rare word removal are applied
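A minimal sketch of how one document per user might be assembled before running LDA. The tokenizer, the tiny stop list, and the `min_count` threshold are hypothetical choices, not values from the project, and stemming is left out because it would need an external stemmer.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop list; a real run would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "on"}


def build_documents(tweets, min_count=2):
    """Merge each user's tweets into one token list.

    tweets: iterable of (user, text) pairs.
    min_count: words seen fewer times in the whole corpus are dropped
               (hypothetical threshold for rare word removal).
    """
    docs = defaultdict(list)
    for user, text in tweets:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        docs[user].extend(t for t in tokens if t not in STOP_WORDS)

    # Count words over the whole corpus, then drop the rare ones.
    corpus_counts = Counter(t for tokens in docs.values() for t in tokens)
    return {
        user: [t for t in tokens if corpus_counts[t] >= min_count]
        for user, tokens in docs.items()
    }


sample_tweets = [
    ("alice", "The bike ride in Portland"),
    ("alice", "Portland beer"),
    ("bob", "Bike to the brewery"),
]
documents = build_documents(sample_tweets)
```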

Page 12: Twitter LDA

Methodology: Generating annotations

• A topic is considered relevant to a user if its probability for that user exceeds 0.05.
• Users are annotated using the topics generated by the LDA model.
• For a link between two users, we take the intersection of the topics generated for the two users forming the link.
• We also detect general topics by comparing against the topics generated for users selected at random from the whole network (not from the community).
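The annotation rule can be sketched in a few lines. The function names and the example probability vectors are illustrative; only the 0.05 threshold and the intersection rule come from the slide.

```python
def annotate_user(topic_probs, threshold=0.05):
    """Return the ids of topics whose probability exceeds the threshold."""
    return {k for k, p in enumerate(topic_probs) if p > threshold}


def annotate_link(topic_probs_u, topic_probs_v, threshold=0.05):
    """Annotate a link with the topics relevant to both of its endpoints."""
    return annotate_user(topic_probs_u, threshold) & annotate_user(
        topic_probs_v, threshold
    )


# Made-up topic distributions over four topics for two users.
user_u = [0.60, 0.30, 0.04, 0.06]
user_v = [0.02, 0.50, 0.40, 0.08]

topics_u = annotate_user(user_u)             # {0, 1, 3}
topics_v = annotate_user(user_v)             # {1, 2, 3}
link_topics = annotate_link(user_u, user_v)  # {1, 3}
```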

Page 13: Twitter LDA

Methodology: Evaluation
• Select a single community
• Generate the LDA model from all users within that community
• Generate topic probabilities using the model for a set of randomly selected users
• A classifier (linear SVM) is used to discriminate between users in the community and randomly selected users
• Measure accuracy, precision and recall
• Repeat the above procedure for different communities
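The three reported measures can be computed as below. The sketch covers only the metrics and does not train the linear SVM. It assumes label 1 means "user in the community" and label 0 means "randomly selected user", and the example labels are made up.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Four community users followed by four random users (made-up predictions).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
accuracy, precision, recall = evaluate(y_true, y_pred)
```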

Page 15: Twitter LDA

Dataset
• Two types of data sets:
  • Network:
    • Twitter follower network of 36 million users collected in June 2009.
    • Users with more than 900 followers are removed to uncover the underlying social network.
  • Textual data:
    • 450 million tweets from 20 million users.
    • Collected from June 2009 to December 2009.
    • Covers ~20-30% of all public tweets from the above time period.
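The follower-count filter could look like the following sketch. The edge representation as (follower, followee) pairs and the choice to drop every edge touching a removed user are assumptions; the slide states only the 900-follower cutoff.

```python
from collections import Counter


def drop_high_degree_users(follow_edges, max_followers=900):
    """Remove users with more than max_followers followers.

    follow_edges: iterable of (follower, followee) pairs.
    Returns (kept_edges, removed_users).
    """
    follow_edges = list(follow_edges)
    follower_counts = Counter(followee for _, followee in follow_edges)
    removed = {u for u, n in follower_counts.items() if n > max_followers}
    kept = [
        (a, b) for a, b in follow_edges if a not in removed and b not in removed
    ]
    return kept, removed


# Toy example with a cutoff of 2 instead of 900.
toy = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "a")]
kept_edges, removed_users = drop_high_degree_users(toy, max_followers=2)
```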

Page 16: Twitter LDA

Implementation
• Community Detection
  • Communities are detected using the label propagation algorithm
  • Implemented on the Cornell Web Lab Hadoop cluster
  • 15 iterations are performed
  • Communities from the 7th and 15th iterations are considered
• LDA
  • LingPipe: a Java package which provides an implementation of LDA.
  • Uses collapsed Gibbs sampling to infer the topic distributions.
  • Stop word removal, stemming and rare word removal are performed.
  • Number of topics: 50.
  • Topic prior: 0.02, word prior: 0.001, number of samples: 2000, burn-in epochs: 100.
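To make the sampling step concrete, here is a minimal pure-Python collapsed Gibbs sampler for LDA. It is not LingPipe's API or code: the function `lda_gibbs` and the toy corpus are illustrative, and it returns estimates from the final sample only instead of averaging samples after burn-in. The default priors mirror the values on the slide; the number of sweeps is reduced for the toy example.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.02, beta=0.001, num_samples=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of token lists. Returns (doc_topic, topic_word, vocab), where
    doc_topic[d][k] estimates P(topic k | doc d) and
    topic_word[k][v] estimates P(word vocab[v] | topic k).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    n_dk = [[0] * K for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total tokens per topic
    z = []                              # topic assignment of every token

    # Random initial assignment.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(num_samples):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Full conditional P(z = j | all other assignments).
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][v] + beta) / (n_k[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(n_kw[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
        for k in range(K)
    ]
    return doc_topic, topic_word, vocab


toy_docs = [["bike", "ride", "bike"], ["beer", "brew", "beer"], ["bike", "beer"]]
doc_topic, topic_word, vocab = lda_gibbs(toy_docs, num_topics=2, num_samples=50)
```

Each row of `doc_topic` is the per-user topic distribution that the annotation step thresholds.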

Page 18: Twitter LDA

Results - Topics

@18~114706

Word             Count in Topic  Probability  Binomial Z Coefficient
bdutt            2516            0.02         43.2
cli              1753            0.02         36.2
gs               1775            0.02         35.9
sardesairajdeep  825             0.01         25.2
virsanghvi       949             0.01         24.9
bjp              1256            0.01         24
kanchangupta     574             0.01         23.6
pritishnandy     1158            0.01         22.4
shashitharoor    1320            0.01         22.3
ndtv             1083            0.01         21.9
acorn            592             0.01         20.2
journalism       559             0.01         19.9
thecomicproject  725             0.01         19.4
venkatananth     631             0.01         16.7
govt             900             0.01         14.8
media            1255            0.01         14.1
hindu            586             0.01         10.7
sir              561             0.01         10.4
china            592             0.01         9.3
indian           1245            0.01         8.1

@17~417938

Word             Count in Topic  Probability  Binomial Z Coefficient
ll               6447            0.02         54.2
oh               5429            0.01         51
ve               5244            0.01         44.9
don              6362            0.02         39.3
haha             2542            0.01         37.6
re               4744            0.01         36.1
yeah             3107            0.01         32.3
ye               3258            0.01         27.7
lol              4379            0.01         27.4
love             4705            0.01         24.6
thank            3986            0.01         22
mean             2143            0.01         19.8
tweet            3544            0.01         16.9
friend           2554            0.01         16.9
feel             2668            0.01         15.6
call             2533            0.01         15
look             3366            0.01         13.4
try              2222            0.01         12.3
people           2351            0.01         9.2
watch            1938            0.01         5.4
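The slides do not define the "Binomial Z Coefficient" column. A plausible reading, assumed here, is the standard binomial z-score: how far a word's count in a topic lies above the count expected from its corpus-wide rate, in standard deviations. The function name and the example numbers below are hypothetical, since the corpus totals behind the tables are not given.

```python
import math


def binomial_z(word_count_in_topic, topic_total, word_count_in_corpus, corpus_total):
    """z-score of a word's count in a topic under a binomial null model.

    The null model assumes each of the topic's tokens is this word with
    probability equal to the word's corpus-wide relative frequency.
    """
    p = word_count_in_corpus / corpus_total
    expected = topic_total * p
    std_dev = math.sqrt(topic_total * p * (1 - p))
    return (word_count_in_topic - expected) / std_dev


# Hypothetical counts: the word makes up 1% of the corpus but 5% of the topic.
z_score = binomial_z(50, 1000, 100, 10000)
```

Under this reading, a large positive value marks a word that is far more frequent in the topic than in the corpus as a whole.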

Page 19: Twitter LDA

Results: User distribution over number of topics

[Figure: two bar charts, "Frequency of topics (India)" and "Frequency of topics (Portland)". Each compares the community (india, portland) with random users; x-axis ticks run from 1 to 15, y-axis from 0 to 25000.]

Page 20: Twitter LDA

Results: Topic Distribution

[Figure: bar chart "Topic Distribution (Portland)", comparing portland with random users across the topics; x-axis ticks run from 1 to 49, y-axis from 0 to 6000.]

Page 21: Twitter LDA

Results: Topic Distribution

[Figure: bar chart "Topic Distribution (India)", comparing india with random users across topics 1 to 50; y-axis from 0 to 30000.]

Page 22: Twitter LDA

Results: Portland Specific

Ratio  freq Portland  freq random  Annotation
2.98   229            683          loopt, pax, bedroom, condo
3.8    297            1130         brew, beer, coffee
3.95   93             367          bike, ride
4.15   967            4011         vegan, lucasd
4.69   122            572          portland, pdx
4.98   444            2209         Fb
5.32   117            622          Food
5.54   369            2044         fm, blip, wcpdx, radio
5.61   110            617          mayorsamadam, election campaign
5.68   209            1187         random users
5.74   466            2673         local journalism
5.83   347            2024         game – portland
6.26   1930           12074        Food
6.35   1753           11133        head, dinner, lunch
6.77   185            1252         tomorrow, morning, weekend, rain
6.78   101            685          Could not figure out
6.87   814            5595         business, market, home
7.09   101            716          local random
7.38   65             480          arts classic museum
7.4    469            3469         beach love happy
7.65   139            1064         meter, accuracy, church, christian, god
8      1961           15684        random words
8.1    461            3733         store sale holiday fashion
8.16   3047           24864        don, yeah, bad, actually
8.25   314            2590         twitpic
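The slides do not define the Ratio column, but the numbers are consistent with freq random divided by freq Portland (for example 683 / 229 ≈ 2.98). Under that reading, a lower ratio marks a topic that is relatively more frequent in the community than among random users. The sketch below recomputes the ratio for three rows of the table; the function name is illustrative.

```python
def specificity_ratio(freq_community, freq_random):
    """Ratio of random-user frequency to community frequency for one topic."""
    return freq_random / freq_community


# (freq Portland, freq random, annotation) copied from the table above.
rows = [
    (3047, 24864, "don, yeah, bad, actually"),
    (229, 683, "loopt, pax, bedroom, condo"),
    (297, 1130, "brew, beer, coffee"),
]

# Most community-specific topics first (smallest ratio).
ranked = sorted(rows, key=lambda r: specificity_ratio(r[0], r[1]))
ratios = [round(specificity_ratio(c, r), 2) for c, r, _ in ranked]
```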

Page 23: Twitter LDA

Results: Portland General

Ratio  freq Portland  freq random  Annotation
8.28   121            1002         can't figure out
8.33   580            4832         health, obama, reform
8.6    528            4541         oregon, win
8.73   609            5318         iphone, app, google, code
9      3503           31529        twitpic, look, guy Common
9.83   471            4631         social, media
9.96   579            5767         music, band, album
10.08  318            3204         blog post photo
10.16  880            8940         avatar, iran, internet
10.47  1026           10741        Design, site, web
10.94  103            1127         tumblr, photo, flickr
11.01  314            3458         Curse Words
11.08  1583           17539        kid, life, family
11.36  2646           30065        love hate people gonna awesome tonight
13.81  362            4999         book, read, film, comic
14.13  168            2374         video, upload, youtube
15.93  121            1928         music artists
16.64  61             1015         random usernames
17.7   94             1664         health care
18.09  135            2442         myloc, mypict
20.46  316            6466         timber, timberrfc, wood
25.17  232            5839         random users, twilight, maybe teen users
29.04  618            17945        generic words
47.08  141            6638         Hawaii
60.78  117            7111         Horoscope

Page 24: Twitter LDA

Results: India Specific

Ratio  freq India  freq random  Annotation
0.54   772         419          banglore
0.56   1539        859          politics journalists
0.85   1029        878          tw linux
0.88   1186        1040         sliverlight, erp, software
0.93   2588        2401         interview helper, weight loss, recipe, scams
1.09   2315        2521         local government
1.13   888         1002         kerala, malyalam
1.17   581         678          indian humorist
1.31   645         844          users
1.32   2477        3281         startups
1.33   2128        2831         movies
1.35   2913        3923         web, google
1.47   3503        5140         social media, tech blogs
1.47   3161        4651         actresses, movie directors, writers
1.52   5411        8242         twitter, google, wave
1.56   438         685          indian bloggers
1.58   1645        2604         university, job, placement, engineer
1.62   851         1377         stock market
1.72   6925        11892        stock, road, car
1.73   3075        5315         iphone, apps, mobile
1.78   10979       19584        books, reading
1.78   2125        3792         sports mostly cricket
1.92   474         909          property, real estate
2.01   1047        2104         140mafia, blogging, wordpress
2.04   624         1270         twitpic, random

Page 25: Twitter LDA

Results: India General

Ratio  freq India  freq random  Annotation
2.08   6785        14129        slang words
2.11   992         2089         photo, flickr, tumblr
2.14   585         1252         medicine, hospital
2.19   9632        21073        time related words, marning, night, sleep
2.25   1263        2840         contests, tata, channelV, prize
2.3    807         1857         video, youtube, star news, desitv
2.37   517         1224         movie related english words
2.39   1156        2760         hindi words
2.46   3454        8503         english words
2.51   995         2500         pak, china, pakistan, taliban
2.55   800         2043         horoscopes, leo, twitterscope
2.62   1750        4586         followers, check in, twitter
2.72   565         1539         random
2.94   2208        6491         mobsterworld, twit slang small words
3.08   761         2344         industry/forex/banking
3.29   1361        4482         addthi, via, bbc, iran election, climate hindu, world
3.77   949         3575         f1 related
4.05   1533        6206         khan, film, kapoor, actor
5.13   2065        10595        fm, song, music, radio services
5.26   689         3625         ff, tr, trim, digsby, chatting and IM
5.36   4959        26566        english words
5.59   3268        18280        tonight, music party, rock, art
6.54   5196        33971        english words
7.02   714         5014         kindle, amazon, michael jackson
16.74  368         6161         hp, laptops related

Page 26: Twitter LDA

Results: Common/Shared Topics

       Topics in Portland community              Topics in Minnesota community             Common topics
       topic 1       topic 2       topic 3       topic 1       topic 2       topic 3       topic 1       topic 2       topic 3
words  blazer        bike          p2            viking        grizzly       dbrauer       chicken       health        iphone
       game          bikeportland  hcr           favre         tpp           bcollinsmn    recipe        Obama         app
       Portland      ride          tcot          game          teaparty      story         cook          reform        kr
       duck          trimet        maddow        football      palin         mpr           cheese        care          flic
       Team          Bik           Cont          vike          glennbeck     mn            wine          healthcare    apple

Page 27: Twitter LDA

Results: Classifier performance

           India  Portland  Minnesota
Accuracy   0.79   0.88      0.77
Precision  0.83   0.88      0.86
Recall     0.72   0.63      0.64

Page 29: Twitter LDA

Discussion

Page 31: Twitter LDA

Future Work

• Use Hierarchical Dirichlet Processes to determine the number of topics automatically.
• Use the online version of LDA currently being developed by David Blei at Princeton, which would make it possible to generate topic distributions over the whole Twitter dataset.