Page 1: Liangjie Hong  and Brian D. Davison Computer Science and Engineering Lehigh University Bethlehem, PA  USA

Empirical Study of Topic Modeling in Twitter

Liangjie Hong and Brian D. Davison
Computer Science and Engineering
Lehigh University
Bethlehem, PA USA

SOMA 2010

Why do we care about text modeling in Twitter?

• Understanding users’ interests
• Understanding social network
• Identifying emerging topics

Problems

• Tweets are too short (140 characters)
• Hash tags
• Abbreviations
• Multiple languages

Question

How can we train an “effective” standard topic model?

We found

• Topics learned by different aggregation strategies are substantially different

• Training the model at the user level is faster

• Learned topics can help classification tasks

A quick review of topic models

LDA
Author-Topic

Our goal

Obtain topic mixtures for both tweets and users

Training Schemes

• Train on tweets
  • Infer users + tweets

• Train on aggregated tweets (by users)
  • Infer tweets

• Train on aggregated tweets (by terms)
  • Infer users + tweets

• Author-Topic model
  • Infer tweets
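The first three schemes differ only in what counts as a training document. A minimal sketch of how those documents might be built follows; the toy tweets, the whitespace tokenizer, and the function names are all assumptions made for illustration, and the by-term grouping (one document per term, made of every tweet containing that term) is one plausible reading of the slide rather than a confirmed detail.

```python
from collections import defaultdict

# Toy data: (user, tweet text). Users and texts are invented for illustration.
tweets = [
    ("alice", "topic models for twitter"),
    ("alice", "lda on short text"),
    ("bob", "twitter is short text"),
]

def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real tweets need more care).
    return text.lower().split()

def docs_by_tweet(tweets):
    # Scheme 1: every tweet is its own training document.
    return [tokenize(text) for _, text in tweets]

def docs_by_user(tweets):
    # Scheme 2: concatenate all tweets of a user into one document.
    profiles = defaultdict(list)
    for user, text in tweets:
        profiles[user].extend(tokenize(text))
    return dict(profiles)

def docs_by_term(tweets):
    # Scheme 3 (assumed reading): for each term, concatenate every tweet
    # that contains the term into one document.
    profiles = defaultdict(list)
    for _, text in tweets:
        tokens = tokenize(text)
        for term in set(tokens):
            profiles[term].extend(tokens)
    return dict(profiles)
```

Any of the three outputs could then be handed to a standard LDA implementation as its corpus; only the document boundaries change.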

Datasets

• 1,992,758 tweets + 514,130 users
• 3,697,498 terms

• 274 verified users from Twitter Suggestion
• 16 categories
• 50,447 tweets (150 tweets per user)

Tasks

• Topic modeling

• Retweet Prediction
• User & Tweets Topical Classification

Logistic Regression
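As a rough illustration of how topic mixtures could feed a logistic regression classifier, here is a small pure-Python sketch trained with batch gradient descent. The two-topic feature vectors, the binary labels, the learning rate, and the epoch count are all invented for this example; the slides do not specify the feature set or the solver used.

```python
import math

# Hypothetical 2-topic mixtures for four users (values invented for illustration).
features = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
labels = [1, 1, 0, 0]  # hypothetical binary category labels

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, lr=0.5, epochs=200):
    # Batch gradient descent on the mean logistic loss.
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for i, v in enumerate(x):
                grad_w[i] += error * v
            grad_b += error
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias

def predict(weights, bias, x):
    # Class 1 when the predicted probability is at least 0.5.
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0

weights, bias = train(features, labels)
```

With more than two categories, the same idea would need a multinomial or one-vs-rest variant, which this sketch does not cover.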

Topic Modeling

Retweet Prediction

Positive examples

@Jon Hello World (2009-11-01 13:15pm)

Hello World (2009-11-01 12:00pm)

@Kim @Jon Hello World (2009-11-01 13:23pm)

@Frank @Kim @Jon Hello World (2009-11-01 17:49pm)

Negative examples

Tweets Classification

User Classification

Conclusion

• User Level Aggregation is helpful
  • Fast and good result

• Author-Topic model does not directly apply

• Topic Modeling can help other tasks
  • tweets classification

Thank you, and thanks to the IBM Travel Grant!

Contact Info:
Liangjie [email protected] Laboratory
Computer Science and Engineering
Lehigh University
Bethlehem, PA 18015 USA
