Towards Detecting Influenza Epidemics by Analyzing Twitter Messages Aron Culotta Jedsada Chartree.

Page 1

Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Aron Culotta

Jedsada Chartree

Page 2

Introduction

• Growing interest in monitoring disease outbreaks.
• Growing number of Twitter users
  - February 2010: 50 million tweets/day
  - June 2010: 65 million tweets/day (750 tweets/s)
  - 190 million users

Source: http://en.wikipedia.org/wiki/Twitter

Page 3

Introduction

• Twitter is a website that offers a social networking and micro-blogging service.
  - Users send and read messages called “tweets” (up to 140 characters).

Page 4

Introduction

• Advantages of Twitter for this research
  - Full messages provide more information than search queries.
  - Twitter profiles contain more detail to analyze (city, state, gender, age).
  - Diversity of Twitter users.

Page 5

Methodology

• Data
  - Collected 574,643 messages over 10 weeks (February 12, 2010 to April 24, 2010).
  - The US Centers for Disease Control and Prevention (CDC) publishes statistics from the US Outpatient Influenza-like Illness Surveillance Network (ILINet).

Page 6

Methodology

Ground-truth ILI rates obtained from the CDC statistics

Page 7

Methodology

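• Regression Models
1. Simple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(W, D)) + \beta_2 + \varepsilon$$

where
- $P$ = the proportion of the population exhibiting ILI symptoms
- $\beta_1, \beta_2$ = the coefficients
- $\varepsilon$ = the error term
- $Q(W, D) = |D_W| / |D|$ = the fraction of documents in $D$ that match $W$
- $D$ = a document collection
- $D_W$ = the documents in $D$ that match $W$, so $|D_W|$ is the document frequency for word $W$
- $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$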
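As a minimal sketch of the simple model logit(P) = β1 logit(Q(W, D)) + β2 + ε, the following fits the two coefficients by ordinary least squares on logit-transformed values. The function names and the weekly numbers are hypothetical, chosen only for illustration; the slides do not say which fitting routine was used.

```python
import math

def logit(x):
    """Log-odds of a proportion x in (0, 1)."""
    return math.log(x / (1.0 - x))

def inv_logit(z):
    """Inverse of logit: maps a real number back to a proportion."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_simple(q_values, p_values):
    """Least-squares fit of logit(P) = b1 * logit(Q) + b2."""
    xs = [logit(q) for q in q_values]
    ys = [logit(p) for p in p_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b2 = mean_y - b1 * mean_x
    return b1, b2

def predict(q, b1, b2):
    """Estimated ILI proportion for a keyword-match fraction q."""
    return inv_logit(b1 * logit(q) + b2)

# Hypothetical weekly values (made up for illustration, not from the study).
q_weekly = [0.010, 0.012, 0.015, 0.011, 0.009]  # fraction of tweets matching W
p_weekly = [0.020, 0.023, 0.028, 0.022, 0.018]  # ILI rate for the same weeks
b1, b2 = fit_simple(q_weekly, p_weekly)
estimate = predict(0.013, b1, b2)
```

Working in logit space keeps the back-transformed estimate inside (0, 1), which suits a quantity that is itself a proportion.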

Page 8

Methodology

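• Regression Models
2. Multiple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(\{W_1\}, D)) + \dots + \beta_k \operatorname{logit}(Q(\{W_k\}, D)) + \beta_{k+1} + \varepsilon$$

where
- $P$ = the proportion of the population exhibiting ILI symptoms
- $\beta_1, \dots, \beta_{k+1}$ = the coefficients
- $\varepsilon$ = the error term
- $Q(\{W_i\}, D) = |D_{W_i}| / |D|$ = the fraction of documents in $D$ that match $W_i$
- $D$ = a document collection
- $D_{W_i}$ = the documents in $D$ that match $W_i$, so $|D_{W_i}|$ is the document frequency for word $W_i$
- $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$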
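The multiple regression, with one logit(Q({Wi}, D)) term per keyword plus an intercept, can be sketched the same way. This example solves the least-squares normal equations with a small Gaussian elimination so that it needs no third-party library; the helper names and the data are hypothetical, and a real analysis would more likely rely on a numerical library.

```python
import math

def logit(x):
    """Log-odds of a proportion x in (0, 1)."""
    return math.log(x / (1.0 - x))

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_multiple(q_rows, p_values):
    """q_rows[t][i] is Q({W_i}, D) for week t.

    Returns [b_1, ..., b_k, b_{k+1}], where b_{k+1} is the intercept.
    """
    design = [[logit(q) for q in row] + [1.0] for row in q_rows]
    targets = [logit(p) for p in p_values]
    k1 = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k1)]
           for i in range(k1)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(k1)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: two keywords observed over five weeks (not from the study).
q_rows = [[0.010, 0.020], [0.015, 0.018], [0.012, 0.030],
          [0.020, 0.025], [0.008, 0.022]]
p_weekly = [0.020, 0.024, 0.026, 0.030, 0.019]
betas = fit_multiple(q_rows, p_weekly)
```

With only ten weekly data points, as in this study, adding many keywords makes the fit prone to overfitting, which is one reason to compare it against the simple model.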

Page 9

Methodology

• Keyword Selection
1. Correlation coefficient
  - Simple linear regression model evaluation
2. Residual sum of squares (RSS)
  - It measures the discrepancy between the data and an estimation model

$$\mathrm{RSS}(P, \hat{P}) = \sum_i (p_i - \hat{p}_i)^2$$
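Both selection criteria are short to compute. The sketch below ranks candidate keywords by the RSS and by the Pearson correlation between the actual ILI rates and each keyword's fitted values; the candidate values are invented for illustration, and the ranking step is an assumption about how the scores would be used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

def rss(actual, predicted):
    """Residual sum of squares: sum over i of (p_i - p_hat_i)^2."""
    return sum((p - p_hat) ** 2 for p, p_hat in zip(actual, predicted))

# Hypothetical fitted values per keyword (not results from the study).
actual = [0.020, 0.023, 0.028, 0.022]
candidates = {
    "flu": [0.021, 0.024, 0.027, 0.022],
    "headache": [0.025, 0.020, 0.022, 0.026],
}

# Lower RSS is better; higher correlation is better.
by_rss = sorted(candidates, key=lambda w: rss(actual, candidates[w]))
by_corr = sorted(candidates, key=lambda w: pearson_r(actual, candidates[w]),
                 reverse=True)
```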

Page 10

Methodology

• Keyword Generation
1. Hand-chosen keywords (flu, cough, sore throat, headache)
2. Most frequent keywords
  - Search for all documents containing any of the hand-chosen keywords.
  - Find the top 5,000 most frequently occurring words.
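The two generation steps could look roughly like this. Lowercasing, whitespace tokenization, and substring matching of the seed keywords are simplifying assumptions (substring matching means "flu" also matches "influence"); the slides do not describe the tokenization actually used.

```python
from collections import Counter

HAND_CHOSEN = ["flu", "cough", "sore throat", "headache"]

def most_frequent_words(messages, seeds=HAND_CHOSEN, top_n=5000):
    """Return the top_n most frequent words among messages that contain
    at least one seed keyword."""
    counts = Counter()
    for text in messages:
        lowered = text.lower()
        # Keep only messages mentioning any hand-chosen keyword.
        if any(seed in lowered for seed in seeds):
            counts.update(lowered.split())
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical messages for illustration.
sample = ["I have the flu", "flu and cough today", "nice weather"]
top_words = most_frequent_words(sample, top_n=3)
```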

Page 11

Methodology

• Document Filtering
  - Apply logistic regression to predict whether a Twitter message is reporting an ILI symptom.

$$p(y_i = 1 \mid x_i; \theta) = \frac{1}{1 + e^{-x_i \cdot \theta}}$$

where
- $y_i$ = a binary random variable (1 if document $D_i$ is positive, 0 otherwise)
- $x_i = \{x_{ij}\}$, with $x_{ij}$ = the number of times word $j$ appears in document $i$
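To make the formula p(y_i = 1 | x_i; θ) = 1 / (1 + e^(−x_i · θ)) concrete, here is a small sketch of the prediction step only. The word-count features follow the definition of x_i; the weights in `theta` are hypothetical and would normally be learned from labeled messages, a step omitted here.

```python
import math
from collections import Counter

def features(text):
    """Bag-of-words vector: x_ij = number of times word j appears."""
    return Counter(text.lower().split())

def prob_positive(x, theta):
    """p(y = 1 | x; theta) = 1 / (1 + exp(-x . theta))."""
    score = sum(count * theta.get(word, 0.0) for word, count in x.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights; in practice these are estimated from training data.
theta = {"flu": 2.0, "sick": 1.5, "shot": -1.0}

p_report = prob_positive(features("home sick with the flu"), theta)
p_other = prob_positive(features("got my flu shot shot today"), theta)
keep_message = p_report > 0.5  # keep messages classified as reporting ILI
```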

Page 12

Methodology

Page 13

Methodology

• Classification evaluation
  - Accuracy
  - Precision
  - Recall
  - F-measure

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
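The four evaluation measures can be computed from the confusion counts as follows. This is a generic sketch with hypothetical labels; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the slides.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, F-measure) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, precision, recall, f_measure

# Hypothetical labels: 1 = message reports an ILI symptom, 0 = it does not.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
accuracy, precision, recall, f_measure = classification_metrics(y_true, y_pred)
```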

Page 14

Results

• Document Filtering

Evaluation of message classification, with standard errors in parentheses

Page 15

Results

• Regression

The 10 different systems evaluated

Page 16

Results

• Regression

The regression coefficient (r), residual sum of squares (RSS), and standard error of each system

Page 17

Results

Results for multi-hand-rss(2)

Results for classification-hand

Page 18

Results

Results for multi-freq-rss(3)

Results for simple-hand-rss(1)

Page 19

Results

Correlation results for simple-hand-rss and multi-hand-rss

Correlation results for simple-hand-corr and multi-hand-corr

Page 20

Results

Correlation results for simple-freq-rss and multi-freq-rss

Correlation results for simple-freq-corr and multi-freq-corr

Page 21

Conclusion

• Several methods are used to identify influenza-related messages.
• A number of regression models are compared to correlate the messages with CDC statistics.
• The best model achieves a correlation of .78.