Towards Detecting Influenza Epidemics by Analyzing Twitter Messages
Aron Culotta
Jedsada Chartree
Introduction
• Growing interest in monitoring disease outbreaks.
• Growing number of Twitter users
- February 2010: 50 million tweets/day
- June 2010: 65 million tweets/day (750 tweets/s)
- 190 million users
Source: http://en.wikipedia.org/wiki/Twitter
Introduction
• Twitter is a website that offers a social networking and micro-blogging service.
- Users send and read messages called “tweets” (up to 140 characters).
Introduction
• Advantages of Twitter for this research
- Full messages provide more information than queries.
- Twitter profiles contain more detail to analyze (city, state, gender, age).
- Diversity of Twitter users.
Methodology
• Data
- Collected 574,643 messages over 10 weeks (February 12, 2010 to April 24, 2010).
- The US Centers for Disease Control and Prevention (CDC) publishes weekly statistics from the US Outpatient Influenza-like Illness Surveillance Network (ILINet).
Methodology
Ground-truth ILI rates obtained from the CDC statistics
Methodology
• Regression Models
1. Simple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(W, D)) + \beta_2 + \varepsilon$$

P = the proportion of the population exhibiting ILI symptoms
$\beta_1, \beta_2$ = the coefficients
$\varepsilon$ = the error term
$Q(W, D) = \frac{D_W}{|D|}$ = the fraction of documents in D that match W
D = a document collection
$D_W$ = the document frequency for word W
$\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$
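A minimal sketch of fitting the simple model by ordinary least squares in logit space. The function names and the weekly values are hypothetical and for illustration only; they are not data or code from the study.

```python
import numpy as np

def logit(x):
    """Log-odds transform ln(x / (1 - x)), defined for 0 < x < 1."""
    x = np.asarray(x, dtype=float)
    return np.log(x / (1.0 - x))

def inv_logit(z):
    """Inverse of logit: maps a real number back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def fit_simple(q, p):
    """Least-squares fit of logit(p) = beta1 * logit(q) + beta2."""
    beta1, beta2 = np.polyfit(logit(q), logit(p), deg=1)
    return beta1, beta2

# Hypothetical weekly values (assumed, not taken from the slides).
q = np.array([0.010, 0.012, 0.015, 0.011, 0.009])  # Q(W, D) per week
p = np.array([0.020, 0.023, 0.028, 0.021, 0.018])  # ILI rate per week

beta1, beta2 = fit_simple(q, p)
p_hat = inv_logit(beta1 * logit(q) + beta2)  # fitted ILI rates
```

Working in logit space keeps the fitted rates inside (0, 1) after the inverse transform.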
Methodology
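• Regression Models
2. Multiple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(\{W_1\}, D)) + \dots + \beta_k \operatorname{logit}(Q(\{W_k\}, D)) + \beta_{k+1} + \varepsilon$$

P = the proportion of the population exhibiting ILI symptoms
$\beta_1, \dots, \beta_{k+1}$ = the coefficients
$\varepsilon$ = the error term
$Q(\{W_i\}, D) = \frac{D_{W_i}}{|D|}$ = the fraction of documents in D that match $W_i$
D = a document collection
$D_{W_i}$ = the document frequency for word $W_i$
$\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$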
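The multiple-regression model can be sketched the same way, with one logit-transformed column per keyword plus an intercept column. This is an illustrative sketch under the assumption of an ordinary least-squares fit; `fit_multiple` and `predict` are hypothetical names.

```python
import numpy as np

def logit(x):
    """Log-odds transform ln(x / (1 - x)), defined for 0 < x < 1."""
    x = np.asarray(x, dtype=float)
    return np.log(x / (1.0 - x))

def design_matrix(Q):
    """One logit column per keyword, plus a column of ones for the intercept."""
    Q = np.asarray(Q, dtype=float)
    return np.column_stack([logit(Q), np.ones(Q.shape[0])])

def fit_multiple(Q, p):
    """Q: (weeks, k) per-keyword fractions; p: (weeks,) ILI rates.

    Returns k + 1 coefficients; the last one is the intercept.
    """
    beta, *_ = np.linalg.lstsq(design_matrix(Q), logit(p), rcond=None)
    return beta

def predict(Q, beta):
    """Predicted ILI rates for the given keyword fractions."""
    z = design_matrix(Q) @ np.asarray(beta, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))
```

With only about ten weekly observations, as in this study's collection period, adding many keywords quickly risks overfitting, so k should stay small.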
Methodology
• Keyword Selection
1. Correlation Coefficient
- Simple linear regression model evaluation
2. Residual Sum of Squares (RSS)
- Measures the discrepancy between the data and an estimation model

$$\mathrm{RSS}(P, \hat{P}) = \sum_i (p_i - \hat{p}_i)^2$$
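The two selection criteria above can be sketched as a ranking over candidate keywords. The sketch assumes each keyword already has a series of fitted ILI rates from its own simple regression model; the `predictions` dictionary and the function names are hypothetical.

```python
import numpy as np

def rss(p, p_hat):
    """Residual sum of squares between observed and estimated rates."""
    p = np.asarray(p, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return float(np.sum((p - p_hat) ** 2))

def pearson_r(a, b):
    """Pearson correlation coefficient between two series."""
    return float(np.corrcoef(a, b)[0, 1])

def rank_keywords(predictions, p, criterion="rss"):
    """Order keywords from best to worst under the chosen criterion.

    predictions: dict mapping keyword -> fitted ILI series for that keyword.
    """
    if criterion == "rss":
        scores = {w: rss(p, p_hat) for w, p_hat in predictions.items()}
        return sorted(scores, key=scores.get)  # lowest RSS first
    if criterion == "corr":
        scores = {w: pearson_r(p, p_hat) for w, p_hat in predictions.items()}
        return sorted(scores, key=scores.get, reverse=True)  # highest r first
    raise ValueError(f"unknown criterion: {criterion}")
```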
Methodology
• Keyword Generation
1. Hand-chosen keywords (flu, cough, sore throat, headache)
2. Most frequent keywords
- Search for all documents containing any of the hand-chosen keywords.
- Find the top 5,000 most frequently occurring words.
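The "most frequent keywords" step can be sketched as follows. The tokenizer and the substring matching of seed keywords are simplifying assumptions (for example, "flu" would also match "influence"), not the procedure used in the study.

```python
import re
from collections import Counter

# Hand-chosen seed keywords listed on the slide.
HAND_CHOSEN = ("flu", "cough", "sore throat", "headache")

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequent_keywords(documents, seeds=HAND_CHOSEN, top_n=5000):
    """Return the top_n most frequent words among documents matching any seed."""
    counts = Counter()
    for doc in documents:
        lowered = doc.lower()
        # Substring match is a simplification of "contains the keyword".
        if any(seed in lowered for seed in seeds):
            counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(top_n)]
```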
Methodology
• Document Filtering
- Apply logistic regression to predict whether a Twitter message is reporting an ILI symptom.

$$p(y_i = 1 \mid x_i; \theta) = \frac{1}{1 + e^{-x_i \cdot \theta}}$$

$y_i$ = a binary random variable (1 if document $D_i$ is positive, 0 otherwise)
$x_i = \{x_{ij}\}$, where $x_{ij}$ = the number of times word j appears in document i
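A minimal sketch of evaluating the logistic model for one message, assuming a fixed vocabulary and an already-trained weight vector. The vocabulary, the weights, and the whitespace tokenization are hypothetical; training the weights is not shown.

```python
import math
from collections import Counter

def featurize(text, vocabulary):
    """Word-count vector x_i: how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def prob_positive(x, theta):
    """p(y = 1 | x; theta) = 1 / (1 + exp(-x . theta)), computed stably."""
    z = sum(xj * tj for xj, tj in zip(x, theta))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow when z is very negative
    return ez / (1.0 + ez)

# Hypothetical vocabulary and weights, for illustration only.
vocabulary = ["flu", "cough", "shot"]
theta = [1.2, 0.8, -1.5]

x = featurize("caught the flu and a bad cough", vocabulary)
is_ili_report = prob_positive(x, theta) >= 0.5
```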
Methodology
• Classification evaluation
- Accuracy
- Precision
- Recall
- F-measure

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
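The four evaluation measures can be computed from binary labels as in the sketch below. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-measure for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```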
Results
• Document Filtering
Evaluation of message classification, with standard errors in parentheses
Results
• Regression
The 10 different systems evaluated
Results
• Regression
The regression coefficient (r), residual sum of squares (RSS), and standard error of each system
Results
Results for multi-hand-rss(2)
Results for classification-hand
Results
Results for multi-freq-rss(3)
Results for simple-hand-rss(1)
Results
Correlation results for simple-hand-rss and multi-hand-rss
Correlation results for simple-hand-corr and multi-hand-corr
Results
Correlation results for simple-freq-rss and multi-freq-rss
Correlation results for simple-freq-corr and multi-freq-corr
Conclusion
• Several methods to identify influenza-related messages.
• Compare a number of regression models to correlate the messages with CDC statistics.
• The best model achieves a correlation of 0.78.