Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams
![Page 1: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/1.jpg)
Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams
Muhammad Imran (@mimran15) and Carlos Castillo (@ChaToX)
Qatar Computing Research Institute
Doha, Qatar.
SWDM’15 : WWW’15 May 18th 2015
![Page 2: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/2.jpg)
Information Variability on Social Media
• Different events present different information categories
• Even for recurring events, category proportions change
![Page 7: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/7.jpg)
Different Classification Approaches
• Various classification approaches exist:
– Manual classification by human experts
– Automatic classification using unsupervised or supervised approaches (needs training data)
– Hybrid: Automatic + Manual
• Retrospective vs. real-time classification
– Batch processing (offline, training data availability)
– Stream processing (real-time, scarce training data)
![Page 8: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/8.jpg)
Real-time Stream Classification (Supervised)
• Fewer categories are better
– Decreases worker dropout
– More training data for each category, more accuracy
– “7 plus/minus 2” rule [G. A. Miller, 56]
• Categories need to be defined carefully
– Empty categories waste space and workers' effort
– Categories that are too large introduce heterogeneity
![Page 9: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/9.jpg)
Problem Statement
• How can we classify items arriving as a data stream into a small number of categories, if we cannot anticipate exactly which will be the most frequent categories?
Our research improves crowdsourcing-based and supervised learning-based systems (e.g. AIDR) by finding latent categories in fast data streams.
![Page 10: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/10.jpg)
Our Approach (top-down + bottom-up)
1. An expert defines information categories (top-down)
2. Messages are categorized into the initial set plus an extra “Miscellaneous” category
3. Identify relevant and prevalent categories from the messages in the “Miscellaneous” category (bottom-up)
   1. Generate candidate categories
   2. Learn characteristics of good categories
   3. Rank categories on good characteristics
How do we identify relevant categories?
![Page 11: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/11.jpg)
Candidate Generation
We propose to apply Latent Dirichlet Allocation (LDA) on the Miscellaneous category:
• Input: A set of n documents (all messages in the Misc. category) and a number m (# of topics to be generated)
• Output: n x m matrix in which cell (i, j) indicates the extent to which document i corresponds to topic j.
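The input/output contract above can be illustrated with a minimal sketch. The slides do not say which LDA implementation was used, so this is a small collapsed Gibbs sampler written only for illustration; the function name `lda_doc_topic` and the parameters `iters`, `alpha`, `beta`, and `seed` are assumptions, not part of the original work.

```python
import random
from collections import defaultdict


def lda_doc_topic(docs, m, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (an assumption, not the
    authors' implementation). docs is a list of token lists; the result is
    an n x m matrix whose cell (i, j) estimates how much document i
    corresponds to topic j. Each row sums to 1."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    n_dk = [[0] * m for _ in docs]                # topic counts per document
    n_kw = [defaultdict(int) for _ in range(m)]   # word counts per topic
    n_k = [0] * m                                 # total tokens per topic
    z = []                                        # topic of every token

    # Random initial assignment of each token to a topic.
    for i, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(m)
            assignments.append(k)
            n_dk[i][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(assignments)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for i, doc in enumerate(docs):
            for t, w in enumerate(doc):
                k = z[i][t]
                n_dk[i][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[i][j] + alpha)
                    * (n_kw[j][w] + beta) / (n_k[j] + vocab_size * beta)
                    for j in range(m)
                ]
                k = rng.choices(range(m), weights=weights)[0]
                z[i][t] = k
                n_dk[i][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Smoothed, normalized document-topic proportions (the n x m output).
    return [
        [(n_dk[i][j] + alpha) / (len(doc) + m * alpha) for j in range(m)]
        for i, doc in enumerate(docs)
    ]


# Toy "Miscellaneous" messages, already tokenized (made-up data).
misc = [
    ["flood", "water", "rain"],
    ["flood", "rain", "river"],
    ["school", "closed", "exam"],
    ["school", "exam", "class"],
]
doc_topic = lda_doc_topic(misc, m=2)
```

Here `doc_topic` has one row per message and one column per topic. In practice a library implementation would likely be preferable; the sampler is shown only to make the shape of the output concrete.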
![Page 12: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/12.jpg)
Candidate Evaluation
To reduce the workload of experts deciding which categories to pick, we propose the following criteria:
• Volume: a category shouldn’t be too small
• Novelty: a category must not overlap or be too similar to the existing categories
• Cohesiveness (intra- and inter-similarity): a category should be cohesive (should have small intra-topic and large inter-topic values)
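One way these three criteria could be computed is sketched below. The slides do not give the exact formulas, so everything here is an assumption: messages are compared as bag-of-words vectors with cosine similarity, the intra- and inter-topic values are treated as average distances (1 minus cosine), and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def bag(messages):
    """Bag-of-words vector for a list of messages (whitespace tokens)."""
    return Counter(w for msg in messages for w in msg.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def volume(topic_msgs, all_msgs):
    """Share of the Miscellaneous messages that fall in the candidate."""
    return len(topic_msgs) / len(all_msgs)


def novelty(topic_msgs, existing_categories):
    """1 minus the similarity to the closest existing category."""
    t = bag(topic_msgs)
    return 1.0 - max(cosine(t, bag(m)) for m in existing_categories.values())


def intra_distance(topic_msgs):
    """Average pairwise distance inside the candidate (small is cohesive)."""
    pairs = list(combinations(topic_msgs, 2))
    if not pairs:
        return 0.0
    return sum(1 - cosine(bag([a]), bag([b])) for a, b in pairs) / len(pairs)


def inter_distance(topic_msgs, other_msgs):
    """Average distance to messages of other candidates (large is good)."""
    total = sum(
        1 - cosine(bag([a]), bag([b])) for a in topic_msgs for b in other_msgs
    )
    return total / (len(topic_msgs) * len(other_msgs))


# Made-up example data.
topic = ["school closed tomorrow", "school closed today"]
others = ["power outage downtown"]
existing = {"donations": ["donate food and shelter"]}

print(volume(topic, topic + others))
print(novelty(topic, existing))
print(intra_distance(topic), inter_distance(topic, others))
```

Under these assumptions, a candidate looks promising when its volume and novelty are high and its intra-topic distance is clearly smaller than its inter-topic distance.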
![Page 13: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/13.jpg)
Experimental Testing
• We used Twitter data of 17 crises (from the CrisisLexT26 dataset at crisislex.org)
A. Affected individuals: deaths, injuries, missing, found.
B. Infrastructure and utilities: buildings, roads, services damage.
C. Donation and volunteering: needs, requests for food, shelter, supplies.
D. Caution and advice: warnings issued or lifted, guidance and tips.
E. Sympathy and emotional support: thoughts, prayers, gratitude, etc.
Z. Other useful information not covered by any of the above categories.
![Page 14: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/14.jpg)
Candidate Generation Setup
• Applied LDA on the messages in the “Z” category of each crisis
• 5 topics were generated for each crisis
• Considered messages with LDA score > 0.06 in each topic
• Presented the LDA generated topics to experts in random order
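The thresholding and random-ordering steps above can be sketched as follows. The 0.06 cutoff comes from the slide; the function names, the `seed` parameter, and the toy matrix are assumptions made for illustration.

```python
import random


def messages_per_topic(doc_topic, messages, threshold=0.06):
    """Map each topic index to the messages whose score for that topic
    is strictly greater than the threshold."""
    m = len(doc_topic[0])
    return {
        j: [msg for row, msg in zip(doc_topic, messages) if row[j] > threshold]
        for j in range(m)
    }


def presentation_order(topic_ids, seed=None):
    """Return the topic ids in random order, as shown to the experts."""
    order = list(topic_ids)
    random.Random(seed).shuffle(order)
    return order


# Toy document-topic matrix: 3 messages, 3 topics (made-up scores).
scores = [
    [0.90, 0.05, 0.05],
    [0.50, 0.50, 0.00],
    [0.06, 0.04, 0.90],
]
msgs = ["msg a", "msg b", "msg c"]

by_topic = messages_per_topic(scores, msgs)
for topic_id in presentation_order(by_topic, seed=1):
    print(topic_id, by_topic[topic_id])
```

Note that a message can appear under several topics, since the cutoff is applied per topic rather than by picking a single best topic.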
![Page 15: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/15.jpg)
Candidate Annotation Setup
Recruited two experts from two international humanitarian organizations in the crisis response domain
![Page 16: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/16.jpg)
Results
• Topics with avg. score <= 2.5 considered as bad topics
• Topics with avg. score >= 3.5 considered as good topics
• Hit: if the metric value of good topics > bad topics
A crisis is not considered for evaluation if all of its topics receive an average score either below or above 3.0.
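The evaluation rules above can be written down as a short sketch. The 2.5, 3.5, and 3.0 thresholds are taken from the slide; how the metric is aggregated over good and bad topics is not stated, so comparing the mean metric value of each group is an assumption, as are the function names and the example numbers.

```python
def label_topics(avg_scores):
    """Split topics into good (avg >= 3.5) and bad (avg <= 2.5)."""
    good = [t for t, s in avg_scores.items() if s >= 3.5]
    bad = [t for t, s in avg_scores.items() if s <= 2.5]
    return good, bad


def is_evaluable(avg_scores):
    """A crisis is skipped when all its topics score below 3.0,
    or all score above 3.0."""
    scores = list(avg_scores.values())
    all_below = all(s < 3.0 for s in scores)
    all_above = all(s > 3.0 for s in scores)
    return not (all_below or all_above)


def is_hit(avg_scores, metric):
    """True when good topics have a higher mean metric value than bad
    topics (mean aggregation is an assumption). Returns None when either
    group is empty, since no comparison is possible."""
    good, bad = label_topics(avg_scores)
    if not good or not bad:
        return None
    mean_good = sum(metric[t] for t in good) / len(good)
    mean_bad = sum(metric[t] for t in bad) / len(bad)
    return mean_good > mean_bad


# Made-up expert scores and novelty values for one crisis.
expert_avg = {"t1": 4.0, "t2": 2.0, "t3": 3.0}
novelty_values = {"t1": 0.8, "t2": 0.3, "t3": 0.5}

if is_evaluable(expert_avg):
    print(is_hit(expert_avg, novelty_values))
```

Topics scoring strictly between 2.5 and 3.5, such as `t3` here, count as neither good nor bad and do not affect the comparison.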
![Page 17: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/17.jpg)
Conclusion
• Novelty, intra-similarity and cohesiveness are useful in identifying good topics
• Our approach combines top-down (manual) and bottom-up (automatic) elements.
• Learned important characteristics of good topics
• Future work includes candidate ranking, including recommendations for adding, merging, or dropping new, unseen categories
![Page 19: Towards a Data-driven Approach to Identify Crisis-Related Topics in Social Media Streams](https://reader030.fdocuments.us/reader030/viewer/2022032716/55b3adbfbb61ebe3568b46d3/html5/thumbnails/19.jpg)
Thank you!
Authors' contact:
Muhammad Imran @mimran15
Carlos Castillo @ChaToX