Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of...
![Page 1: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/1.jpg)
Open Domain Event Extraction from Twitter
Alan Ritter, Mausam, Oren Etzioni, Sam Clark
University of Washington
![Page 2: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/2.jpg)
Goal: Realtime Stream of Structured Information About Events
• iPad (TYPE: Launch, DATE: Mar 7)
• Steve Jobs (TYPE: Death, DATE: Oct 6)
• Yelp (TYPE: IPO, DATE: March 2)
[Events shown along a TIME axis]
Q: Which set of events do we want to know about?
Q: How soon can we know about an event?
![Page 3: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/3.jpg)
Q: Where might we find information about events unfolding in the world?
• Structured / Semi-Structured data sources?
– Facebook / Eventbrite
– HTML / Wrapper induction
• Natural Language / Text?
– News articles
– Status Messages / Twitter
Claim: This is worth investigating
![Page 4: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/4.jpg)
Calendar Demo
http://statuscalendar.com
• Extract Named Entities
– 1% sample of global Twitter stream
– 2.5 Million / Day
– Using NER trained on Labeled Tweets [Ritter et al. EMNLP 2011]
• Extract and Resolve Temporal Expressions
– For example “Next Friday” = 09-09-11
• Count Entity/Day co-occurrences
– G² Log Likelihood Ratio
• Plot Top K Entities for Each Day
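The co-occurrence step can be illustrated with a minimal sketch of the G² log-likelihood ratio over a 2x2 contingency table (entity present or absent vs. date present or absent). The function names and the count-based interface are assumptions made for illustration; the slides do not show the demo's actual implementation.

```python
import math


def g2_statistic(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (0 * log 0 := 0)
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def entity_date_g2(n_both, n_entity, n_date, n_total):
    """Hypothetical helper: build the table from marginal tweet counts.

    n_both   - tweets mentioning the entity and referring to the date
    n_entity - tweets mentioning the entity
    n_date   - tweets referring to the date
    n_total  - all tweets considered
    """
    k11 = n_both
    k12 = n_entity - n_both
    k21 = n_date - n_both
    k22 = n_total - n_entity - n_date + n_both
    return g2_statistic(k11, k12, k21, k22)
```

A higher score means the entity and the date co-occur more than independence would predict, so for each day the entities can be sorted by this score and the top K kept. G² also grows when the two co-occur less than expected, so a real ranking would likely check the direction of the association as well.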
![Page 5: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/5.jpg)
Background: Event Extraction from Newswire
• Historically, the most important source of info on current events
– Since spread of printing press
• Lots of previous work on Newswire
– Timebank
– MUC & ACE competitions
• Limited to narrow domains
• Performance is still not great
![Page 6: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/6.jpg)
Social Media
• Competing source of info on current events
• Status Messages
– Short
– Easy to write (even on mobile devices)
– Instantly and widely disseminated
• Double-Edged Sword
– Many irrelevant messages
– Many redundant messages
Information Overload
![Page 7: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/7.jpg)
Noisy Text: Challenges
• Lexical Variation (misspellings, abbreviations)
– `2m', `2ma', `2mar', `2mara', `2maro', `2marrow', `2mor', `2mora', `2moro', `2morow', `2morr', `2morro', `2morrow', `2moz', `2mr', `2mro', `2mrrw', `2mrw', `2mw', `tmmrw', `tmo', `tmoro', `tmorrow', `tmoz', `tmr', `tmro', `tmrow', `tmrrow', `tmrrw', `tmrw', `tmrww', `tmw', `tomaro', `tomarow', `tomarro', `tomarrow', `tomm', `tommarow', `tommarrow', `tommoro', `tommorow', `tommorrow', `tommorw', `tommrow', `tomo', `tomolo', `tomoro', `tomorow', `tomorro', `tomorrw', `tomoz', `tomrw', `tomz'
• Unreliable Capitalization
– “The Hobbit has FINALLY started filming! I cannot wait!”
• Unique Grammar
– “watchng american dad.”
![Page 8: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/8.jpg)
Off The Shelf NLP Tools Fail
![Page 9: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/9.jpg)
Off The Shelf NLP Tools Fail
Twitter Has Noisy & Unique Style
![Page 10: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/10.jpg)
Annotating Named Entities
• Annotated 2400 tweets (about 34K tokens)
• Train on in-domain data
![Page 11: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/11.jpg)
Learning
• Sequence Labeling Task
• IOB encoding
• Conditional Random Fields
• Features:
– Orthographic
– Dictionaries
– Contextual
| Word | Label |
|---|---|
| T-Mobile | B-ENTITY |
| to | O |
| release | O |
| Dell | B-ENTITY |
| Streak | I-ENTITY |
| 7 | I-ENTITY |
| on | O |
| Feb | O |
| 2nd | O |
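As a minimal sketch of the IOB encoding used above, the helpers below convert entity spans to tags and back. The function names and the (start, end) span convention with an exclusive end are assumptions for illustration and are not taken from the twitter_nlp code; the CRF itself is not shown.

```python
def spans_to_iob(tokens, spans, label="ENTITY"):
    """Encode (start, end) token spans (end exclusive) as IOB tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of the segment
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # tokens inside the segment
    return tags


def iob_to_spans(tags):
    """Decode IOB tags back into (start, end) spans (end exclusive)."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        # Anything other than an I- tag closes the currently open segment.
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


tokens = ["T-Mobile", "to", "release", "Dell", "Streak", "7", "on", "Feb", "2nd"]
tags = spans_to_iob(tokens, [(0, 1), (3, 6)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With the spans (0, 1) and (3, 6), this reproduces the labels in the table: "T-Mobile" and "Dell Streak 7" are the two entity segments, and every other token is tagged O.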
![Page 12: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/12.jpg)
Performance (NE Segmentation)
[Bar chart: precision (P), recall (R), and F-score (F) for Stanford NER vs. T-NER; y-axis from 0 to 0.8]
![Page 13: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/13.jpg)
Event-Referring Phrases
Examples:
Apple to Announce iPhone 5 on October 4th! YES!
iPhone 5 announcement coming Oct 4th
WOOOHOO NEW IPHONE TODAY! CAN’T WAIT!
• Useful to display in connection with events
– E.g. “Steve Jobs” + “died” + “October 6”
• Helpful in categorizing Events into Types
![Page 14: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/14.jpg)
Event Phrases: Annotation/Learning
• Annotated 1,000 tweets (19,484 tokens)
• Similar to EVENT tags in TimeBank
• Sequence-labeling problem
– IOB Encoding
– Conditional Random Fields
![Page 15: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/15.jpg)
Event Segmentation Results
![Page 16: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/16.jpg)
Event Representation
![Page 17: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/17.jpg)
Categorizing Event Types
• Would like to categorize events into types, for example:
– Sports
– Politics
– Product releases
– …
• Benefits:
– Allow more customized calendars
– Could be useful in upstream tasks
![Page 18: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/18.jpg)
Classifying Events: Challenges
• Many Different Types
• Not sure what is the right set of types
• Set of types might change
– Might start talking about different things
– Might want to focus on different groups of users
![Page 19: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/19.jpg)
Solution: Unsupervised Event Type Induction
• Latent Variable Models
– Generative Probabilistic Models
• Advantages:
– Discovers types which match the data
– No need to annotate individual events
– Don’t need to commit to a specific set of types
– Modular, can integrate into various applications
![Page 20: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/20.jpg)
Each Event Phrase is modeled as a mixture of types
Each Event Type is Associated with a Distribution over Entities and Dates
P(SPORTS|cheered) = 0.6
P(POLITICS|cheered) = 0.4
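To illustrate how a phrase-level mixture and type-level entity distributions could combine, here is a small sketch that applies Bayes' rule to one (phrase, entity) pair. The mixture for "cheered" comes from the slide. The entity probabilities and the function name are made up for illustration, and this is not claimed to be the authors' inference procedure.

```python
def type_posterior(phrase_mixture, entity_given_type, entity):
    """P(type | phrase, entity), proportional to P(type | phrase) * P(entity | type)."""
    scores = {
        t: p * entity_given_type[t].get(entity, 0.0)
        for t, p in phrase_mixture.items()
    }
    norm = sum(scores.values())
    if norm == 0.0:
        # Entity unseen under every type: fall back to the phrase mixture.
        return dict(phrase_mixture)
    return {t: s / norm for t, s in scores.items()}


# Mixture for the phrase "cheered", as on the slide.
cheered = {"SPORTS": 0.6, "POLITICS": 0.4}

# Hypothetical per-type entity distributions (illustrative values only).
entity_given_type = {
    "SPORTS": {"Lakers": 0.05, "Obama": 0.001},
    "POLITICS": {"Lakers": 0.001, "Obama": 0.08},
}

posterior = type_posterior(cheered, entity_given_type, "Obama")
```

Although "cheered" on its own leans towards SPORTS, an entity that is far more probable under POLITICS shifts most of the posterior mass to POLITICS. A distribution over dates would enter the product in the same way.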
![Page 21: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/21.jpg)
Details…
• Gathered about 65M (entity, event, date) tuples
• Collapsed Gibbs Sampling
– 1,000 iterations of burn-in
– Parallelized sampling (approximation) using MPI [Newman et al. 2009]
• 100 Event Types
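The following is a toy, single-process sketch of collapsed Gibbs sampling over (entity, event, date) tuples. It assumes each tuple has a single latent type that generates both its entity and its date, with symmetric Dirichlet priors `alpha` and `beta`. These modelling details, the hyperparameter values, and all names are assumptions for illustration; the MPI parallelization mentioned above is not reproduced.

```python
import random
from collections import defaultdict


def gibbs_event_types(tuples, num_types, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; returns type assignments and phrase/type counts."""
    rng = random.Random(seed)
    num_entities = len({e for e, _, _ in tuples})
    num_dates = len({d for _, _, d in tuples})
    n_pk = defaultdict(int)   # (event phrase, type) counts
    n_ke = defaultdict(int)   # (type, entity) counts
    n_kd = defaultdict(int)   # (type, date) counts
    n_k = [0] * num_types     # tuples currently assigned to each type
    z = []

    def update(i, k, delta):
        entity, phrase, date = tuples[i]
        n_pk[phrase, k] += delta
        n_ke[k, entity] += delta
        n_kd[k, date] += delta
        n_k[k] += delta

    for i in range(len(tuples)):            # random initialization
        z.append(rng.randrange(num_types))
        update(i, z[i], +1)

    for _ in range(iterations):
        for i, (entity, phrase, date) in enumerate(tuples):
            update(i, z[i], -1)             # remove the current assignment
            weights = [
                (n_pk[phrase, k] + alpha)
                * (n_ke[k, entity] + beta) / (n_k[k] + beta * num_entities)
                * (n_kd[k, date] + beta) / (n_k[k] + beta * num_dates)
                for k in range(num_types)
            ]
            z[i] = rng.choices(range(num_types), weights=weights)[0]
            update(i, z[i], +1)             # add the newly sampled assignment
    return z, n_pk


# Made-up toy data in (entity, event, date) order.
data = [
    ("Lakers", "won", "Mar 7"), ("Lakers", "cheered", "Mar 7"),
    ("Obama", "elected", "Nov 6"), ("Obama", "cheered", "Nov 6"),
]
assignments, phrase_type_counts = gibbs_event_types(data, num_types=2, iterations=50)
```

After burn-in, normalizing the smoothed counts `n_pk[phrase, k] + alpha` over k for a fixed phrase gives an estimate of that phrase's mixture over types.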
![Page 22: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/22.jpg)
![Page 23: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/23.jpg)
Experiment: Categorizing Events
• Randomly Sampled 500 (entity, date) pairs
• Annotated with event types
– Using types discovered by the topic model
• Baseline:
– Supervised classification using 10-fold cross validation
– Treat event phrases like bag of words
![Page 24: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/24.jpg)
Event Classification Performance
![Page 25: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/25.jpg)
End-to-end Evaluation
• Evaluate end-to-end calendar entries
• Collect tweets up to cutoff date
• Extract Named Entities, Event Phrases, Temporal Expressions
• Classify Event Type
• Rank Events
• Pick top K events occurring in a 2 week future window
• Evaluate Precision
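The last three steps can be sketched as follows: keep events dated within the two weeks after the cutoff, rank them by a score, take the top K, and compute precision against manual judgments. The tuple layout, the score (for example an association statistic), the example values, and the function names are all assumptions for illustration.

```python
from datetime import date, timedelta


def top_k_upcoming(events, cutoff, k, window_days=14):
    """events: list of (entity, event_date, score); returns the top-k in the window."""
    window_end = cutoff + timedelta(days=window_days)
    upcoming = [ev for ev in events if cutoff < ev[1] <= window_end]
    return sorted(upcoming, key=lambda ev: ev[2], reverse=True)[:k]


def precision(predicted, is_correct):
    """Fraction of predicted calendar entries judged correct."""
    if not predicted:
        return 0.0
    return sum(1 for ev in predicted if is_correct(ev)) / len(predicted)


# Illustrative data only: the year and the scores are made up.
events = [
    ("iPad", date(2012, 3, 7), 30.0),
    ("Yelp", date(2012, 3, 2), 12.5),
    ("past event", date(2012, 2, 1), 99.0),    # before the cutoff
    ("far future", date(2012, 6, 1), 50.0),    # outside the 2-week window
]
top = top_k_upcoming(events, cutoff=date(2012, 2, 27), k=2)
judged_correct = {"iPad"}                      # hypothetical manual judgments
p_at_2 = precision(top, lambda ev: ev[0] in judged_correct)
```

Filtering before ranking means that high-scoring events outside the window do not displace entries the calendar would actually show.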
![Page 26: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/26.jpg)
End-to-end Evaluation
• Ngram Baseline
– No Named Entity Recognition
– Rely on significance test to rank ngrams
– A few extra heuristics (filter out temporal expressions etc.)
![Page 27: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/27.jpg)
Shallow Syntax in Tweets: Difficult
• Performance Lower than News:
– Entity Segmentation
– Event Phrase Segmentation
• But:
– Tweets are self contained
– Short (usually not very complicated)
– Simple discourse structure
– Meant to be understood in isolation
– More data
![Page 28: Open Domain Event Extraction from Twitter Alan Ritter Mausam, Oren Etzioni, Sam Clark University of Washington.](https://reader036.fdocuments.us/reader036/viewer/2022062404/5516fc7e550346fe558b4e7b/html5/thumbnails/28.jpg)
Contributions
• Analysis of challenges in noisy text
• Adapted NLP tools to Twitter: http://github.com/aritter/twitter_nlp
• Calendar Demo: http://statuscalendar.com
• Unsupervised Event Categorization
– Discovers types which match the data
THANKS!