Transcript of CS 572: Information Retrieval, Emory University (eugene/cs572/lectures/lecture19-social_media.pdf)
CS 572: Information Retrieval
Web 2.0+: Social Media/CGC
Status
• Final project: updated proposal due by today/tomorrow
• Questions? Discuss at end of class.
3/28/2016 CS 572: Information Retrieval. Spring 2016 2
Eugene Agichtein, Emory University, IR Lab
• Collaboratively Generated Content: CGC
– Creation models, quality
• Twitter/Weibo
• Searching:
– Indexing
– Ranking
Estimates of daily content creation [Ramakrishnan & Tomkins, IEEE Computer 2007]

| Content type | Amount produced / day |
| --- | --- |
| Published (books, magazines, newspapers) | 3-4 Gb |
| Professional Web (paid creation, e.g., corporate Web site) | 2 Gb |
| User-generated (reviews, blogs, personal Web sites) | 8-10 Gb |
| Private text (instant messages, email) | 3000 Gb (= 3 Tb) |
The value of CGC

• Extrinsic value: impact on AI, IR, and NLP applications
• Intrinsic value: resources for the people
  – How to make the data more useful (e.g., increase user engagement)
• Synergistic view of CGC
  – AI techniques can be used to improve the quality of CGC repositories,
  – which can in turn improve the performance of other AI applications
Extrinsic value: sources of knowledge

Before:
• articles: 65,000 (since 1768)
• WordNet entries: 150,000 (since 1985)
• CYC assertions: 4,600,000 (since 1984)

After (Collaboratively Generated Content):
• 3,600,000 articles (since 2001)
• 2,500,000 entries (since 2002)
• 1,000,000,000 Q&A (since 2005)
• 4,000,000,000 images (since 2004)
Extrinsic value: what we can do with all this knowledge

• Under the hood
  – CGC as an additional (huge!) corpus
    • Better term statistics; observe the document authoring process
  – Repositories of knowledge
    • Concepts as features (Bag of Words + Concepts)
    • Concepts as word senses (way beyond WordNet)
    • More comprehensive lexicons, gazetteers
    • Extending existing knowledge repositories (e.g., WordNet)
• Higher-level tasks
  – (Concept-based) information retrieval
  – Question answering
(Text) Social Media Today

• Published: 4 Gb/day; Social Media: 10 Gb/day
• Yahoo Answers: 120M users, 40M questions, 1B answers

“Yes, we could read your blog. Or, you could tell us about your day.”
Models for CGC generation: Overview

• Wikipedia: contribution, curation, and coordination
  – http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia%27s_growth
• Community forums: Yahoo! Answers, focused forums
• Sharing: Twitter
Models of CGC Generation: Link Creation
“Rich-Get-Richer” (for a while ...)

• Preferential attachment: more popular Wikipedia pages are more likely to acquire links (up to a point!) [Capocci et al. ’06]
• Flickr and other sites also show strong evidence of preferential attachment [Mislove et al. ’08]

[Figure: new links added daily vs. page popularity]
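The rich-get-richer dynamic can be illustrated with a toy simulation (a sketch; the page count, link count, and bias term are invented for illustration):

```python
import random

def simulate_preferential_attachment(n_pages=50, n_links=2000, seed=7):
    """Each new link picks a target page with probability proportional
    to (1 + current in-degree): already-popular pages attract more links."""
    rng = random.Random(seed)
    degree = [0] * n_pages
    for _ in range(n_links):
        weights = [1 + d for d in degree]  # the "rich-get-richer" bias
        target = rng.choices(range(n_pages), weights=weights, k=1)[0]
        degree[target] += 1
    return sorted(degree, reverse=True)

degrees = simulate_preferential_attachment()
# The resulting degree distribution is highly skewed:
# a few pages soak up far more links than the uniform share.
```

Running this repeatedly shows the characteristic heavy-tailed distribution that the slide's Wikipedia and Flickr studies observed, though the real data also shows the growth tapering off ("up to a point").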
Models of CGC Generation: Edits
“Rich-Get-Richer” (again, up to a point)

• Popular Wiki pages (> 1000 edits): editing activity initially increases, then drops off
• Later on, arriving visitors are less able to contribute novel information
[Das and Magdon-Ismail ’10]
Models of CGC generation: Slowing Growth [Suh et al., WikiSym 2009]

• Article growth (# articles) per month, extrapolated to slow down dramatically by 2013
• Increased coordination and overhead, resistance to new edits
Models for CGC generation: Wikipedia conflict and coordination [Kittur & Kraut, 2008]

• Explicit coordination: discussion page
• Implicit coordination: leading the effort
  – Example: “Music of Italy” page: the bulk of edits are made by TUF-KAT and Jeffmatt
• Page quality increases with the number of editors iff effort is concentrated; otherwise, more editors → higher overhead, lower quality
Models of CGC Generation: Content Visibility and Evolution

• Dynamics of votes on selected Digg stories over a period of 4 days [Lerman ’07]
• Dynamics of Y! Answers posts over a period of 4 days [Aji et al. ’10]
Models for CGC Generation: Forums (info seekers vs. info providers)

• In Yahoo! Answers, question and answer quality are strongly related: bad questions attract bad or mediocre answers [Agichtein et al. ’08]
• Askers (Java forums) prefer help from users with just slightly more expertise [Zhang et al. ’07]
Lots of content... Do we trust it?
On the Internet, nobody knows you are a dog.
CGC Quality Filtering (Overview)
• Multiple dimensions of CGC content to exploit:
– Content persistence (Wikipedia)
– Surface quality (language-wise): all CGC
– Metadata: Author reliability, geo-spatial edit patterns
– Citation/link-based estimation of authority
• Most effective systems use a combination of all of the above
Content-Driven Author Reputation [Adler & de Alfaro, WWW 2007]
Automatic Vandalism Detection
• Hoaxes
• Libelous content
• Malware distribution
• Errors picked up by mainstream media!
• Wikipedia “truthiness”
Vandalized Ben Franklin Wikipedia page
Assigning Trust to Wikipedia Text [Adler et al., WikiSym 2008]

• Estimating text trust from edit history information
• Trust is computed from:
  1) Insertion: trust value for newly inserted text (E)
  2) Edge effect: the text at the edge of edited blocks has the same trust as newly inserted text
  3) Revision: text trust may increase if the author's reputation gets higher

Online demo and code are available: http://wikitrust.soe.ucsc.edu/
[Figure: page text colored by trust, before and after an edit]
NLP-based Vandalism Detection
• Use language modeling to predict edit quality
– Lexical: vocabulary, regular expression/lists
– Syntactic: N-gram analysis using part-of-speech tags
– Semantic: N-gram analysis with hand-picked words
[Wang et al., COLING 2010]
Figure adapted from Andrew G. West, 2010
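A minimal sketch of the lexical layer of such a detector: regular-expression features over the inserted text of an edit. The word list, patterns, and example edits below are invented for illustration; real systems use curated lists and many more feature classes:

```python
import re

# Illustrative lexical cues only (hypothetical list).
VULGARITY = re.compile(r"\b(stupid|dumb|sucks)\b", re.IGNORECASE)
SHOUTING = re.compile(r"[A-Z]{5,}")          # long runs of capital letters
REPEATED_CHARS = re.compile(r"(.)\1{4,}")    # e.g. "loooool"

def lexical_vandalism_features(edit_text):
    """Extract simple lexical features from the inserted text of an edit."""
    return {
        "has_vulgarity": bool(VULGARITY.search(edit_text)),
        "has_shouting": bool(SHOUTING.search(edit_text)),
        "has_repeated_chars": bool(REPEATED_CHARS.search(edit_text)),
        "upper_ratio": sum(c.isupper() for c in edit_text) / max(len(edit_text), 1),
    }

good = lexical_vandalism_features("Franklin was born in Boston in 1706.")
bad = lexical_vandalism_features("BENJAMIN FRANKLIN IS STUPID loooool")
```

These feature vectors would then feed a classifier; the slide's point is that lexical features alone are weaker than combining them with the syntactic and semantic N-gram evidence.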
Comparison of Vandalism Detection Approaches: Shared Tasks [Potthast et al. ’10]

• Shared vandalism corpus, 32,000 edits
• Best single approach: NLP-based
• Meta-detector performs best: the task requires different classes of features
• CLEF 2010 Competition on Vandalism Detection: best result AUC 0.92, Precision = 100% at Recall = 20%
Finding High Quality Content in SM

As judged by professional editors:
• Well-written
• Interesting
• Relevant (answer)
• Factually correct
• Popular?
• Provocative?
• Useful?

E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, Finding High Quality Content in Social Media, WSDM 2008
How do Question and Answer Quality relate?
Community
Link Analysis for Authority Estimation [Jurzcyk & Agichtein, CIKM 2007]

[Figure: bipartite graph linking users to the questions they ask and the answers they give]

HITS-style scores on the asker/answerer graph:

Hub (asker): $H(i) = \sum_{j=0}^{K} A(j)$

Authority (answerer): $A(j) = \sum_{i=0}^{M} H(i)$
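An iterative HITS computation on a toy asker/answerer graph, in the spirit of the slide's formulation (user and question names are hypothetical, and the normalization scheme is one common choice):

```python
def hits_qna(asker_of, answerers_of, iterations=50):
    """HITS over a Q&A graph: askers act as hubs, answerers as authorities.
    asker_of[q] is the user who asked question q; answerers_of[q] lists
    the users who answered it."""
    users = {asker_of[q] for q in asker_of} | {u for q in answerers_of for u in answerers_of[q]}
    hub = {u: 1.0 for u in users}
    auth = {u: 1.0 for u in users}
    for _ in range(iterations):
        # Authority: sum of hub scores of the askers whose questions you answered.
        new_auth = {u: 0.0 for u in users}
        for q, answerers in answerers_of.items():
            for u in answerers:
                new_auth[u] += hub[asker_of[q]]
        # Hub: sum of authority scores of the users who answered your questions.
        new_hub = {u: 0.0 for u in users}
        for q, answerers in answerers_of.items():
            for u in answerers:
                new_hub[asker_of[q]] += new_auth[u]
        # Normalize so the scores stay bounded across iterations.
        za = sum(new_auth.values()) or 1.0
        zh = sum(new_hub.values()) or 1.0
        auth = {u: s / za for u, s in new_auth.items()}
        hub = {u: s / zh for u, s in new_hub.items()}
    return hub, auth

# Toy graph: u3 answers every question, u4 answers only one.
asker_of = {"q1": "u1", "q2": "u1", "q3": "u2"}
answerers_of = {"q1": ["u3"], "q2": ["u3", "u4"], "q3": ["u3"]}
hub, auth = hits_qna(asker_of, answerers_of)
```

As expected, the user who answers the most questions from active askers accumulates the highest authority score.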
Qualitative Observations
HITS effective
HITS ineffective
Random forest classifier
Result 1: Identifying High Quality Questions
Top Features for Question Classification
• Asker popularity (“stars”)
• Punctuation density
• Topical category
• Page views
• KL Divergence from reference corpus LM
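The last feature compares a question's unigram language model against a reference corpus LM via KL divergence; a smoothed sketch (the vocabulary, texts, and smoothing constant are invented for illustration):

```python
import math
from collections import Counter

def unigram_lm(text, vocab, alpha=0.1):
    """Add-alpha smoothed unigram language model over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# A tiny stand-in for the reference corpus.
reference = "how do i install java on my computer please help"
vocab = set(reference.split())
ref_lm = unigram_lm(reference, vocab)
typical = unigram_lm("how do i install java", vocab)
atypical = unigram_lm("please please please help help help", vocab)
```

A question whose word distribution diverges sharply from the reference corpus (e.g., heavy repetition) gets a high KL score, which is a useful quality signal.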
Identifying High Quality Answers
Top Features for Answer Classification

• Answer length
• Community ratings
• Answerer reputation
• Word overlap
• Kincaid readability score
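The Flesch-Kincaid grade level can be computed as 0.39·(words/sentence) + 11.8·(syllables/word) − 15.59; the syllable counter below is a rough vowel-group heuristic, so treat the result as approximate:

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)

def kincaid_grade(text):
    """Approximate Flesch-Kincaid grade level of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / max(len(sentences), 1)
            + 11.8 * syllables / max(len(words), 1)
            - 15.59)

simple = kincaid_grade("The cat sat. It was fat.")
complex_ = kincaid_grade(
    "Epistemological considerations notwithstanding, the ramifications are substantial.")
```

Short sentences of short words score far lower (easier) than long, polysyllabic prose, which is why the score is a cheap proxy for answer readability.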
CGC as a corpus

• Better statistics on a larger or cleaner text collection
  – Compute IDF on the entire Wikipedia corpus
    • Beware of different style / topic distribution
  – Cf. Web as a corpus (Mihalcea ’04, Gonzalo et al. ’03)
• Web corpus statistics, such as counts of search results
  – Cf. using very large corpora in NLP (Banko and Brill ’01a, ’01b)
CGC as a corpus (cont’d)
• Extend existing knowledge repositories (e.g., WordNet)
• Create new datasets, labeled by construction
– WSD datasets using OpenMind – Chklovski & Mihalcea ‘02, ‘03
– WSD datasets based on Wikipedia – Mihalcea’07; Bunescu & Pasca ‘06; Cucerzan ’07
– Text categorization datasets based on the ODP (Davidov et al. ‘04)
• Study the authoring process by observing multiple revisions of a document
CGC as corpus / Extending WordNet
Using corpus statistics for term weighting

• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. WSDM 2010
  – Periodic crawls (starting from an unknown point in the document’s life) vs. formally tracked revision history
  – LM-specific vs. a general term frequency model, which can be plugged into any retrieval model
  – Aji et al. ’10 also model bursts: “significant” events in the document’s life
• M. Efron. Linear time series models for term weighting in information retrieval. JASIST 2010
  – Temporal behavior of terms as the entire collection changes over time

CGC as corpus / Document revisions
Extending term weighting models with Revision History Analysis (RHA) [Aji et al. ’10]

• RHA captures term importance from the document authoring process
• Natural integration with state-of-the-art retrieval models (BM25, LM)
• Consistent improvement over existing retrieval models
• Beyond Wikipedia:
  – Tracking changes in MS Word documents; version control
  – Past snapshots of Web pages (Elsas & Dumais ’10 constructed a new corpus)
Observable document generation process. Example: revisions of “Topology” in Wikipedia

[Figure: screenshots of the 1st, 250th, and current revisions]
Revision History Analysis redefines Term Frequency (TF)

• TF is a key indicator of document relevance
• TF can be naturally integrated into ranking models

BM25:
$$S(Q,D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF(t,D) \cdot (k_1+1)}{TF(t,D) + k_1\left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

Language Model (KL-divergence ranking):
$$S(Q,D) = D(Q \,\|\, D) = \sum_{t \in V} P(t|Q) \cdot \log \frac{P(t|Q)}{P(t|D)}$$
RHA global model

Define the term frequency based on the whole document generation process:
– a document grows steadily over time
– a term is relatively important if it appears in the early revisions

$$TF_{global}(t,d) = \sum_{j=1}^{n} \frac{c(t, v_j)}{j^{\alpha}}$$

where $c(t, v_j)$ is the frequency of term $t$ in revision $v_j$, $j^{\alpha}$ is the decay factor, and the sum iterates over the revisions.
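The global model can be sketched directly from this formula (the revision texts below are invented, and whitespace tokenization stands in for real term extraction):

```python
def tf_global(term, revisions, alpha=1.0):
    """RHA global TF: sum the term's count in each revision v_j,
    decayed by j**alpha, so early-revision terms weigh more."""
    return sum(rev.lower().split().count(term) / (j ** alpha)
               for j, rev in enumerate(revisions, start=1))

# Hypothetical revision history: "topology" is present from the start,
# "trivia" only appears in the latest revision.
revisions = [
    "topology is a branch of mathematics",
    "topology is a major branch of mathematics",
    "topology is a major branch of mathematics . trivia : ...",
]
early = tf_global("topology", revisions)   # 1/1 + 1/2 + 1/3
late = tf_global("trivia", revisions)      # 0 + 0 + 1/3
```

A term present since the first revision accumulates weight from every (lightly decayed) term in the sum, while a late addition only gets the heavily decayed tail.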
RHA global model: example

[Figure: decay factor and decayed term importance across revisions]
But some pages are different: “Avatar (2009 film)” has a very bursty editing process

[Figure: screenshots of the 1st, 500th, and current revisions]
RHA burst model

• A burst resets the decay clock for a term
• The weight will decrease after a burst

$$TF_{burst}(t,d) = \sum_{j=1}^{m} \sum_{v_k \in \text{burst}_j} \frac{c(t, v_k)}{(k - b_j + 1)^{\beta}}$$

where $c(t, v_k)$ is the frequency of term $t$ in revision $v_k$, $b_j$ is the index of the first revision of the $j$-th burst, and $(k - b_j + 1)^{\beta}$ is the decay factor for the $j$-th burst; the outer sum iterates over bursts and the inner sum over the revisions within each burst.

Burst detection:
– Content-based (relative content change)
– Activity-based (intensive edit activity)
– Both models combined
RHA burst model: example

[Figure: decay factor and term importance across bursty revisions]
Putting it all together: RHA Term Frequency

Combining the global model with the burst model and the original TF:

$$TF_{rha}(t,D) = \lambda_1 \cdot TF_{g}(t,D) + \lambda_2 \cdot TF_{b}(t,D) + \lambda_3 \cdot TF(t,D), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1$$
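The burst and combined models can be sketched as follows. This is a simplified reading: burst boundaries are given as input rather than detected, the plain TF is taken as the count in the current revision, and the lambda weights are illustrative, not tuned values from the paper:

```python
def tf_burst(term, revisions, burst_starts, beta=1.0):
    """RHA burst TF: each burst restarts the decay clock. burst_starts
    lists the 1-based revision index b_j where each burst begins."""
    total = 0.0
    bounds = list(burst_starts) + [len(revisions) + 1]
    for j in range(len(burst_starts)):
        for k in range(bounds[j], bounds[j + 1]):   # revisions within burst j
            count = revisions[k - 1].lower().split().count(term)
            total += count / ((k - bounds[j] + 1) ** beta)
    return total

def tf_rha(term, revisions, burst_starts, lambdas=(0.4, 0.4, 0.2)):
    """Combined RHA TF: weighted mix of global, burst, and plain TF."""
    l1, l2, l3 = lambdas
    tf_g = sum(rev.lower().split().count(term) / j
               for j, rev in enumerate(revisions, start=1))
    tf_plain = revisions[-1].lower().split().count(term)  # TF in current revision
    return l1 * tf_g + l2 * tf_burst(term, revisions, burst_starts) + l3 * tf_plain

revisions = ["avatar film", "avatar film avatar", "avatar box office record"]
burst_tf = tf_burst("avatar", revisions, burst_starts=[1, 3])
score = tf_rha("avatar", revisions, burst_starts=[1, 3])
```

Note how the revision at the start of the second burst (revision 3) contributes with a fresh decay of 1 instead of the global model's 1/3, which is exactly the "burst resets the decay clock" behavior.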
Integrating RHA into retrieval models

BM25 + RHA: replace $TF(t,D)$ with $TF_{rha}(t,D)$:

$$S(Q,D) = \sum_{t \in Q} IDF(t) \cdot \frac{TF_{rha}(t,D) \cdot (k_1+1)}{TF_{rha}(t,D) + k_1\left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

Statistical Language Models + RHA: replace $P(t|D)$ with the RHA term probability $P_{rha}(t|D)$:

$$S(Q,D) = D(Q \,\|\, D) = \sum_{t \in V} P(t|Q) \cdot \log \frac{P(t|Q)}{P_{rha}(t|D)}$$

Performance improvements: +2-7% (INEX), +1-4% (TREC)
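Plugging an RHA term frequency into BM25 amounts to swapping the tf values passed to the scorer; a sketch with made-up document statistics and frequencies:

```python
def bm25_score(query_terms, tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """BM25 where tf maps term -> term frequency; pass RHA-adjusted
    frequencies to get the RHA-integrated variant."""
    score = 0.0
    for t in query_terms:
        f = tf.get(t, 0.0)
        score += idf.get(t, 0.0) * (f * (k1 + 1)) / (
            f + k1 * (1 - b + b * doc_len / avgdl))
    return score

# Hypothetical statistics for one document and a two-term query.
idf = {"topology": 2.0, "mathematics": 1.5}
plain_tf = {"topology": 3.0, "mathematics": 1.0}
rha_tf = {"topology": 4.1, "mathematics": 1.2}  # boosted: present since early revisions
query = ["topology", "mathematics"]
baseline = bm25_score(query, plain_tf, doc_len=100, avgdl=120, idf=idf)
with_rha = bm25_score(query, rha_tf, doc_len=100, avgdl=120, idf=idf)
```

Because BM25 is monotone in the term frequency, terms that RHA considers historically central raise the document's score relative to the plain-TF baseline.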
Overview of Part 3

Use CGC as an additional (huge!) corpus:
• Better term statistics
• Observing the authoring process

CGC as repositories of world knowledge:
• Distilling knowledge from CGC structure and content
• Concepts as features (Bag of Words + Concepts)
  – Semantic relatedness, and later IR
• Concepts as word senses
  – WSD
CGC as repositories of senses and concepts

• Articles / concepts as representation features
  – From the bag of words to the bag of words + concepts
  – Semantic relatedness of words (and texts)
  – Later: concept-based IR
• Articles / concepts as word senses
  – Word sense disambiguation

CGC as knowledge repositories
Semantic relatedness of words and texts

• How related are:
  – cat / mouse
  – Augean stables / golden apples of the Hesperides
  – 慕田峪 (Mutianyu) / 司马台 (Simatai)
• Used in many applications:
  – Information retrieval
  – Word-sense disambiguation
  – Error correction (“Web site” or “Web sight”)

CGC as knowledge / Semantic relatedness
Evaluation criterion: correlation with human judgments

| Word pair | Human score [0 .. 10] |
| --- | --- |
| journey, voyage | 9.29 |
| Jerusalem, Israel | 8.46 |
| money, property | 7.57 |
| football, tennis | 6.63 |
| alcohol, chemistry | 5.54 |
| coast, hill | 4.38 |
| professor, cucumber | 0.23 |
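Such evaluations typically report rank correlation between system scores and the human judgments; a small Spearman sketch over these word pairs (the "system" relatedness scores are invented for illustration):

```python
def spearman(xs, ys):
    """Spearman rank correlation (no ties assumed): the classic
    1 - 6 * sum(d^2) / (n * (n^2 - 1)) formula over rank differences."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

human = [9.29, 8.46, 7.57, 6.63, 5.54, 4.38, 0.23]   # human judgments for the 7 pairs
system = [0.91, 0.88, 0.70, 0.75, 0.52, 0.40, 0.05]  # hypothetical system scores
rho = spearman(human, system)
```

Here the system swaps the ranks of one adjacent pair, so the correlation is high but below 1; a perfect ordering would give exactly 1.0.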
Previous approaches
• WordNet-based measures
– Using graph structure (path length, random walks), information content, glosses
– Comprehensive review: Budanitsky & Hirst ‘06
– Patwardhan et al. ‘03 – using semantic relatedness measures for WSD
• Jiang-Conrath (WN structure + info content) performs on par with Adapted Lesk (gloss overlap)
• All other WN-based measures are inferior to Adapted Lesk
• Co-occurrence based measures
– Latent Semantic Analysis (Deerwester et al. ‘90)
The role of background knowledge

When humans judge semantic relatedness, they use massive amounts of background knowledge. Can we endow computers with such a capability for knowledge-based reasoning?
Using Wikipedia for computing semantic relatedness
• Structure (links)
– Strube & Ponzetto ’06 (WikiRelate)
– Milne & Witten ’08 (WLM)
• Content (text)
– Gabrilovich & Markovitch ’07 (ESA)
• Structure + content
– Yeh et al. ’09 (WikiWalk)
Explicit Semantic Analysis (ESA) [Gabrilovich & Markovitch '06–'09]
• Wikipedia-based semantic representation
• Some ESA applications:
– Computing semantic relatedness of words and texts
– Text categorization
– Information retrieval
• Cross-lingual retrieval utilizing cross-language links in Wikipedia
The AI Dream
Problem → Problem solver → Solution
CGC concepts as features
The circular dependency problem
Problem → Problem solver → Solution
(but the problem solver requires Natural Language Understanding)
Circumventing the need for deep Natural Language Understanding
Problem → Problem solver → Solution
(Wikipedia-based semantics + statistical text processing stand in for deep Natural Language Understanding)
Every Wikipedia article represents a concept (e.g., Leopard).
Wikipedia can be viewed as an ontology: a collection of concepts (e.g., Leopard, Isaac Newton, Ocean).
The semantics of a word is a vector of its associations with Wikipedia concepts.
(Cf. Bengio's talk on learning text and image representations.)
Word representation: 2D example. The semantics of a word is a point in the multi-dimensional space defined by the set of Wikipedia concepts (here, one axis per Wikipedia concept A and B, each on a [0.0, 1.0] scale).
Text representation: 2D example. The semantics of a text fragment is the centroid of the vectors of its individual words (here, words 1–4 plotted in the space spanned by Wikipedia concepts A and B, each axis on a [0.0, 1.0] scale).
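The centroid construction can be sketched as follows. The word-to-concept weights below are invented for illustration; a real ESA index derives them from TF-IDF scores of each word over the Wikipedia article corpus.

```python
# Sketch of ESA-style text representation: a text is the centroid of
# the sparse concept vectors of its words. Weights here are made up.
from collections import defaultdict

# word -> sparse vector over Wikipedia concepts (concept -> weight)
WORD_VECTORS = {
    "cat":   {"Cat": 0.95, "Panthera": 0.92, "Tom & Jerry": 0.67},
    "claws": {"Cat": 0.61, "Panthera": 0.40},
}

def text_vector(words):
    """Represent a text fragment as the centroid (average) of the
    sparse concept vectors of its individual words."""
    centroid = defaultdict(float)
    for w in words:
        for concept, weight in WORD_VECTORS.get(w, {}).items():
            centroid[concept] += weight / len(words)
    return dict(centroid)

print(text_vector(["cat", "claws"]))
```

Sparse dictionaries keep this cheap even though the concept space has millions of dimensions: only concepts actually associated with some word in the text appear in the centroid.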
Example: concepts evoked by the word "Cat" include Leopard, Mouse, Veterinarian, Pets, Tennis ball, Elephant, and House.
Concept and word representation in Explicit Semantic Analysis [Gabrilovich & Markovitch '09]
Each concept is represented by the words most strongly associated with its article:
– Panthera: Cat [0.92], Leopard [0.84], Roar [0.77]
– Cat: Cat [0.95], Felis [0.74], Claws [0.61]
– Tom & Jerry: Animation [0.85], Mouse [0.82], Cat [0.67]
Conversely, each word is represented by the concepts it is associated with, e.g. the word "Cat" → Cat [0.95], Panthera [0.92], Tom & Jerry [0.67].
Computing semantic relatedness as the cosine between two concept vectors:
– "Cat" → Cat [0.95], Panthera [0.92], Tom & Jerry [0.67]
– "Leopard" → Leopard [0.84], Mac OS X v10.5 [0.61], Cat [0.52]
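The two example vectors on this slide (reconstructed from the slide fragments; the second appears to belong to the word "Leopard", whose associated concepts include both the animal and Mac OS X v10.5) can be plugged into the cosine directly:

```python
# Cosine similarity between sparse ESA concept vectors, using the
# two example vectors from the slide.
import math

cat = {"Cat": 0.95, "Panthera": 0.92, "Tom & Jerry": 0.67}
leopard = {"Leopard": 0.84, "Mac OS X v10.5": 0.61, "Cat": 0.52}

def cosine(u, v):
    # Dot product over the shared concept dimensions only.
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

print(round(cosine(cat, leopard), 3))  # → 0.287
```

Only the shared "Cat" dimension contributes to the dot product here; the many non-overlapping concepts lower the score through the norms, which is exactly the behavior that separates related from unrelated words.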
Experimental results (individual words)

| Algorithm | Correlation with human judgments |
|---|---|
| WordNet-based | 0.33–0.35 |
| Roget's Thesaurus | 0.55 |
| LSA | 0.56 |
| WikiRelate | 0.49 |
| WLM | 0.69 |
| ESA (ODP) | 0.65 |
| ESA (Wikipedia) | 0.75 |
| TSA [Radinsky et al. '11] | 0.80 |
| EWC (ESA/Wikipedia + WordNet + collocations) [Haralambous & Klyuev, IJCNLP '11] | 0.79–0.87 |
Using Wikipedia for Named Entity Disambiguation [Bunescu & Pasca '06]
• Both the method and the dataset are based on Wikipedia
• Dataset: 1.7M labeled examples
– ''The [[Vatican City | Vatican]] is now an independent enclave ...'' (the geographical location, not the Holy See)
• The basic approach: compute the cosine similarity between the word context and the target articles
CGC for WSD
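A minimal sketch of this basic approach: rank candidate Wikipedia senses by the cosine similarity between the mention's context and each candidate article's text. The two article snippets below are invented stand-ins, not real Wikipedia text.

```python
# Rank candidate senses of a mention by cosine(context, article text).
# Article snippets are invented for illustration.
import math
from collections import Counter

ARTICLES = {
    "Vatican City": "independent city state enclave rome pope territory",
    "Holy See": "episcopal jurisdiction pope catholic church government",
}

def bow(text):
    """Bag-of-words term counts (a real system would use TF-IDF)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(c * v[t] for t, c in u.items() if t in v)
    return dot / (math.sqrt(sum(c * c for c in u.values())) *
                  math.sqrt(sum(c * c for c in v.values())))

def disambiguate(context, candidates):
    ctx = bow(context)
    return max(candidates, key=lambda name: cosine(ctx, bow(ARTICLES[name])))

print(disambiguate("The Vatican is now an independent enclave in Rome",
                   ["Vatican City", "Holy See"]))  # → Vatican City
```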
Problems with the basic approach
1. Short articles and stubs
2. Synonymy (lexical mismatch between the word context and the article)
Solution: use categories for generalization
Word-Category Correlations
"John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood."
Which sense: John Williams (composer) or John Williams (wrestler)?
Category chains for the two candidates:
– Film score composers → Composers → Musicians
– Professional wrestlers → Wrestlers → People known in connection with sports and hobbies
(both roll up to People by occupation)
When a Wikipedia article is too short, get additional context from its categories.
The full solution [Bunescu & Pasca ’06]
• ML-based approach (SVM) to rank candidate senses
• Features:
– Cosine(word context, article text)
– Word-category affinities (binary features indicating if a word has occurred in any article of a given category)
• Dimensionality (|V| x |C|) can be reduced by thresholding word frequencies
• Accuracy improvement vs. cosine-only: 3%-25.5%
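The two feature families feeding the SVM ranker can be sketched as follows. The word-to-category data is invented for illustration, and the per-(word, category) binary features are collapsed to one indicator per category for brevity.

```python
# Sketch of the feature vector for one candidate sense: a cosine score
# plus binary word-category affinity features (did a context word ever
# occur in any article of category C?). Data is invented.

# word -> set of categories whose articles contain that word
WORD_CATEGORIES = {
    "conducted": {"Film score composers", "Composers"},
    "concert":   {"Composers", "Musicians"},
    "ring":      {"Professional wrestlers"},
}
CATEGORIES = ["Film score composers", "Composers", "Musicians",
              "Professional wrestlers"]

def features(context_words, cosine_score):
    # The full model has one binary feature per (word, category) pair
    # (dimensionality |V| x |C|); here we collapse to one per category.
    indicators = [
        1.0 if any(cat in WORD_CATEGORIES.get(w, set())
                   for w in context_words) else 0.0
        for cat in CATEGORIES
    ]
    return [cosine_score] + indicators

print(features(["conducted", "concert"], 0.42))
# → [0.42, 1.0, 1.0, 1.0, 0.0]
```

Thresholding word frequencies, as the slide notes, prunes the |V| x |C| space before such features are materialized.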
Results [Bunescu & Pasca '06]
• The classification model can be trained on Wikipedia, then applied to any input text
– Senses can only be resolved to those defined in Wikipedia
• Another idea to overcome the lexical mismatch: augment Wikipedia articles with contexts of corresponding word senses
– Cf. anchor text aggregation [Metzler et al. '09] and [Cucerzan '07]
(Slide figure: Taxonomy Kernel)
CGC for WSD
An alternative approach to named entity disambiguation [Cucerzan '07]
• Applicable to any text, not solely text from Wikipedia
– Sense inventory still limited to Wikipedia
• The system performs both NE identification and disambiguation
• World knowledge used:
– Known entities (Wikipedia articles)
– Their class (Person, Location, Organization)
– Known surface forms (redirect pages + anchor text)
– Words that describe or co-occur with the entity
– Categories, lists, enumerations, tables
Restricted document representation: aggregation of features of all identified surface forms, for all entity names
(Slide figure: entities that can be referred to as "Columbia")
Key idea: joint disambiguation
• Maximize the agreement
– between the entity vectors and the document, and
– between the categories of the entities
• Example: {Columbia, Discovery}
– Space Shuttle Columbia & Space Shuttle Discovery
– Columbia Pictures & Discovery Investigations
– Shared categories/lists: LIST_astronomical_topics, CAT_manned_spacecraft, CAT_space_shuttles
• If the "one sense per discourse" assumption is violated, shrink the context iteratively
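A toy sketch of the joint step: enumerate sense assignments for all surface forms and pick the one maximizing pairwise category agreement. The full method also scores each entity's agreement with the document vector, omitted here; the category sets are invented for illustration.

```python
# Toy joint disambiguation: choose the sense assignment with maximal
# pairwise category overlap between the chosen entities.
# (The real system also maximizes agreement with the document vector.)
from itertools import product

CANDIDATES = {
    "Columbia":  ["Space Shuttle Columbia", "Columbia Pictures"],
    "Discovery": ["Space Shuttle Discovery", "Discovery Investigations"],
}
CATEGORIES = {  # invented category sets for illustration
    "Space Shuttle Columbia":   {"CAT_manned_spacecraft", "CAT_space_shuttles"},
    "Space Shuttle Discovery":  {"CAT_manned_spacecraft", "CAT_space_shuttles"},
    "Columbia Pictures":        {"CAT_film_studios"},
    "Discovery Investigations": {"CAT_detective_agencies"},
}

def agreement(assignment):
    # Pairwise category overlap between the chosen entities.
    score = 0
    for i, a in enumerate(assignment):
        for b in assignment[i + 1:]:
            score += len(CATEGORIES[a] & CATEGORIES[b])
    return score

def joint_disambiguate(surface_forms):
    options = [CANDIDATES[s] for s in surface_forms]
    return max(product(*options), key=agreement)

print(joint_disambiguate(["Columbia", "Discovery"]))
```

Exhaustive enumeration is exponential in the number of mentions; the iterative context-shrinking mentioned above is one way the original system keeps this tractable.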
Results [Cucerzan '07]
• Baseline: most frequent Wikipedia sense
• Wikipedia dataset (labeled by construction): 86.2% → 88.3%
• MSNBC news stories (human labeling): 51.7% → 91.4%
– Surface forms common in news (e.g., "Blackberry", "China") are highly ambiguous
– Popularity in Wikipedia ≠ popularity in news
CGC for WSD
Case Study: Searching Tweets
3/28/2016 CS 572: Information Retrieval. Spring 2016 82
Twitter Statistics (2014)
• 284 million monthly active users
• 500 million tweets are sent per day
• More than 300 billion tweets have been sent since company founding in 2006
• Tweets-per-second record: one-second peak of 143,199 TPS.
• More than 2 billion search queries per day.
Twitter Search Architecture
Active Index Segments: Write Friendly
Twitter Lucene Extensions
Twitter Lucene Extensions (cont’d)
Lucene Extensions: Storage
In-Memory Real Time Index
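The slide's figure was not captured; as a rough sketch in the spirit of Twitter's Earlybird design (Busch et al., ICDE 2012, linked in the Resources slide), postings are appended in arrival order and queries walk each list newest-first so the most recent tweets are scored first. Greatly simplified: no posting blocks, no concurrency, no positions.

```python
# Sketch of a write-friendly in-memory real-time index: append-only
# posting lists, doc ids assigned in arrival order, newest-first reads.
from collections import defaultdict

class RealTimeIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # term -> doc ids, oldest first
        self.next_doc_id = 0

    def add(self, text):
        doc_id = self.next_doc_id  # monotonically increasing = time order
        self.next_doc_id += 1
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)  # O(1) append per term
        return doc_id

    def search(self, term, limit=10):
        # Newest-first traversal: walk the append-only list backwards.
        return list(reversed(self.postings[term.lower()]))[:limit]

idx = RealTimeIndex()
for tweet in ["star wars concert", "boston pops concert", "red sox game"]:
    idx.add(tweet)
print(idx.search("concert"))  # → [1, 0]
```

Because writes only ever append, a single writer can proceed without locking readers, which is what makes the active segments "write friendly".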
Revisiting an Old Idea: Probabilistic Skip Lists
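The slide's diagram was not captured; below is a minimal sketch of the classic probabilistic skip list (Pugh's structure) the title refers to. Each inserted key is promoted to higher levels with probability 1/2, giving expected O(log n) search, and the same skip-pointer idea lets a query jump over long posting lists.

```python
# Minimal probabilistic skip list: insert + membership test.
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # one pointer per level

class SkipList:
    MAX_LEVEL = 8
    P = 0.5  # probability of promoting a node one level up

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        lvl = 0
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [None] * (self.MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):  # descend, recording turns
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        if lvl > self.level:
            for i in range(self.level + 1, lvl + 1):
                update[i] = self.head
            self.level = lvl
        new = Node(key, lvl)
        for i in range(lvl + 1):  # splice in at every level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

sl = SkipList()
for k in [3, 7, 9, 12, 19, 21]:
    sl.insert(k)
print(sl.contains(9), sl.contains(10))  # → True False
```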
Top Entities (cont’d)
Resources
• CGC tutorial at WSDM 2012: http://www.mathcs.emory.edu/~eugene/wsdm2012-cgc-tutorial/
• Real-time Search at Twitter:
– https://www.youtube.com/watch?v=cbksU0qkFm0
– http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf
– http://www.slideshare.net/lucidworks/search-at-twitter-presented-by-michael-busch-twitter