COE Quarterly Technical Exchange, June 10th 2008 1
Using MapReduce for Scalable Coreference Resolution
Tamer Elsayed, Doug Oard, Jimmy Lin, Asad Sayeed and Tan Xu
HLT COE and UMIACS Laboratory for Computational Linguistics and Information Processing
COE ACE System
[Figure: English pipeline: Within-Doc Coref., Pairs Filtering, Feature Generation, Clustering. Arabic pipeline: Within-Doc Coref., Feature Generation, Clustering. Additional labels: Context Features, Conversational Genre Features.]
Roadmap
1. Context Features
   - Pairwise similarity
   - Efficiency vs. effectiveness
   - Generating features for ACE
2. Conversational-genre Features
   - New generative model
   - Joint resolution
   - Evaluation using ACE-Usenet
Context Features
Close friends and colleagues of Cheney -- including former Gen. Brent Scowcroft, who was national security adviser when Cheney was Gerald Ford's chief of staff and George H. W. Bush's defense secretary -- have been famously quoted they just don't recognize the Cheney they served along side and the Cheney of today who repeatedly made false assertions about the Iraq war and weapons of mass destruction.
Now, an article in Vanity Fair Magazine by Todd S. Purdum has published a number of strikingly similar assessments from Clinton's former confidants -- plus medically authoritative guesswork speculating about how health problems of the sort Clinton experienced can change a person.
But we avoid that trash talk to focus only on the real, striking changes in the public performances of Bill Clinton and Dick Cheney today. Compared to the way they were, back when they were greatly admired by those who knew them best, back in the day.
Once, Clinton and Cheney were considered consummate political performers. Now they utter gaffes and commit blunders. And they leave the lasting impression that they just don't care about what you think about it.
Once, they were smart and savvy strategic forces that always seemed to boost the political fortunes of their team (Clinton with sterling public performances; Cheney with rock-steady behind-the-scenes guidance). Now they have become liabilities to their causes, grand grist for late-night monologues, caricatures on "Saturday Night Live."
It barely seems credible now but there was a time when it seemed the Democratic nomination was Hillary Clinton's for the taking. The air of certainty in January was convincing when Clinton declared from a sofa at her Washington home: "I'm in and I'm in to win." Two Democratic senators and two former governors swiftly pulled out rather than get between Clinton and the White House. Then along came Barack Obama and the aura of inevitability that was crucial to Clinton's strategy vanished.
"The Clinton campaign was meant to be shock and awe: big events in big states, sweep the board on Super Tuesday, overwhelm the less well-known competitors," said Chip Smith, who was deputy campaign manager for Al Gore in 2000. "Unfortunately, Obama uprooted that strategy. Inevitability isn't a viable strategy against a well-funded candidate with a powerful message." It is unclear whether there was anything Clinton could have done to stop a gifted politician such as Obama, once his early win in Iowa and prodigious fundraising ability established that he really did have a chance of winning the Democratic nomination.
Clinton also may have destroyed any chance of a comeback after being caught out in her fib about coming under sniper fire while in Bosnia in the 1990s. The lie crystallised voter unease with Clinton, and held back chances of a grand comeback in Pennsylvania. In April, a Washington Post/ABC News poll found that 61% of American voters considered her dishonest and untrustworthy.
Abstract Problem
[Figure: a set of documents mapped to a matrix of pairwise similarity scores (sample values: 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74)]
Goal: Scalable Pairwise Similarity
~10K docs → ~50 million doc pairs
~140K entities → ~10 billion entity pairs
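The pair counts above follow from the number of unordered pairs among n items, n(n-1)/2. A quick check, using only the approximate collection sizes quoted on the slide (the function name is just for illustration):

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


docs = 10_000       # "~10K docs" from the slide
entities = 140_000  # "~140K entities" from the slide

print(f"{unordered_pairs(docs):,} document pairs")
print(f"{unordered_pairs(entities):,} entity pairs")
```

Both results round to the figures on the slide: roughly 50 million document pairs and roughly 10 billion entity pairs.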
Solutions
Trivial
  $sim(d_i, d_j) = \sum_{t \in V} w_{t,d_i} \cdot w_{t,d_j}$
  Loads each vector O(N) times
  Loads each term t O(df_t^2) times
Better
  Each term contributes only if it appears in $d_i \cap d_j$
  $sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i} \cdot w_{t,d_j}$
  $sim(d_i, d_j) = \sum_{t \in d_i \cap d_j} term\_contrib(t, d_i, d_j)$
  Loads each term (with posting list) once
  Each term contributes O(df_t^2)
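To make the contrast concrete, here is a minimal sketch of both solutions on a made-up three-document collection. The weights, document ids, and function names are illustrative assumptions, not values from the talk. The first function sums over the whole vocabulary for every pair; the second walks each term's posting list once and accumulates term contributions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy term weights w[doc][term]; the values are made up.
weights = {
    "d1": {"a": 2.0, "b": 1.0},
    "d2": {"a": 1.0, "c": 3.0},
    "d3": {"a": 1.0, "b": 1.0, "c": 1.0},
}


def sim_trivial(w):
    """Trivial: for every document pair, sum over the whole vocabulary V."""
    vocab = {t for vec in w.values() for t in vec}
    return {
        (di, dj): sum(w[di].get(t, 0.0) * w[dj].get(t, 0.0) for t in vocab)
        for di, dj in combinations(sorted(w), 2)
    }


def sim_postings(w):
    """Better: load each term's posting list once and add up its contributions."""
    postings = defaultdict(list)
    for d in sorted(w):
        for t, wt in w[d].items():
            postings[t].append((d, wt))
    sims = defaultdict(float)
    for plist in postings.values():
        for (di, wi), (dj, wj) in combinations(plist, 2):
            sims[(di, dj)] += wi * wj  # term_contrib(t, d_i, d_j)
    return dict(sims)


print(sim_trivial(weights))
print(sim_postings(weights))
```

The two functions agree on this toy input. In general the postings-based version only emits pairs that share at least one term, so pairs with zero similarity are simply absent from its output.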
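Indexing (3-doc toy collection)
[Figure: standard IR indexing of three toy documents ("Clinton Obama Clinton", "Clinton Cheney", "Clinton Barack Obama") into posting lists for the terms Clinton, Barack, Cheney, and Obama, with term frequencies 2, 1, 1, 1, 1, 1, 1]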
Pairwise Similarity
(a) Generate pairs (b) Group pairs (c) Sum pairs
[Figure: posting lists for Clinton, Barack, Cheney, and Obama; products of term frequencies are generated per document pair (2, 2, 1, 1), grouped by pair, and summed (2, 3, 1)]
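The three phases can be sketched on the toy collection as follows. The document ids d1, d2, d3 are assumed labels for "Clinton Obama Clinton", "Clinton Cheney", and "Clinton Barack Obama", and raw term frequencies stand in for term weights, which is what the numbers in the figure suggest.

```python
from collections import defaultdict
from itertools import combinations

# Posting lists for the toy collection; doc ids d1..d3 are assumed labels.
postings = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Barack": [("d3", 1)],
    "Cheney": [("d2", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}

# (a) Generate pairs: every two postings of the same term yield one product.
generated = [
    ((di, dj), tfi * tfj)
    for plist in postings.values()
    for (di, tfi), (dj, tfj) in combinations(plist, 2)
]

# (b) Group pairs: collect the products that belong to the same document pair.
grouped = defaultdict(list)
for pair, product in generated:
    grouped[pair].append(product)

# (c) Sum pairs: the similarity of a pair is the sum of its products.
similarity = {pair: sum(products) for pair, products in grouped.items()}
print(similarity)
```

Terms that occur in a single document (Barack, Cheney) generate no pairs at all, which is why only shared terms cost anything.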
Pairwise Similarity (abstract)
(a) Generate pairs (b) Group pairs (c) Sum pairs
[Figure: term postings → multiply → grouping → sum → similarity]
MapReduce!
(a) Map (b) Shuffle (c) Reduce
[Figure: input → map → shuffling (group values by keys) → reduce → output]
And indexing .. of course!
(a) Map (b) Shuffle (c) Reduce
[Figure: doc → tokenize → shuffling (group values by keys) → combine → posting list]
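As a rough illustration of indexing as a map/shuffle/reduce job, the sketch below simulates the three phases in memory. `run_mapreduce` is a hypothetical stand-in written for this example, not Hadoop's API, and the document ids are assumed labels for the toy collection.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce job (not Hadoop's API)."""
    shuffled = defaultdict(list)
    for record in records:                # (a) map each input record
        for key, value in mapper(record):
            shuffled[key].append(value)   # (b) shuffle: group values by key
    # (c) reduce each key's group of values
    return {key: reducer(key, values) for key, values in shuffled.items()}


def tokenize(record):
    """Map: emit (term, (doc_id, term_frequency)) for each distinct term."""
    doc_id, text = record
    counts = defaultdict(int)
    for term in text.split():
        counts[term] += 1
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def combine(term, values):
    """Reduce: combine the postings of one term into a sorted posting list."""
    return sorted(values)


docs = [
    ("d1", "Clinton Obama Clinton"),
    ("d2", "Clinton Cheney"),
    ("d3", "Clinton Barack Obama"),
]
index = run_mapreduce(docs, tokenize, combine)
print(index["Clinton"])
```

The same harness would run the pairwise similarity job by swapping in a mapper that emits document pairs from each posting list and a reducer that sums the products.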
Terms: Zipfian Distribution
[Figure: doc freq (df) vs. term rank]
Each term t contributes O(df_t^2) partial results
Very few terms dominate the computations
  most frequent term (“said”): 3%
  most frequent 10 terms: 15%
  most frequent 100 terms: 57%
  most frequent 1000 terms: 95%
~0.1% of total terms (99.9% df-cut)
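The effect of dropping the most frequent terms can be sketched with synthetic data. The document frequencies below are a made-up Zipf-like sequence, not the collection statistics from the talk, and the function names are illustrative. The sketch assumes each term t generates df_t(df_t - 1)/2 intermediate pairs.

```python
def intermediate_pairs(dfs):
    """Total pairs generated if each term t emits df_t * (df_t - 1) / 2 pairs."""
    return sum(df * (df - 1) // 2 for df in dfs)


def apply_df_cut(dfs, drop_fraction):
    """Drop the given fraction of terms with the highest document frequency."""
    ranked = sorted(dfs)
    n_drop = round(len(ranked) * drop_fraction)
    return ranked[: len(ranked) - n_drop]


# Hypothetical Zipf-like document frequencies: df of the rank-r term is about 10000 / r.
dfs = [max(1, 10_000 // rank) for rank in range(1, 1001)]

full = intermediate_pairs(dfs)
cut = intermediate_pairs(apply_df_cut(dfs, 0.01))  # drop the top 1% of terms
print(f"no df-cut: {full:,} pairs")
print(f"after dropping the top 1% of terms: {cut:,} pairs ({cut / full:.1%} of full)")
```

Because the cost per term is quadratic in df, removing a handful of very frequent terms removes most of the intermediate pairs in this synthetic setting.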
Efficiency (disk space)
[Figure: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100)]
8 trillion intermediate pairs
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~ million doc
Efficiency (disk space)
[Figure: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100), with curves for df-cut at 99%, 99.9%, 99.99%, 99.999%, and no df-cut]
8 trillion intermediate pairs
0.5 trillion intermediate pairs
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 Collection, ~ million doc
Effectiveness
Effect of df-cut on effectiveness
Medline04 - 909k abstracts - Ad-hoc retrieval
[Figure: relative P5 (%, 50 to 100) vs. df-cut (%, 99.00 to 100.00)]
Drop 0.1% of terms
“Near-Linear” Growth
Fit on disk
Cost 2% in Effectiveness
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
For more details, check “Pairwise Document Similarity in Large Collections with MapReduce” at ACL 2008 (presented next week!)
In ACE!
~10K docs
  each document is a vector
~140K entities
  each has multiple mentions
  each entity context is a vector
Generated 8 feature matrices (6 English + 2 Arabic)
[Figure: English pipeline: Within-Doc Coref., Pairs Filtering, Feature Generation, Clustering. Arabic pipeline: Within-Doc Coref., Feature Generation, Clustering.]
Roadmap
1. Context Features
   - Pairwise similarity
   - Efficiency vs. effectiveness
   - Generating features for ACE
2. Conversational-genre Features
   - New generative model
   - Joint resolution
   - Evaluation using ACE-Usenet
Identity Resolution in Email
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <kay.mann@enron.com>
To: Mary Adams <mary.adams@enron.com>
Subject: Re: tennis tomorrow!
Did Sue want Scott to join? Looks like the game will be too late for him.
“Sue” → Identity Resolution → Who? (i.e., label with email address)
New Generative Model
1. Choose “person” c to mention: p(c)
2. Choose appropriate “context” X to mention c: p(X | c)
3. Choose a “mention” l: p(l | X, c)
[Figure: example mention “sue” with context “playing tennis”]
Context
Social Context
Local Context
Conversational Context
Topical Context
Single-Mention: 2-Step Solution
[Figure: (1) Identity Modeling, labeled Prior Distribution; (2) Mention Resolution, labeled Evidence and Posterior Distribution]
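One way to read the generative story together with the two-step solution is as a Bayesian update: the posterior over candidate people for an observed mention l in context X is proportional to p(c) · p(X | c) · p(l | X, c). The sketch below is only an illustration under that reading. The candidate addresses, the probabilities, and the function name are all made up, and the talk does not specify how the component distributions are estimated.

```python
def resolve(mention, context, prior, p_context, p_mention):
    """Posterior over candidates c, proportional to p(c) * p(X | c) * p(l | X, c)."""
    scores = {
        c: prior[c]
        * p_context[c].get(context, 0.0)
        * p_mention[c].get((mention, context), 0.0)
        for c in prior
    }
    total = sum(scores.values())
    # Normalize only when at least one candidate has non-zero support.
    return {c: s / total for c, s in scores.items()} if total > 0 else scores


# Hypothetical candidates and made-up probabilities, for illustration only.
prior = {"sue.a@example.com": 0.5, "sue.b@example.com": 0.5}        # p(c)
p_context = {                                                        # p(X | c)
    "sue.a@example.com": {"tennis": 0.8},
    "sue.b@example.com": {"tennis": 0.2},
}
p_mention = {                                                        # p(l | X, c)
    "sue.a@example.com": {("sue", "tennis"): 0.5},
    "sue.b@example.com": {("sue", "tennis"): 0.5},
}

posterior = resolve("sue", "tennis", prior, p_context, p_mention)
print(max(posterior, key=posterior.get))
```

With equal priors and equal mention probabilities, the context term alone decides the ranking in this toy setup.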
Improved Results
Effectiveness Comparison on Enron Collection
[Figure: bar chart of MRR and P@1 (scale 0 to 1) for Heuristic vs. Generative; improvements labeled +8.9% and +8.6%]
For more details, check “Resolving Personal Names in Email using Context Expansion” at ACL 2008 (also presented next week!)
Limitation!
[Figure: mentions “Susan Scott”, “Sue”, “Suebob”, “sjhonson@enron.com”, “Susan”, “Susan Jones”, and “Sue”, connected by social, conversational, and topical context links]
Joint Resolution!
Context-Free Resolution
Joint Resolution
Spread Current Resolution
Combine Context Info
Update Resolution
[Figure: Mention Graph]
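The talk names the three steps but does not give the update rule, so the following is purely an illustrative assumption of what one spread/combine/update round over a mention graph could look like, not the authors' algorithm. Each mention holds a distribution over candidate identities; neighbors send their current distributions along weighted edges; the receiver averages them and mixes the result with its own belief. The mention ids, candidates, edge weights, and `self_weight` are hypothetical.

```python
def joint_resolution_step(beliefs, graph, self_weight=0.5):
    """One illustrative round of spread / combine / update over a mention graph."""
    updated = {}
    for mention, own in beliefs.items():
        # Spread: each neighbor sends its current resolution along the edge.
        incoming = [(w, beliefs[nbr]) for nbr, w in graph.get(mention, {}).items()]
        total_w = sum(w for w, _ in incoming)
        if total_w == 0:
            updated[mention] = dict(own)  # isolated mention: nothing to combine
            continue
        # Combine: weighted average of the neighbors' distributions.
        combined = {
            c: sum(w * b.get(c, 0.0) for w, b in incoming) / total_w for c in own
        }
        # Update: mix own belief with the combined context and renormalize.
        mixed = {c: self_weight * own[c] + (1 - self_weight) * combined[c] for c in own}
        norm = sum(mixed.values())
        updated[mention] = {c: p / norm for c, p in mixed.items()}
    return updated


# Hypothetical two-mention graph: an ambiguous "Sue" linked to a clearer "Susan Scott".
beliefs = {
    "m1": {"A": 0.5, "B": 0.5},   # "Sue": undecided between candidates A and B
    "m2": {"A": 0.9, "B": 0.1},   # "Susan Scott": mostly resolved to A
}
graph = {"m1": {"m2": 1.0}, "m2": {"m1": 1.0}}

after = joint_resolution_step(beliefs, graph)
print(after["m1"])
```

Each mention's update depends only on its neighbors' previous beliefs, which is the property that would let a round be expressed as a map over mentions, a shuffle by receiving mention, and a reduce that combines and updates.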
Joint Resolution
[Figure: Mention Graph processed by map, shuffle, reduce]
MapReduce!
Work in Progress!
Roadmap
Context Features
   - Pairwise similarity
   - Efficiency vs. effectiveness
   - Generating features for ACE
Conversational-genre Features
   - New generative model
   - Joint resolution
   - Evaluation using ACE-Usenet
Email Message
From: Machiavegli <machia@aol.com>
To: Mark <mk@hotmail>
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election
In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.
receiver is email address
Usenet Message
From: Machiavegli <machia@aol.com>
Newsgroup: soc.history.what-if
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election
In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.
newsgroup!
ACE Usenet Document
<DOCID> soc.history.what-if_20350205910 </DOCID>
<POSTER> Machiavegli </POSTER>
<POSTDATE> 29 Jan 2005 22:04:38 GMT </POSTDATE>
<SUBJECT> The 1860 Presidential Election </SUBJECT>
In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.
no email addresses in headers!
Reconstruct from automatically
From: Machiavegli <machia@aol.com>
Newsgroup: soc.history.what-if
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election
In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.
Got the address back!
Handling it as @
From: Machiavegli <machia@aol.com>
To: soc.history.what-if@usenet.com
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election
In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote.
WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won.
This would have delayed secession and the Civil War.
handle group as receiver
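A minimal sketch of the rewriting step, treating the newsgroup as if it were the message's receiver. The function name and the dictionary layout are assumptions for illustration; the "@usenet.com" suffix is the one shown on the slide, and the poster's address is assumed to have been recovered already.

```python
def usenet_to_email_headers(poster_address, newsgroup, date, subject, domain="usenet.com"):
    """Build email-style headers, treating the newsgroup as the receiver."""
    return {
        "From": poster_address,
        "To": f"{newsgroup}@{domain}",  # group name becomes a pseudo email address
        "Date": date,
        "Subject": subject,
    }


headers = usenet_to_email_headers(
    poster_address="Machiavegli <machia@aol.com>",
    newsgroup="soc.history.what-if",
    date="29 Jan 2005 22:04:38 GMT",
    subject="The 1860 Presidential Election",
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Once a post has a sender and a receiver in this form, it can go through the same identity-resolution machinery as an ordinary email message.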
Feature Value: same label
[Figure: mentions “Steph”, “Stephan”, “Stephan”, and “S. Smith”; two mentions labeled sjhonson@hotmail.com and sjhonson@hotmail.com get feature value +1.0]
Need for feature matrix (pairwise score)
Feature Value: different labels
[Figure: mentions “Steph”, “Stephan”, “Stephan”, and “S. Smith”; two mentions labeled sjhonson@hotmail.com and smith_s@aol.com get feature value -1.0]
Need for feature matrix (pairwise score)
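The two cases above can be turned into a pairwise feature matrix in a few lines. This is a sketch under the assumption that every mention already carries a resolved email-address label; the three-mention label list is a made-up example and the function name is illustrative.

```python
def same_label_feature(labels):
    """Pairwise feature: +1.0 if two mentions share a resolved label, else -1.0."""
    n = len(labels)
    return [
        [1.0 if labels[i] == labels[j] else -1.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical resolved labels for three mentions.
labels = ["sjhonson@hotmail.com", "sjhonson@hotmail.com", "smith_s@aol.com"]
matrix = same_label_feature(labels)
for row in matrix:
    print(row)
```

The matrix is symmetric with +1.0 on the diagonal, so only the upper triangle would need to be stored for a large mention set.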
Conclusion
MapReduce can be applied to many HLT applications
  easy, cheap, and fast for distributed processing
  e.g., scalable pairwise similarity for coreference resolution
  calls for new ways of thinking
Identity resolution in email
  new generative model yields improved accuracy
  scalable joint resolution needed
  Usenet-ACE is a new test collection
Thank You!
MapReduce and Text Analysis
  Computing pairwise similarity in large collections
  Joint resolution of mentions in email collections
  Search engines (of course!)
  Building language models
  Clustering applications
  Machine translation
  …