ACL, June 2008
Pairwise Document Similarity in Large Collections with MapReduce
Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
University of Maryland, College Park
Human Language Technology Center of Excellence and UMIACS CLIP Lab
Abstract Problem
[Figure: a collection of documents, drawn as blocks of text, alongside pairwise similarity scores (sample values: 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74)]
Applications:
Clustering
Coreference resolution
“more-like-that” queries
Trivial Solution
Load each vector O(N) times
Load each term O(df_t^2) times
Goal: a scalable and efficient solution for large collections
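For reference, the trivial approach can be sketched as a loop over all document pairs. This is a minimal illustration, not the authors' code: the function names `dot` and `naive_pairwise`, the dict-based sparse vectors, and the use of raw term counts as weights are all assumptions made for the example.

```python
from itertools import combinations

def dot(u, v):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[t] for t, w in u.items() if t in v)

def naive_pairwise(vectors):
    """Score every document pair directly; each vector is read O(N) times."""
    return {
        (a, b): dot(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }

# Illustrative term-count vectors for three tiny documents (assumed data).
vectors = {
    "d1": {"clinton": 2, "obama": 1},
    "d2": {"clinton": 1, "cheney": 1},
    "d3": {"clinton": 1, "barack": 1, "obama": 1},
}
scores = naive_pairwise(vectors)
```

With N documents this computes N(N-1)/2 inner products, which is the cost the goal above is trying to avoid.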
Better Solution
Load weights for each term once
Each term contributes O(df_t^2) partial scores
Each term contributes only if it appears in both documents of a pair
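These bullets rest on the inner-product form of similarity, which can be written out as a formula. The notation is assumed here for illustration: $w_{t,d}$ is the weight of term $t$ in document $d$, and $V$ is the vocabulary.

```latex
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
```

A term with document frequency $df_t$ occurs in $df_t (df_t - 1)/2$ document pairs, which is where the $O(df_t^2)$ partial scores per term come from.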
MapReduce Framework
[Figure: MapReduce dataflow. (a) Map: each input is processed by a map task. (b) Shuffle: intermediate values are grouped by key. (c) Reduce: each group is processed by a reduce task, which writes the output.]
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]
The framework handles low-level details transparently
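The three stages can be imitated in memory with a few lines of Python. This is only a sketch of the programming model, not of Hadoop itself: `run_mapreduce`, `count_mapper`, and `count_reducer` are hypothetical names, and word counting is used simply because it is the smallest familiar example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in memory on a list of (k1, v1) records."""
    # (a) Map: each (k1, v1) yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group all v2 values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(doc_id, text):
    """Emit (word, 1) for every whitespace-separated token."""
    return [(word, 1) for word in text.split()]

def count_reducer(word, counts):
    """Sum the counts collected for one word."""
    return [(word, sum(counts))]

counts = run_mapreduce(
    [("d1", "a b a"), ("d2", "b c")], count_mapper, count_reducer
)
```

A real framework distributes the map and reduce calls across machines and performs the shuffle over the network; the user still writes only the two functions.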
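Decomposition
Load weights for each term once
Each term contributes O(df_t^2) partial scores
Each term contributes only if it appears in both documents of a pair
[Slide annotations: the per-term products are labeled "map"; the sum over terms is labeled "reduce"]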
Standard Indexing
[Figure: MapReduce dataflow for indexing. (a) Map: each doc is tokenized. (b) Shuffle: values are grouped by terms. (c) Reduce: postings are combined into one posting list per term.]
Indexing (3-doc toy collection)
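Documents (numbered d1 to d3 in order of appearance):
d1: Clinton Obama Clinton
d2: Clinton Cheney
d3: Clinton Barack Obama
Posting lists after indexing, as (document, term frequency):
Clinton: (d1, 2), (d2, 1), (d3, 1)
Barack: (d3, 1)
Cheney: (d2, 1)
Obama: (d1, 1), (d3, 1)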
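The indexing step on the toy collection can be sketched as follows. The document labels `d1` to `d3`, the function names `index_map` and `build_index`, and whitespace tokenization are assumptions for illustration; a real indexer would also normalize case and tokens.

```python
from collections import Counter, defaultdict

docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}

def index_map(doc_id, text):
    """Map: tokenize one document and emit (term, (doc_id, tf)) pairs."""
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def build_index(docs):
    """Shuffle by term, then reduce each group into a sorted posting list."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_map(doc_id, text):
            postings[term].append(posting)  # shuffle: group by term
    # Reduce: combine the postings of each term into one sorted list.
    return {term: sorted(plist) for term, plist in postings.items()}

index = build_index(docs)
```

Each entry of `index` is one posting list, so the weights of a term are stored together and can later be loaded once.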
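Pairwise Similarity
(Toy collection, with the three documents numbered d1 to d3 in order of appearance)
(a) Generate pairs, one partial score per document pair sharing a term:
Clinton: (d1, d2) 2, (d1, d3) 2, (d2, d3) 1
Obama: (d1, d3) 1
(b) Group pairs:
(d1, d2): [2]
(d1, d3): [2, 1]
(d2, d3): [1]
(c) Sum pairs:
(d1, d2): 2
(d1, d3): 3
(d2, d3): 1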
Pairwise Similarity (abstract)
[Figure: MapReduce dataflow for pairwise similarity. (a) Generate pairs: map tasks read term postings and multiply the weights. (b) Group pairs: shuffling groups values by document pair. (c) Sum pairs: reduce tasks sum the grouped values to give each similarity.]
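The multiply, group, and sum stages can be sketched in Python as below. This is an in-memory illustration under assumptions, not the authors' implementation: the names `generate_pairs` and `pairwise_similarity` are hypothetical, the posting lists are those of the three-document toy collection, and raw term frequency stands in for the term weight.

```python
from collections import defaultdict
from itertools import combinations

# Posting lists as {term: [(doc_id, weight), ...]}; weight is raw tf here.
index = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Barack": [("d3", 1)],
    "Cheney": [("d2", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}

def generate_pairs(postings):
    """Map: multiply weights for every pair of documents sharing the term."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj

def pairwise_similarity(index):
    """Group partial scores by document pair, then sum each group."""
    grouped = defaultdict(list)
    for postings in index.values():
        for pair, partial in generate_pairs(postings):
            grouped[pair].append(partial)  # shuffle: group by pair
    # Reduce: the similarity of a pair is the sum of its partial scores.
    return {pair: sum(parts) for pair, parts in grouped.items()}

similarity = pairwise_similarity(index)
```

Terms that occur in a single document, such as Barack and Cheney here, emit nothing, so only terms shared by a pair contribute to its score.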
Experimental Setup
Hadoop 0.16.0: open source MapReduce implementation
Cluster of 19 machines, each with two single-core processors
Aquaint-2 collection: 906K documents
Okapi BM25 term weights
Experiments on subsets of the collection
Efficiency (disk space)
[Figure: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100)]
8 trillion intermediate pairs
Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 collection, ~906K docs
Terms: Zipfian Distribution
[Figure: doc freq (df) vs. term rank]
Each term t contributes O(df_t^2) partial results
Very few terms dominate the computation:
most frequent term (“said”): 3%
most frequent 10 terms: 15%
most frequent 100 terms: 57%
most frequent 1000 terms: 95% (~0.1% of total terms; 99.9% df-cut)
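The effect of a df-cut on the number of intermediate pairs can be sketched as follows. The function names `intermediate_pairs` and `apply_df_cut` are hypothetical, and the document frequencies are made-up Zipf-like values, not figures from the Aquaint-2 experiments.

```python
def intermediate_pairs(df):
    """Number of document pairs generated by a term with document frequency df."""
    return df * (df - 1) // 2

def apply_df_cut(doc_freqs, cut):
    """Keep the `cut` fraction of terms with the lowest df and drop the rest.

    doc_freqs maps term -> document frequency; cut=0.999 mirrors a 99.9% df-cut.
    """
    ranked = sorted(doc_freqs, key=doc_freqs.get)
    keep = round(len(ranked) * cut)
    return {t: doc_freqs[t] for t in ranked[:keep]}

# Made-up Zipf-like document frequencies for 1000 terms (illustrative only).
doc_freqs = {f"t{r}": max(1, 100_000 // r) for r in range(1, 1001)}

total = sum(intermediate_pairs(df) for df in doc_freqs.values())
kept = apply_df_cut(doc_freqs, 0.999)
after_cut = sum(intermediate_pairs(df) for df in kept.values())
```

Because the cost per term is quadratic in its document frequency, dropping even the single most frequent term in this synthetic data removes a large share of the pairs.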
Efficiency (disk space)
[Figure: intermediate pairs (billions, 0 to 9,000) vs. corpus size (%, 0 to 100), with one curve each for no df-cut and for df-cuts at 99.999%, 99.99%, 99.9%, and 99%]
8 trillion intermediate pairs
0.5 trillion intermediate pairs
Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 collection, ~906K docs
Effectiveness (recent work)
Effect of df-cut on effectiveness
Medline04: 909K abstracts, ad-hoc retrieval
[Figure: relative P5 (%, 50 to 100) vs. df-cut (%, 99.00 to 100.00)]
Drop 0.1% of terms
“Near-linear” growth
Fit on disk
Cost 2% in effectiveness
Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Open source implementation
Ivory
Java 1.5, Hadoop 0.16.0
Available soon …
Conclusion
Simple and efficient MapReduce solution
Many HLT problems can also be “hadoopified”
E.g., statistical MT (see paper in StatMT workshop)
Shuffling is critical
df-cut controls the efficiency vs. effectiveness tradeoff
99.9% df-cut achieves 98% relative accuracy
Future work
Apply to larger collections!
Develop analytical model
Measure effectiveness for different applications
Thank You!
Algorithm
Matrix must fit in memory
Works for small collections
Otherwise: disk access optimization