ACL, June 2008

Pairwise Document Similarity in Large Collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas W. Oard
University of Maryland, College Park
Human Language Technology Center of Excellence and UMIACS CLIP Lab


Page 2: Abstract Problem

[Figure: a collection of documents and the scores computed for pairs of them: 0.20, 0.30, 0.54, 0.21, 0.00, 0.34, 0.34, 0.13, 0.74]

Applications:
Clustering
Coreference resolution
“more-like-that” queries

Page 3: Trivial Solution

Load each vector O(N) times
Load each term O(df_t^2) times

Goal: a scalable and efficient solution for large collections
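The trivial solution can be sketched as a loop over all document pairs. This is a minimal illustration, not the authors' code: the function name, the document ids d1 to d3, and the use of raw term counts as weights are assumptions, and the vectors are the three-document toy collection that appears later in the deck.

```python
from itertools import combinations


def brute_force_similarity(vectors):
    """Score every document pair by the inner product of its term weights.

    vectors: dict mapping doc id -> dict mapping term -> weight.
    Each vector is read once for every other document, i.e. O(N) times.
    """
    scores = {}
    for di, dj in combinations(sorted(vectors), 2):
        vi, vj = vectors[di], vectors[dj]
        # Only terms present in both documents contribute to the score.
        score = sum(w * vj[t] for t, w in vi.items() if t in vj)
        if score:
            scores[(di, dj)] = score
    return scores


# Toy collection; ids and raw-count weights are assumptions for illustration.
toy = {
    "d1": {"Clinton": 2, "Obama": 1},
    "d2": {"Clinton": 1, "Cheney": 1},
    "d3": {"Clinton": 1, "Barack": 1, "Obama": 1},
}
scores = brute_force_similarity(toy)
print(scores)
```

The number of pairs grows quadratically with the collection size, which is what motivates the goal stated on this slide.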

Page 4: Better Solution

Load weights for each term once
Each term contributes O(df_t^2) partial scores
Each term contributes only if it appears in both documents of a pair
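The points on this slide can be written out as a formula. The following is a sketch that assumes similarity is the inner product of term weights; the symbols $w_{t,d}$ (weight of term $t$ in document $d$) and $V$ (the vocabulary) are introduced here for illustration, since the slide's own formula did not survive extraction.

```latex
\[
\mathrm{sim}(d_i, d_j)
  = \sum_{t \in V} w_{t,d_i}\, w_{t,d_j}
  = \sum_{t \in d_i \cap d_j} w_{t,d_i}\, w_{t,d_j}
\]
\[
\#\{\text{partial scores from term } t\}
  = \binom{df_t}{2}
  = \frac{df_t\,(df_t - 1)}{2}
  = O(df_t^2)
\]
```

The second equality in the first line holds because a term absent from either document has weight zero there, so its product vanishes.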

Page 5: MapReduce Framework

[Figure: input → (a) Map → (b) Shuffle: group values by keys → (c) Reduce → output]

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

The framework handles low-level details transparently
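The three stages can be mimicked in a few lines of single-process Python. This is only a toy model under stated assumptions: `run_mapreduce`, `wc_mapper`, and `wc_reducer` are hypothetical names, nothing here uses the Hadoop API, and a real framework distributes the same steps across machines.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in memory on a list of (k1, v1) records."""
    # (a) Map: each (k1, v1) record yields zero or more (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # (b) Shuffle: group all values that share the same key k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # (c) Reduce: each (k2, [v2]) group yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def wc_mapper(doc_id, text):
    # Emit (token, 1) for every token in the document.
    for token in text.split():
        yield token, 1


def wc_reducer(token, counts):
    # Sum the counts collected for one token.
    yield token, sum(counts)


result = run_mapreduce([("d1", "a b a"), ("d2", "b c")], wc_mapper, wc_reducer)
print(result)
```

Word counting is used only because it is the smallest job that exercises all three stages; the programmer supplies the mapper and reducer, and everything inside `run_mapreduce` stands in for what the framework does.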

Page 6: Decomposition

Load weights for each term once
Each term contributes O(df_t^2) partial scores
Each term contributes only if it appears in both documents of a pair

map: multiply term weights to produce partial scores
reduce: sum the partial scores of each document pair

Page 7: Standard Indexing

[Figure: doc → (a) Map: tokenize → (b) Shuffle: group values by terms → (c) Reduce: combine → posting list]

Page 8: Indexing (3-doc toy collection)

Documents:
Clinton Obama Clinton
Clinton Cheney
Clinton Barack Obama

[Figure: indexing turns the three documents into one posting list per term, holding the term's count in each document that contains it. Clinton: 2, 1, 1; Barack: 1; Cheney: 1; Obama: 1, 1]
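Indexing the toy collection can be sketched as a map step that tokenizes and a reduce step that combines postings. The document ids d1 to d3 and the function names are assumptions made for illustration; postings store raw term frequencies.

```python
from collections import Counter, defaultdict


def index_mapper(doc_id, text):
    """Map: tokenize one document and emit (term, (doc_id, tf)) pairs."""
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)


def index_reducer(term, postings):
    """Reduce: combine all postings of a term into one sorted posting list."""
    return term, sorted(postings)


def build_index(docs):
    # Shuffle: group the emitted postings by term.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_mapper(doc_id, text):
            groups[term].append(posting)
    return dict(index_reducer(term, postings) for term, postings in groups.items())


# Toy collection from this slide; the ids d1..d3 are assumed.
docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}
index = build_index(docs)
for term, postings in sorted(index.items()):
    print(term, postings)
```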

Page 9: Pairwise Similarity

[Figure: the toy index (Clinton: 2, 1, 1; Barack: 1; Cheney: 1; Obama: 1, 1) processed in three steps: (a) Generate pairs, (b) Group pairs, (c) Sum pairs. Partial scores generated: 2, 2, 1 from Clinton and 1 from Obama; summed pair similarities: 2, 3, 1]

Page 10: Pairwise Similarity (abstract)

[Figure: term postings → (a) Generate pairs: multiply → (b) Group pairs: shuffle, group values by pairs → (c) Sum pairs: sum → similarity]
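The generate, group, and sum steps can be sketched as a mapper and a reducer over posting lists. This is an in-memory illustration under assumptions, not the authors' implementation: the function names and the document ids d1 to d3 are hypothetical, and weights are raw term frequencies from the toy collection.

```python
from collections import defaultdict
from itertools import combinations


def pairs_mapper(term, postings):
    """(a) Generate pairs: one partial score per pair of documents sharing the term."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj


def pairs_reducer(pair, partial_scores):
    """(c) Sum pairs: add up the partial scores of one document pair."""
    return pair, sum(partial_scores)


def pairwise_similarity(index):
    # (b) Group pairs: collect partial scores by document pair.
    groups = defaultdict(list)
    n_intermediate = 0
    for term, postings in index.items():
        for pair, partial in pairs_mapper(term, postings):
            groups[pair].append(partial)
            n_intermediate += 1
    scores = dict(pairs_reducer(pair, partials) for pair, partials in groups.items())
    return scores, n_intermediate


# Toy index; document ids are assumed for illustration.
toy_index = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
    "Cheney": [("d2", 1)],
    "Barack": [("d3", 1)],
}
scores, n_intermediate = pairwise_similarity(toy_index)
print(scores, n_intermediate)
```

The counter `n_intermediate` tracks how many partial scores pass through the shuffle, which is the quantity the efficiency slides measure.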

Page 11: Experimental Setup

Hadoop 0.16.0: open-source MapReduce implementation
Cluster of 19 machines, each with two single-core processors
Aquaint-2 collection: 906K documents
Okapi BM25
Subsets of collection

Page 12: Efficiency (disk space)

[Figure: intermediate pairs (billions, axis from 0 to 9,000) vs. corpus size (%, axis from 0 to 100). Annotation: 8 trillion intermediate pairs]

Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 collection, ~906k docs

Page 13: Terms: Zipfian Distribution

[Figure: doc freq (df) vs. term rank]

Each term t contributes O(df_t^2) partial results
Very few terms dominate the computations

most frequent term (“said”): 3%
most frequent 10 terms: 15%
most frequent 100 terms: 57%
most frequent 1000 terms: 95%

~0.1% of total terms (99.9% df-cut)
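The effect of a df-cut on the number of intermediate pairs can be illustrated with synthetic data. This sketch assumes a df-cut means dropping the given fraction of terms with the highest document frequency; the function names and the Zipf-like frequencies are made up for illustration and are not the Aquaint-2 statistics.

```python
def intermediate_pairs(dfs):
    """Count the partial scores generated: df * (df - 1) / 2 per term."""
    return sum(df * (df - 1) // 2 for df in dfs)


def apply_df_cut(dfs, drop_fraction):
    """Drop the given fraction of terms with the highest document frequency.

    A 99.9% df-cut corresponds to drop_fraction=0.001.
    """
    ranked = sorted(dfs, reverse=True)
    n_drop = round(len(ranked) * drop_fraction)
    return ranked[n_drop:]


# Synthetic Zipf-like document frequencies for 1000 terms (assumed data).
dfs = [1000 // rank for rank in range(1, 1001)]

kept = apply_df_cut(dfs, 0.001)
before = intermediate_pairs(dfs)
after = intermediate_pairs(kept)
print(f"terms dropped: {len(dfs) - len(kept)}")
print(f"intermediate pairs: {before} -> {after}")
```

Because each term's cost is quadratic in its document frequency, removing even a single top-ranked term removes a disproportionate share of the pairs.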

Page 14: Efficiency (disk space)

[Figure: intermediate pairs (billions, axis from 0 to 9,000) vs. corpus size (%, axis from 0 to 100), one curve per setting: no df-cut, df-cut at 99.999%, df-cut at 99.99%, df-cut at 99.9%, df-cut at 99%. Annotations: 8 trillion intermediate pairs; 0.5 trillion intermediate pairs]

Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk
Aquaint-2 collection, ~906k docs

Page 15: Effectiveness (recent work)

Effect of df-cut on effectiveness
Medline04: 909k abstracts, ad-hoc retrieval

[Figure: relative P5 (%, axis from 50 to 100) vs. df-cut (%, axis from 99.00 to 100.00)]

Drop 0.1% of terms
“Near-linear” growth
Fit on disk
Cost 2% in effectiveness

Hadoop, 19 PCs, each with 2 single-core processors, 4GB memory, 100GB disk

Page 16: Open source implementation

Ivory
Java 1.5, Hadoop 0.16.0
Available soon …

Page 17: Conclusion

Simple and efficient MapReduce solution
Many HLT problems can also be “hadoopified”
E.g., statistical MT (see paper in StatMT workshop)
Shuffling is critical
df-cut controls the efficiency vs. effectiveness tradeoff
99.9% df-cut achieves 98% relative accuracy

Page 18: Future work

Apply to larger collections!

Develop analytical model

Measure effectiveness for different applications

Page 19: Thank You!

Page 20: Algorithm

Matrix must fit in memory
Works for small collections
Otherwise: disk access optimization