Lecture 6: Data-Intensive Computing for Text Analysis (Fall 2011)

Post on 13-Sep-2014

1.262 views 2 download

Tags:

description

 

Transcript of Lecture 6: Data-Intensive Computing for Text Analysis (Fall 2011)

Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M

University of Texas at Austin, Fall 2011

Lecture 6 September 29, 2011

Matt Lease

School of Information

University of Texas at Austin

ml at ischool dot utexas dot edu

Jason Baldridge

Department of Linguistics

University of Texas at Austin

Jasonbaldridge at gmail dot com

Acknowledgments

Course design and slides based on Jimmy Lin’s cloud computing courses at the University of Maryland, College Park

Some figures courtesy of the following excellent Hadoop books (order yours today!)

• Chuck Lam’s Hadoop In Action (2010)

• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)

Today’s Agenda

• Automatic Spelling Correction

– Review: Information Retrieval (IR)

• Boolean Search

• Vector Space Modeling

• Inverted Indexing in MapReduce

– Probabilisitic modeling via noisy channel

• Index Compression

– Order inversion in MapReduce

• In-class exercise

• Hadoop: Pipelined & Chained jobs

Automatic Spelling Correction

Automatic Spelling Correction

Three main stages

Error detection

Candidate generation

Candidate ranking / choose best candidate

Usage cases

Flagging possible misspellings / spell checker

Suggesting possible corrections

Automatically correcting (inferred) misspellings

• “as you type” correction

• web queries

• real-time closed captioning

• …

Types of spelling errors

Unknown words: “She is their favorite acress in town.”

Can be identified using a dictionary…

…but could be a valid word not in the dictionary

Dictionary could be automatically constructed from large corpora

• Filter out rare words (misspellings, or valid but unlikely)…

• Why filter out rare words that are valid?

Unknown words violating phonotactics:

e.g. “There isn’t enough room in this tonw for the both of us.”

Given dictionary, could automatically construct “n-gram dictionary”

of all character n-grams known in the language

• e.g. English words don’t end with “nw”, so flag tonw

Incorrect homophone: “She drove their.”

Valid word, wrong usage; infer appropriateness from context

Typing errors reflecting kayout of leyboard

Candidate generation

How to generate possible corrections for acress?

Inspiration: how do people do it?

People may suggest words like actress, across, access, acres,

caress, and cress – what do these have in common?

What about “blam” and “zigzag”?

Two standard strategies for candidate generation

Minimum edit distance

• Generate all candidates within 1+ edit step(s)

• Possible edit operations: insertion, deletion, substitution, transposition, …

• Filter through a dictionary

• See Peter Norvig’s post: http://norvig.com/spell-correct.html

Character ngrams: see next slide…

Character ngram Spelling Correction

Information Retrieval (IR) model

Query=typo word

Document collection = dictionary (i.e. set of valid words)

Representation: word is set of character ngrams

Let’s use n=3 (trigram), with # to mark word start/end

Examples

across: [#ac, acr, cro, oss, ss#]

acress: [#ac, acr, cre, res, ess, ss#]

actress: [#ac, act, ctr, tre, res, ess, ss#]

blam: [#bl, bla, lam, am#]

mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]

Uhm, IR model???

Review…

Abstract IR Architecture

Documents Query

Results

Representation

Function

Representation

Function

Query Representation Document Representation

Comparison

Function Index

offline online

Document Boolean Representation

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.

McDonalds

fat

fries

new

french

Company

Said

nutrition

“Bag of Words”

Boolean Retrieval

dogs Doc 1

dolphins Doc 2

football Doc 3

football dolphins Doc 4

Inverted Index: Boolean Retrieval

one fish, two fish Doc 1

red fish, blue fish Doc 2

cat in the hat Doc 3

1

1

1

1

1

1

1 2 3

1

1

1

4

blue

cat

egg

fish

green

ham

hat

one

3

4

1

4

4

3

2

1

blue

cat

egg

fish

green

ham

hat

one

2

green eggs and ham Doc 4

1 red

1 two

2 red

1 two

Inverted Indexing via MapReduce

1 one

1 two

1 fish

one fish, two fish Doc 1

2 red

2 blue

2 fish

red fish, blue fish Doc 2

3 cat

3 hat

cat in the hat Doc 3

1 fish 2

1 one 1 two

2 red

3 cat

2 blue

3 hat

Shuffle and Sort: aggregate values by keys

Map

Reduce

Inverted Indexing in MapReduce

1: class Mapper

2: procedure Map(docid n; doc d)

3: H = new Set

4: for all term t in doc d do

5: H.add(t)

6: for all term t in H do

7: Emit(term t, n)

1: class Reducer

2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])

3: List P = docids.values()

4: Emit(term t; P)

Scalability Bottleneck

Desired output format: <term, [doc1, doc2, …]>

Just emitting each <term, docID> pair won’t produce this

How to produce this without buffering?

Side-effect: write directly to HDFS instead of emitting

Complications?

• Persistent data must be cleaned up if reducer restarted…

Using the Inverted Index

Boolean Retrieval: to execute a Boolean query

Build query syntax tree

For each clause, look up postings

Traverse postings and apply Boolean operator

Efficiency analysis

Start with shortest posting first

Postings traversal is linear (if postings are sorted)

• Oops… we didn’t actually do this in building our index…

( blue AND fish ) OR ham

blue fish

AND ham

OR

1

2 blue

fish 2

Inverted Indexing in MapReduce

1: class Mapper

2: procedure Map(docid n; doc d)

3: H = new Set

4: for all term t in doc d do

5: H.add(t)

6: for all term t in H do

7: Emit(term t, n)

1: class Reducer

2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])

3: List P = docids.values()

4: Emit(term t; P)

Inverted Indexing in MapReduce: try 2

1: class Mapper

2: procedure Map(docid n; doc d)

3: H = new Set

4: for all term t in doc d do

5: H.add(t)

6: for all term t in H do

7: Emit(term t, n)

1: class Reducer

2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])

3: List P = docids.values()

4: Sort(P)

5: Emit(term t; P)

1

fish

2

(Another) Scalability Bottleneck

Reducers buffers all docIDs associated with term (to sort)

What if term occurs in many documents?

Secondary sorting

Use composite key

Partition function

Key Comparator

Side-effect: write directly to HDFS as before…

Inverted index for spelling correction

Like search, spelling correction must be fast

How can we quickly identify candidate corrections?

II: Map each character ngram list of all words containing it

#ac -> { act, across, actress, acquire, … }

acr -> { across, acrimony, macro, … }

cre -> { crest, acre, acres, … }

res -> { arrest, rest, rescue, restaurant, … }

ess -> { less, lesson, necessary, actress, … }

ss# -> { less, mess, moss, across, actress, … }

How do we build the inverted index in MapReduce?

Exercise

Write a MapReduce algorithm for creating an inverted

index for trigram spelling correction, given a corpus

Exercise

Write a MapReduce algorithm for creating an inverted

index for trigram spelling correction, given a corpus

Also other alternatives, e.g. in-mapper combining, pairs

Is MapReduce even necessary for this?

Dictionary vs. token frequency

Map(String docid, String text):

for each word w in text:

for each trigram t in w:

Emit(t, w)

Reduce(String trigram, Iterator<Text> values):

Emit(trigram, values.toSet)

Spelling correction as Boolean search

Given inverted index, how to find set of possible corrections?

Compute union of all words indexed by any of its character ngrams

= Boolean search

• Query “acress” “#ac OR acr OR cre OR res OR ess OR ss# “

Are all corrections equally likely / good?

Ranked Information Retrieval

Order documents by probability of relevance

Estimate relevance of each document to the query

Rank documents by relevance

How do we estimate relevance?

Vector space paradigm

Approximate relevance by vector similarity (e.g. cosine)

Represent queries and documents as vectors

Rank documents by vector similarity to the query

Vector Space Model

Assumption: Documents that are “close” in vector space

“talk about” the same things

t1

d2

d1

d3

d4

d5

t3

t2

θ

φ

Retrieve documents based on how close the document

vector is to the query vector (i.e., similarity ~ “closeness”)

Similarity Metric

Use “angle” between the vectors

Given pre-normalized vectors, just compute inner product

n

i ki

n

i ji

n

i kiji

kj

kj

kj

ww

ww

dd

ddddsim

1

2

,1

2

,

1 ,,),(

kj

kj

dd

dd

)cos(

n

i kijikjkj wwddddsim1 ,,),(

Boolean Character ngram correction

Boolean Information Retrieval (IR) model

Query=typo word

Document collection = dictionary (i.e. set of valid words)

Representation: word is set of character ngrams

Let’s use n=3 (trigram), with # to mark word start/end

Examples

across: [#ac, acr, cro, oss, ss#]

acress: [#ac, acr, cre, res, ess, ss#]

actress: [#ac, act, ctr, tre, res, ess, ss#]

blam: [#bl, bla, lam, am#]

mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]

Ranked Character ngram correction

Vector space Information Retrieval (IR) model

Query=typo word

Document collection = dictionary (i.e. set of valid words)

Representation: word is vector of character ngram value

Rank candidate corrections according to vector similarity (cosine)

Trigram Examples

across: [#ac, acr, cro, oss, ss#]

acress: [#ac, acr, cre, res, ess, ss#]

actress: [#ac, act, ctr, tre, res, ess, ss#]

blam: [#bl, bla, lam, am#]

mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]

Spelling Correction in Vector Space

Assumption: Words that are “close together” in ngram

vector space have similar orthography

t1

d2

d1

d3

d4

d5

t3

t2

θ

φ

Therefore, retrieve words in the dictionary based on how

close the word is to the typo (i.e., similarity ~ “closeness”)

Ranked Character ngram correction

Vector space Information Retrieval (IR) model

Query=typo word

Document collection = dictionary (i.e. set of valid words)

Representation: word is vector of character ngram value

Rank candidate corrections according to vector similarity (cosine)

Trigram Examples

across: [#ac, acr, cro, oss, ss#]

acress: [#ac, acr, cre, res, ess, ss#]

actress: [#ac, act, ctr, tre, res, ess, ss#]

blam: [#bl, bla, lam, am#]

mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]

“value” here expresses relative importance of different

vector components for the similarity comparison

Use simple count here, what else might we do?

IR Term Weighting

Term weights consist of two components

Local: how important is the term in this document?

Global: how important is the term in the collection?

Here’s the intuition:

Terms that appear often in a document should get high weights

Terms that appear in many documents should get low weights

How do we capture this mathematically?

Term frequency (local)

Inverse document frequency (global)

TF.IDF Term Weighting

i

jijin

Nw logtf ,,

jiw ,

ji ,tf

N

in

weight assigned to term i in document j

number of occurrence of term i in document j

number of documents in entire collection

number of documents with term i

2

1

1

2

1

1

1

1

1

1

1

Inverted Index: TF.IDF

2

1

2

1

1

1

1 2 3

1

1

1

4

1

1

1

1

1

1

2

1

tf

df

blue

cat

egg

fish

green

ham

hat

one

1

1

1

1

1

1

2

1

blue

cat

egg

fish

green

ham

hat

one

1 1 red

1 1 two

1 red

1 two

one fish, two fish Doc 1

red fish, blue fish Doc 2

cat in the hat Doc 3

green eggs and ham Doc 4

3

4

1

4

4

3

2

1

2

2

1

Inverted Indexing via MapReduce

1 one

1 two

1 fish

one fish, two fish Doc 1

2 red

2 blue

2 fish

red fish, blue fish Doc 2

3 cat

3 hat

cat in the hat Doc 3

1 fish 2

1 one 1 two

2 red

3 cat

2 blue

3 hat

Shuffle and Sort: aggregate values by keys

Map

Reduce

1

1

2

1

1

2 2

1 1

1

1

1

1

1

1

2

Inverted Indexing via MapReduce (2)

1 one

1 two

1 fish

one fish, two fish Doc 1

2 red

2 blue

2 fish

red fish, blue fish Doc 2

3 cat

3 hat

cat in the hat Doc 3

1 fish 2

1 one 1 two

2 red

3 cat

2 blue

3 hat

Shuffle and Sort: aggregate values by keys

Map

Reduce

Inverted Indexing: Pseudo-Code

Further exaccerbates earlier scalability issues …

Ranked Character ngram correction

Vector space Information Retrieval (IR) model

Query=typo word

Document collection = dictionary (i.e. set of valid words)

Representation: word is vector of character ngram value

Rank candidate corrections according to vector similarity (cosine)

Trigram Examples

across: [#ac, acr, cro, oss, ss#]

acress: [#ac, acr, cre, res, ess, ss#]

actress: [#ac, act, ctr, tre, res, ess, ss#]

blam: [#bl, bla, lam, am#]

mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]

“value” here expresses relative importance of different

vector components for the similarity comparison

What else might we do? TF.IDF for character n-grams?

TF.IDF for character n-grams

Think about what makes an ngram more discriminating

e.g. in acquire, acq and cqu are more indicative than qui and ire.

Schematically, we want something like:

• acquire: [ #ac, acq, cqu, qui, uir, ire, re# ]

Possible solution: TF-IDF, where

TF is the frequency of the ngram in the word

IDF is the number of words the ngram occurs in in the vocabulary

Correction Beyond Orthography

So far we’ve focused on orthography alone

The context of a typo also tells us a great deal

How can we compare contexts?

Correction Beyond Orthography

So far we’ve focused on orthography alone

The context of a typo also tells us a great deal

How can we compare contexts?

Idea: use the co-occurrence matrices built during HW2

We have a vector of co-occurrence counts for each word

Extract a similar vector for the typo given its immediate context

• “She is their favorite acress in town.”

acress: [ she:1, is:1, their:1, favorite:1, in:1, town:1 ]

Possible enhancement: make vectors sensitive to word order

Combining evidence

We have orthographic similarity and contextual similarity

We can do a simple weighted combination of the two, e.g.:

How to do this more efficiently?

Compute top candidates based on simOrth

Take top k for consideration with simContext

…or other way around…

The combined model might also be expressed by a similar

probabilistic model…

),(1),(),( kjkjkj ddsimContextddsimOrthdddsimCombine

March 22, 2005 42

Paradigm: Noisy-Channel Modeling

)|()(maxarg)|(maxarg SOPSPOSPsSS

Want to recover most likely latent (correct) source

word underlying the observed (misspelled) word

P(S): language model gives probability distribution

over possible (candidate) source words

P(O|S): channel model gives probability of each

candidate source word being “corrupted” into the

observed typo

Noisy Channel Model for correction

We want to rank candidates by P(cand | typo)

Using Bayes law, the chain rule, an independence

assumption, and logs, we have:

)|(log)|(log

)|()|(

)()|()|(

),()|(

),(),|(

),,(

),(

),,(),|(

contextcandPcandtypoP

contextcandPcandtypoP

contextPcontextcandPcandtypoP

contextcandPcandtypoP

contextcandPcontextcandtypoP

contexttypocandP

contexttypoP

contexttypocandPcontexttypocandP

Probabilistic vs. vector space model

Both measure orthographic & contextual “fit” of the

candidate given the typo and its usage context

Noisy channel:

IR approach:

Both can benefit from “big” data (i.e. bigger samples)

Better estimates of probabilities and population frequencies

Usual probabilistic vs. non-probabilistic tradeoffs

Principled theory and methodology for modeling and estimation

How to extend the feature space to include additional information?

• Typing haptics (key proximity)? Cognitive errors (e.g. homonyms)?

)|(log)|(log),|( contextcandPcandtypoPcontexttypocandP

),(1),(),( kjkjkj ddsimContextddsimOrthdddsimCombine

Index Compression

2 1 3 1 2 3

2 1 3 1 2 3

Postings Encoding

1 fish 9 21 34 35 80 …

1 fish 8 12 13 1 45 …

Conceptually:

In Practice:

•Instead of document IDs, encode deltas (or d-gaps) • But it’s not obvious that this save space…

Overview of Index Compression

Byte-aligned vs. bit-aligned

Non-parameterized bit-aligned

Unary codes

(gamma) codes

(delta) codes

Parameterized bit-aligned

Golomb codes

Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!

But First... General Data Compression

Run Length Encoding

7 7 7 8 8 9 = (7, 3), (8,2), (9,1)

Binary Equivalent

0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 = 6, 1, 3, 2, 3

Good with sparse binary data

Huffman Coding

Optimal when data is distributed by negative powers of two

e.g. P(a)= ½, P(b) = ¼, P(c)=1/8, P(d)=1/8

• a = 0, b = 10, c= 110, d=111

Prefix codes: no codeword is the prefix of another codeword

• If we read 0, we know it’s an “a” following bits are a new codeword

• Similarly 10 is a b (no other codeword starts with 10), etc.

• Prefix is 1* (i.e. path to internal nodes is all 1s, output on leaves)

Unary Codes

Encode number as a run of 1s, specifically…

x 1 coded as x-1 1s, followed by zero bit terminator

1 = 0

2 = 10

3 = 110

4 = 1110

...

Great for small numbers… horrible for large numbers

Overly-biased for very small gaps

codes

x 1 is coded in two parts: unary length : offset

Start with binary encoded, remove highest-order bit = offset

Length is number of binary digits, encoded in unary

Concatenate length + offset codes

Example: 9 in binary is 1001

Offset = 001

Length = 4, in unary code = 1110

code = 1110:001

Another example: 7 (111 in binary)

• offset=11, length=3 (110 in unary) code = 110:11

Analysis

Offset = log x

Length = log x +1

Total = 2 log x +1 (97 bits, 75 bits, …)

codes

As with codes, two parts: unary length & offset

Offset is same as before

Length is encoded by its code

Example: 9 (=1001 in binary)

Offset = 001

Length = 4 (100), offset=00, length 3 = 110 in unary

• code=110:00

code = 110:00:001

Comparison

codes better for smaller numbers

codes better for larger numbers

Golomb Codes

x 1, parameter b

x encoded in two parts

Part 1: q = ( x - 1 ) / b , code q + 1 in unary

Part 2: remainder r<b, r = x - qb – 1 coded in truncated binary

Truncated binary defines prefix code

if b is a power of 2

• easy case: truncated binary = regular binary

else

• First 2^(log b + 1) – b values encoded in log b bits

• Remaining values encoded in log b + 1 bits

Let’s see some examples

Golomb Code Examples

b = 3, r = [0:2]

First 2^(log 3 + 1) – 3 = 2^2 – 3 = 1 values, in log 3 = 1 bit

First 1 value in 1 bit: 0

Remaining 3-1=2 values in 1+1=2 bits with prefix 1: 10, 11

b = 5, r = [0:4]

First 2^(log 5 + 1) – 5 = 2^3 – 5 = 3 values, in log 5 = 2 bits

First 3 values in 2 bits: 00, 01, 10

Remaining 5-3=2 values in 2+1=3 bits with prefix 11: 110, 111

• Two prefix bits needed since single leading 1 already used in “10”

b = 6, r = [0:5]

First 2^(log 6 + 1) – 6 = 2^3 – 6 = 2 values, in log 6 = 2 bits

First 2 values in 2 bits: 00, 01

Remaining 6-2=4 values in 2+1=3 bits with prefix 1: 100, 101, 110, 111

Comparison of Coding Schemes

1 0 0 0 0:0 0:00

2 10 10:0 100:0 0:10 0:01

3 110 10:1 100:1 0:11 0:100

4 1110 110:00 101:00 10:0 0:101

5 11110 110:01 101:01 10:10 0:110

6 111110 110:10 101:10 10:11 0:111

7 1111110 110:11 101:11 110:0 10:00

8 11111110 1110:000 11000:000 110:10 10:01

9 111111110 1110:001 11000:001 110:11 10:100

10 1111111110 1110:010 11000:010 1110:0 10:101

Unary Golomb

b=3 b=6

Witten, Moffat, Bell, Managing Gigabytes (1999)

See Figure 4.5 in Lin & Dyer p. 77 for b=5 and b=10

Index Compression: Performance

Witten, Moffat, Bell, Managing Gigabytes (1999)

Unary 262 1918

Binary 15 20

6.51 6.63

6.23 6.38

Golomb 6.09 5.84

Bible TREC

Use Golomb for d-gaps, codes for term frequencies

Optimal b 0.69 (N/df): Different b for every term!

Bible: King James version of the Bible; 31,101 verses (4.3 MB)

TREC: TREC disks 1+2; 741,856 docs (2070 MB)

Comparison of Index Size (bits per pointer)

[2,4]

[9]

[1,8,22]

[23]

[8,41]

[2,9,76]

[2,4]

[9]

[1,8,22]

[23]

[8,41]

[2,9,76]

2

1

3

1

2

3

Where are we without compression?

1 fish

9

21

(values) (key)

34

35

80

1 fish

9

21

(values) (keys)

34

35

80

fish

fish

fish

fish

fish

How is this different? • Let the framework do the sorting

• Directly write postings to disk

• Term frequency implicitly stored

Index Compression in MapReduce

Need df to compress posting for each term

How do we compute df?

Count the # of postings in reduce(), then compress

Problem?

Order Inversion Pattern

In the mapper:

Emit “special” key-value pairs to keep track of df

In the reducer:

Make sure “special” key-value pairs come first: process them to

determine df

Remember: proper partitioning!

Getting the df: Modified Mapper

one fish, two fish Doc 1

1 fish [2,4]

(value) (key)

1 one [1]

1 two [3]

fish [1]

one [1]

two [1]

Input document…

Emit normal key-value pairs…

Emit “special” key-value pairs to keep track of df…

Getting the df: Modified Reducer

1 fish

9

[2,4]

[9]

21 [1,8,22]

(value) (key)

34 [23]

35 [8,41]

80 [2,9,76]

fish

fish

fish

fish

fish

Write postings directly to disk

fish [63] [82] [27] …

First, compute the df by summing contributions

from all “special” key-value pair…

Compress postings incrementally as they arrive

Important: properly define sort order to make

sure “special” key-value pairs come first!

Where have we seen this before?

In-class Exercise

Exercise: where have all the ngrams gone?

For each observed (word) trigram in collection, output its observed (docID, wordIndex) locations

Input

Output Possible Tools:

* pairs/stripes?

* combining?

* secondary sorting?

* order inversion?

* side effects?

one fish two

one fish two fish Doc 1

one fish two salmon Doc 2

two fish two fish Doc 3

fish two fish

fish two salmon

two fish two

[(1,1),(2,1)]

[(1,2),(3,2)]

[(2,2)]

[(3,1)]

Exercise: shingling Given observed (docID, wordIndex) ngram locations For each document, for each of its ngrams (in order), give a list of the ngram locations for that ngram Input Possible Tools: * pairs/stripes? Output * combining? * secondary sorting? * order inversion? * side effects?

Doc 1

fish two fish

fish two salmon

two fish two

[ [(1,1),(2,1)], [(1,2),(3,2)] ]

[(1,2),(3,2)]

[(2,2)]

[(3,1)]

one fish two [(1,1),(2,1)]

Doc 2 [ [(1,1),(2,1)], [(2,2)] ]

Doc 3 [ [(3,1)], [(1,2),(3,2)] ]

Exercise: shingling (2) How can we recognize when longer ngrams are aligned across documents?

Example

doc 1: a b c d e

doc 2: a b c d f

doc 3: e b c d f

doc 4: a b c d e

Find “a b c d” in docs 1 2 and 4,

“b c d f” in 2 & 3

“a b c d e” in 1 and 4

class Alignment int index // start position in this document int length // sequence length in ngrams typedef Pair<int docID, int position> Ngram; int otherID // ID of other document int otherIndex // start position in other document class NgramExtender Set<Alignment> alignments = empty set index=0; NgramExtender(int docID) { _docID = docID } close() { foreach Alignment a, emit(_docID, a) } AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document

... @inproceedings{Kolak:2008,

author = {Kolak, Okan and Schilit, Bill N.},

title = {Generating links by mining quotations},

booktitle = {19th ACM conference on Hypertext and

hypermedia},

year = {2008},

pages = {117--126}

}

class Alignment int index // start position in this document int length // sequence length in ngrams typedef Pair<int docID, int position> Ngram; int otherID // ID of other document int otherIndex // start position in other document class NgramExtender Set<Alignment> alignments = empty set index=0; NgramExtender(int docID) { _docID = docID } close() { foreach Alignment a, emit(_docID, a) } AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document ++index; foreach Alignment a in alignments Ngram next = new Ngram(a.otherID, a.otherIndex + a.length) if (ngrams.contains(next)) // extend alignment a.length += 1; ngrams.remove(next) else // terminate alignment emit _docID, (a); alignments.remove(a) foreach ngram in ngrams alignments.add( new Alignment( index, 1, ngram.docID, ngram.otherIndex )

Sequences of MapReduce Jobs

Building more complex MR algorithms

Monolithic single Map + single Reduce

What we’ve done so far

Fitting all computation to this model can be difficult and ugly

We generally strive for modularization when possible

What else can we do?

Pipeline: [Map Reduce] [Map Reduce] … (multiple sequential jobs)

Chaining: [Map+ Reduce Map*]

• 1 or more Mappers

• 1 reducer

• 0 or more Mappers

Pipelined Chain: [Map+ Reduce Map*] [Map+ Reduce Map*] …

Express arbitrary dependencies between jobs

Modularization and WordCount

General benefits of modularization

Re-use for easier/faster development

Consistent behavior across applications

Easier/faster to maintain/extend for benefit of many applications

Even basic word count can be broken down

Pre-processing

• How will we tokenize? Perform stemming? Remove stopwords?

Main computation: count tokenized tokens and group by word

Post-processing

• Transform the values? (e.g. log-damping)

Let’s separate tokenization into its own module

Many other tasks can likely benefit

First approach: pipeline…

Pipeline WordCount Modules

Count Observer Mapper

• List[String] -> List[(String, Int)]

• E.g. (10032, [“the”, “10”, “cats”, “sleep”] -> [(“the”,1), (“10”, 1), (“cats”,1), (“sleep”,1)]

LongSumReducer

• Sum token counts

• E.g. (“sleep”, [1, 5, 2]) -> (“sleep”, 8)

Tokenize Tokenizer Mapper

• String -> List[String]

• Keep doc ID key

• E.g.(10032, “the 10 cats sleep”) -> (10032, [“the”, “10”, “cats”, “sleep”])

No Reducer

Pipeline WordCount in Hadoop

Two distinct jobs: tokenize and count

Data sharing between jobs via persistent output

Can use combiners and partitioners as usual (won’t bother here)

Let’s use SequenceFileOutputFormat rather than TextOutputFormat

sequence of binary key-value pairs; faster / smaller

tokenization output will stick around unless we delete it

Tokenize job

Just a mapper, no reducer: conf.setNumReduceTasks(0) or IdentityReducer

Output goes to directory we specify

Files will be read back in by the counting job

Output is array of tokens

We need to make a suitable Writable for String arrays

Count job

Input types defined by the input SequenceFile (don’t need to be specified)

Mapper is trivial

observes tokens from incoming data

Key: (docid) & Value: (Array of Strings, encoded as a Writable)

Pipeline WordCount (old Hadoop API)

Configuration conf = new Configuration();

String tmpDir1to2 = "/tmp/intermediate1to2";

// Tokenize job

JobConf tokenizationJob = new JobConf(conf);

tokenizationJob.setJarByClass(PipelineWordCount.class);

FileInputFormat.setInputPaths(tokenizationJob, new Path(inputPath));

FileOutputFormat.setOutputPath(tokenizationJob, new Path(tmpDir1to2));

tokenizationJob.setOutputFormat(SequenceFileOutputFormat.class);

tokenizationJob.setMapperClass(AggressiveTokenizerMapper.class);

tokenizationJob.setOutputKeyClass(LongWritable.class);

tokenizationJob.setOutputValueClass(TextArrayWritable.class);

tokenizationJob.setNumReduceTasks(0);

// Count job

JobConf countingJob = new JobConf(conf);

countingJob.setJarByClass(PipelineWordCount.class);

countingJob.setInputFormat(SequenceFileInputFormat.class);

FileInputFormat.setInputPaths(countingJob, new Path(tmpDir1to2));

FileOutputFormat.setOutputPath(countingJob, new Path(outputPath));

countingJob.setMapperClass(TrivialWordObserver.class);

countingJob.setReducerClass(MapRedIntSumReducer.class);

countingJob.setOutputKeyClass(Text.class);

countingJob.setOutputValueClass(IntWritable.class);

countingJob.setNumReduceTasks(reduceTasks);

JobClient.runJob(tokenizationJob);

JobClient.runJob(countingJob);

Pipeline jobs in Hadoop

Old API

JobClinet.runJob(..) does not return until job finishes

New API

Use Job rather than JobConf

Use job.waitForCompletion instead of JobClient.runJob

Why Old API?

In 0.20.2, chaining only possible under old API

We want to re-use the same components for chaining (next…)

Chaining in Hadoop

Map+ Reduce Map*

1 or more Mappers

• Can use IdentityMapper

1 reducer

• No reducers: conf.setNumReduceTasks(0)?

0 or more Mappers

Usual combiners and partitioners

By default, data passed between

Mappers by usual writing of

intermediate data to disk

Can always use side-effects…

There is a better, built-in way to bypass

this and pass (Key,Value) pairs by

reference instead

• Requires different Mapper semantics!

Mapper 1 Mapper 1

Intermediates Intermediates

Reducer Reducer

Mapper 2 Mapper 2

Mapper 3 Mapper 3

Persistent Output Persistent Output

Hadoop: ChainMapper & ChainReducer

Below JobConf objects (deprecated in Hadoop 0.20.2).

No undeprecated replacement in 0.20.2…

Examples here work for later versions with small changes

Configuration conf = new Configuration();

JobConf job = new JobConf(conf);

...

boolean passByRef = false; // pass output (Key,Value) pairs to next Mapper by reference?

JobConf map1Conf = new JobConf(false);

ChainMapper.addMapper(job, Map1.class, Map1InputKey.class, Map1InputValue.class,

Map1OutputKey.class, Map1OutputValue.class, passByRef, map1Conf);

JobConf map2Conf = new JobConf(false);

ChainMapper.addMapper(job, Map2.class, Map1OutputKey.class, Map1OutputValue.class,

Map2OutputKey.class, Map2OutputValue.class, passByRef, map2Conf);

JobConf reduceConf = new JobConf(false);

ChainReducer.setReducer(job, Reducer.class, Map2OutputKey.class, Map2OutputValue.class,

ReducerOutputKey.class, ReducerOutputValue.class, passByRef, reduceConf)

JobConf map3Conf = new JobConf(false);

ChainReducer.addMapper (job, Map3.class, ReducerOutputKey.class, ReducerOutputValue.class,

Map3OutputKey.class, Map3OutputValue.class, passByRef, map3Conf)

JobClient.runJob(job);

Chaining in Hadoop

Let’s continue our running example:

Mapper 1: Tokenize

Mapper 2: Observe (count) words

Reducer: same IntSum reducer as always

Mapper 3 Log-dampen counts

• We didn’t have this in our pipeline example but we’ll add here…

Chained Tokenizer + WordCount

// Set up configuration and intermediate directory location

Configuration conf = new Configuration();

JobConf chainJob = new JobConf(conf);

chainJob.setJobName("Chain job");

chainJob.setJarByClass(ChainWordCount.class); // single jar for all Mappers and Reducers…

chainJob.setNumReduceTasks(reduceTasks);

FileInputFormat.setInputPaths(chainJob, new Path(inputPath));

FileOutputFormat.setOutputPath(chainJob, new Path(outputPath));

// pass output (Key,Value) pairs to next Mapper by reference?

boolean passByRef = false;

JobConf map1 = new JobConf(false); // tokenization

ChainMapper.addMapper(chainJob, AggressiveTokenizerMapper.class,

LongWritable.class, Text.class,

LongWritable.class, TextArrayWritable.class, passByRef, map1);

JobConf map2 = new JobConf(false); // Add token observer job

ChainMapper.addMapper(chainJob, TrivialWordObserver.class,

LongWritable.class, TextArrayWritable.class,

Text.class, LongWritable.class, passByRef, map2);

JobConf reduce = new JobConf(false); // Set the int sum reducer

ChainReducer.setReducer(chainJob, LongSumReducer.class, Text.class, LongWritable.class,

Text.class, LongWritable.class, passByRef, reduce);

JobConf map3 = new JobConf(false); // log-scaling of counts

ChainReducer.addMapper(chainJob, ComputeLogMapper.class, Text.class, LongWritable.class,

Text.class, FloatWritable.class, passByRef, map3);

JobClient.runJob(chainJob);

Hadoop Chaining: Pass by Reference

Chaining allows possible optimization

Chained mappers run in same JVM thread, so opportunity to avoid

serialization to/from disk with pipelined jobs

Also lesser benefit of avoiding extra object destruction / construction

Gotchas

OutputCollector.collect(K k, V v) promises not alter the content of k and v

But if Map1 passes (k,v) by reference to Map2 via collect(),

Map2 may alter (k,v) & thereby violate the contract

What to do?

Option 1: Honor the contract – don’t alter input (k,v) in Map2

Option 2: Re-negotiate terms – don’t re-use (k,v) in Map1 after collect()

Document carefully to avoid later changes silently breaking this…

Setting Dependencies Between Jobs

JobControl and Job provide the mechanism

New API: no JobConf, create Job from Configuration, …

// create jobconf1 and jobconf2 as appropriate

// …

Job job1=new Job(jobconf1)

Job job2=new Job(jobconf2);

job2.addDependingJob(job1);

JobControl jbcntrl=new JobControl("jbcntrl");

jbcntrl.addJob(job1);

jbcntrl.addJob(job2);

jbcntrl.run()

Higher Level Abstractions

Pig: language and execution environment for expressing

MapReduce data flows. (pretty much the standard)

See White, Chapter 11

Cascading: another environment with a higher level of

abstraction for composing complex data flows

See White, Chapter 16, pp 539-552

Cascalog: query language based on Cascading that uses

Clojure (a JVM-based LISP variant)

Word count in Cascalog

Certainly more concise – though you need to grok the syntax.

(?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/ count ?count))