
Distance functions and IE -2

William W. Cohen

CALD

Announcements

• March 25 (Thurs) – talk from Carlos Guestrin (Assistant Prof in CALD as of fall 2004) on max-margin Markov nets
  – 9:30 am in NSH 1507
  – open to public – tell your friends!

• Datasets:
  – some public extraction data is (I hope readable) on /afs/cs/project/extract-learn/repository

• Writeups:
  – nothing today
  – "distance metrics for text" – three papers – due next Monday, 3/22

Record linkage: definition

• Record linkage: determine if pairs of data records describe the same entity
  – i.e., find record pairs that are co-referent
  – Entities: usually people (or organizations or …)
  – Data records: names, addresses, job titles, birth dates, …

• Main applications:
  – Joining two heterogeneous relations
  – Removing duplicates from a single relation

The data integration problem

• Control flow (modulo details about querying):
  – Extract (author, department) pairs from DB1
  – Extract (department, wwwServer) pairs from DB2
  – Execute the two-step plan to get the paper:
    • author -> department -> wwwServer
  – Two steps means matching (linking, integrating, deduping, ...) department names in DB1/DB2 (see the sketch below)
  – The issues are completely different if the user is executing a one-step plan:
    • a one-step plan is retrieval
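A minimal sketch (not from the slides) of the two-step plan with a soft match on department names. The relation contents, the similarity function (difflib as a stand-in for the string metrics discussed below), and the threshold are all illustrative assumptions:

import difflib

# toy versions of the two sources: DB1 = (author, department), DB2 = (department, wwwServer)
db1 = [("Cohen", "Ctr for Automated Learning & Discovery"),
       ("Guestrin", "Computer Science Dept")]
db2 = [("Center for Automated Learning and Discovery", "www.cald.cs.cmu.edu"),
       ("Computer Science Department", "www.cs.cmu.edu")]

def sim(a, b):
    # any string-similarity function could go here
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def two_step_plan(author, threshold=0.5):
    """Return (wwwServer, score) candidates for an author, best first."""
    results = []
    for a, dept1 in db1:
        if a != author:
            continue
        for dept2, server in db2:
            s = sim(dept1, dept2)
            if s >= threshold:          # the soft join on department names
                results.append((server, s))
    return sorted(results, key=lambda r: -r[1])

print(two_step_plan("Cohen"))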

String distance metrics: Levenshtein

• Edit-distance metrics
  – Distance is the shortest sequence of edit commands that transforms s to t.
  – Simplest set of operations:
    • Copy a character from s over to t
    • Delete a character in s (cost 1)
    • Insert a character in t (cost 1)
    • Substitute one character for another (cost 1)
  – This is "Levenshtein distance"

Computing Levenshtein distance – 4

D(i,j) = min { D(i-1,j-1) + d(si,tj),   // subst/copy
               D(i-1,j) + 1,            // insert
               D(i,j-1) + 1 }           // delete

      C  O  H  E  N
  M   1  2  3  4  5
  C   1  2  3  4  5
  C   2  3  3  4  5
  O   3  2  3  4  5
  H   4  3  2  3  4
  N   5  4  3  3  3

A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1)
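A minimal Python sketch of the recurrence above; the traceback that recovers the edit operations and alignment is omitted:

def levenshtein(s, t):
    m, n = len(s), len(t)
    # D[i][j] = distance between s[:i] and t[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1   # copy vs. substitute
            D[i][j] = min(D[i - 1][j - 1] + cost,     # subst/copy
                          D[i - 1][j] + 1,            # delete from s
                          D[i][j - 1] + 1)            # insert into t
    return D[m][n]

print(levenshtein("MCCOHN", "COHEN"))   # -> 3, the bottom-right cell above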

Smith-Waterman distance - 2

D(i,j) = max { 0,                        // start over
               D(i-1,j-1) - d(si,tj),    // subst/copy
               D(i-1,j) - G,             // insert
               D(i,j-1) - G }            // delete

G = 1
d(c,c) = -2
d(c,d) = +1

       C   O   H   E   N
  M   -1  -2  -3  -4  -5
  C    0   0  -1  -2  -3
  C   +1   0  -1  -2  -3
  O   -1  +2  +1   0  -1
  H   -2  +1  +4  +3  +2
  N   -3   0  +3  +3  +5

Smith-Waterman distance - 3

D(i,j) = max { 0,                        // start over
               D(i-1,j-1) - d(si,tj),    // subst/copy
               D(i-1,j) - G,             // insert
               D(i,j-1) - G }            // delete

G = 1
d(c,c) = -2
d(c,d) = +1

       C   O   H   E   N
  M    0   0   0   0   0
  C    0   0   0   0   0
  C   +1   0   0   0   0
  O    0  +2  +1   0   0
  H    0  +1  +4  +3  +2
  N    0   0  +3  +3  +5

Smith-Waterman distance - 5

      c  o  h  e  n  d  o  r  f
  m   0  0  0  0  0  0  0  0  0
  c   1  0  0  0  0  0  0  0  0
  c   0  0  0  0  0  0  0  0  0
  o   0  2  1  0  0  0  2  1  0
  h   0  1  4  3  2  1  1  1  0
  n   0  0  3  3  5  4  3  2  1
  s   0  0  2  2  4  4  3  2  1
  k   0  0  1  1  3  3  3  2  1
  i   0  0  0  0  2  2  2  2  1

dist = 5 (the maximum entry in the matrix)
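A Python sketch of the recurrence above, with the slide's parameters (match +2, mismatch -1, gap G = 1, and the zero "start over" floor). The printed score for the slide's example pair may differ slightly from the hand-drawn table, since exact cell values depend on initialization conventions:

def smith_waterman(s, t, match=2, mismatch=1, G=1):
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = match if s[i - 1] == t[j - 1] else -mismatch   # -d(si,tj)
            D[i][j] = max(0,                          # start over
                          D[i - 1][j - 1] + subst,    # subst/copy
                          D[i - 1][j] - G,            # insert
                          D[i][j - 1] - G)            # delete
            best = max(best, D[i][j])
    return best

# the slide reports dist = 5 for this pair
print(smith_waterman("mccohnski", "cohendorf"))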

Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)

• String s=A1 A2 ... AK, string t=B1 B2 ... BL

• sim’ is editDistance scaled to [0,1]

• Monge-Elkan’s “recursive matching scheme” is average maximal similarity of Ai to Bj:
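The formula itself did not survive transcription; the standard Monge-Elkan definition consistent with this text is sim(s,t) = (1/K) · sum over i of max over j of sim'(Ai, Bj). A sketch, using whitespace tokens as the substrings and difflib's ratio only as a stand-in for an edit distance scaled to [0,1]:

import difflib

def sim_prime(a, b):
    # any secondary similarity in [0,1] works; the slides use scaled edit distance
    return difflib.SequenceMatcher(None, a, b).ratio()

def monge_elkan(s, t):
    A = s.split()          # A1 ... AK
    B = t.split()          # B1 ... BL
    # average, over tokens Ai of s, of the best-matching token Bj of t
    return sum(max(sim_prime(a, b) for b in B) for a in A) / len(A)

print(monge_elkan("William W. Cohen", "Cohen, William"))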

Results: S-W from Monge & Elkan

Affine gap distances

• Smith-Waterman fails on some pairs that seem quite similar:

William W. Cohen

William W. ‘Don’t call me Dubya’ Cohen

Intuitively, a single long insertion is “cheaper” than a lot of short insertions

(The slide repeats the sentence above with single characters scattered throughout it, to illustrate how much more damaging many short insertions are.)

Affine gap distances - 2

• Idea:
  – Current cost of a "gap" of n characters: nG
  – Make this cost A + (n-1)B, where A is the cost of "opening" a gap and B is the cost of "continuing" a gap.
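For example, with illustrative costs G = 1, A = 1, and B = 0.5, one 10-character gap costs A + 9B = 5.5 instead of 10G = 10, while ten scattered one-character gaps still cost 10A = 10; so a single long insertion becomes much cheaper than many short ones.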

Affine gap distances - 3

Modify the Smith-Waterman recurrence so that inserts and deletes go through two new gap matrices:

D(i,j) = max { D(i-1,j-1) + d(si,tj),    // subst/copy
               IS(i-1,j-1) + d(si,tj),
               IT(i-1,j-1) + d(si,tj) }

IS(i,j) = max { D(i-1,j) - A,            // best score in which si
                IS(i-1,j) - B }          // is aligned with a 'gap'

IT(i,j) = max { D(i,j-1) - A,            // best score in which tj
                IT(i,j-1) - B }          // is aligned with a 'gap'
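A Python sketch of this recurrence, with the gap-open/extend costs A and B and the match/mismatch scores from the Smith-Waterman slides. Keeping the "start over" floor so the score stays local is my choice here, not something stated on this slide:

NEG = float("-inf")

def affine_gap_score(s, t, A=2.0, B=1.0, match=2.0, mismatch=-1.0):
    m, n = len(s), len(t)
    D  = [[NEG] * (n + 1) for _ in range(m + 1)]   # si aligned with tj
    IS = [[NEG] * (n + 1) for _ in range(m + 1)]   # si aligned with a gap
    IT = [[NEG] * (n + 1) for _ in range(m + 1)]   # tj aligned with a gap
    D[0][0] = 0.0
    best = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:
                IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)
            if j > 0:
                IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)
            if i > 0 and j > 0:
                d = match if s[i - 1] == t[j - 1] else mismatch
                D[i][j] = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) + d
                D[i][j] = max(D[i][j], 0.0)        # local alignment: start over
            best = max(best, D[i][j], IS[i][j], IT[i][j])
    return best

print(affine_gap_score("William W. Cohen",
                       "William W. 'Don't call me Dubya' Cohen"))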

Affine gap distances - 4

(State diagram: three states D, IS, IT. Edges from D to IS and from D to IT cost -A; the self-loops on IS and IT cost -B; edges into D from D, IS, or IT cost -d(si,tj).)

Affine gap distances – experiments (from McCallum,Nigam,Ungar KDD2000)

• Goal is to match data like this:

Affine gap distances – experiments (from McCallum,Nigam,Ungar KDD2000)

• Hand-tuned edit distance

• Lower costs for affine gaps

• Even lower cost for affine gaps near a “.”

• HMM-based normalization to group title, author, booktitle, etc. into fields (as in Borkar et al.)

Affine gap distances – experiments

(The transcript preserves four numbers per dataset but only two column headers, TFIDF and Edit Distance; the headers of the last two columns were lost.)

             TFIDF   Edit Distance   (?)     (?)
Cora         0.751   0.839           0.721    –
OrgName1     0.925   0.633           0.366   0.950
Orgname2     0.958   0.571           0.778   0.912
Restaurant   0.981   0.827           0.967   0.867
Parks        0.976   0.967           0.967   0.967

TFIDF distance for data integration

Experiments with WHIRL

Three ways to deal with output of IE systems

• Method 1.
  – Do the best you can at mapping the output into a conventional database (or KR system) with a natural schema (info about people, events, etc.)
  – Answer any questions with the existing DB
• Method 2.
  – Given a query, try and see how much the answer can be constrained by information derived from IE (somehow or other)
  – Probably requires some sort of uncertain reasoning.

• Birds: r(birdName,soundDescription) and 5 short descriptions of sounds (“an owl hooting”)

• Movies r(movieName,review) and 5 long, 5 short plot descriptions (“sci-fi comedy”, “serious czech movie”, ...)

Soft joins with “incompatible schemas”

WHIRL as a classification-learner

Classification with unlabeled “Background” instances

Example: instances are paper titles, background instances are paper abstracts

Very very short examples

Very short examples

Classifying short newswire headlines

Inference in WHIRL

• “Best-first” search: pick state s that is “best” according to f(s)

• Suppose graph is a tree, and for all s, s’, if s’ is reachable from s then f(s)>=f(s’). Then A* outputs the globally best goal state s* first, and then next best, ...
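A generic best-first / A*-style sketch of this idea (not WHIRL's actual code): states come off a priority queue ordered by f, so goal states are emitted best-first whenever f never increases along an edge. The function names and arguments are illustrative:

import heapq, itertools

def best_first(start, expand, is_goal, f, k=10):
    """Yield up to k goal states in non-increasing order of f."""
    counter = itertools.count()                 # tie-breaker so states are never compared
    frontier = [(-f(start), next(counter), start)]   # min-heap on -f => pop largest f first
    found = 0
    while frontier and found < k:
        neg_score, _, s = heapq.heappop(frontier)
        if is_goal(s):
            yield s, -neg_score
            found += 1
            continue
        for s2 in expand(s):
            heapq.heappush(frontier, (-f(s2), next(counter), s2))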

Inference in WHIRL

• Explode p(X1,X2,X3): find all DB tuples <p,a1,a2,a3> for p and bind Xi to ai.

• Constrain X~Y: if X is bound to a and Y is unbound,
  – find the DB column C to which Y should be bound
  – pick a term t in X, find the proper inverted index for t in C, and bind Y to something in that index
• Keep track of t's used previously, and don't allow Y to contain one.
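A toy sketch of the Constrain step (illustrative, not WHIRL's implementation): look up a term from X's binding in an inverted index over column C, bind Y to a value containing that term, and skip terms that were already tried:

from collections import defaultdict

def build_inverted_index(column_values):
    index = defaultdict(set)
    for value in column_values:
        for term in value.lower().split():
            index[term].add(value)
    return index

def constrain(x_value, column_index, used_terms):
    """Yield candidate bindings for Y, marking each term of X as used."""
    seen = set()
    for term in x_value.lower().split():
        if term in used_terms:
            continue
        used_terms.add(term)
        for candidate in column_index.get(term, ()):
            if candidate not in seen:
                seen.add(candidate)
                yield candidate

# usage with a hypothetical department column
column_C = ["Computer Science Department", "Machine Learning Department"]
index = build_inverted_index(column_C)
print(list(constrain("dept of computer science", index, used_terms=set())))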

Inference in WHIRL

Summary

• WHIRL finds the top k answers to a query
• Queries tend to be easy because either they're
  – unconstrained (e.g. a 2-way similarity join) => easy to find 100 or so "good" answers
  – highly constrained (e.g. a restricted similarity join, multi-way join, classification query, ...) => easy to present all the "reasonable" answers to a user
• Data integration usually considers matching two lists of entity descriptions in the abstract
  – unconstrained, sometimes under-constrained (what is a match to the end user?) – i.e., we don't know what the final query, and hence the final constraints, will turn out to be
  – this is evaluated a lot in experiments, but it is arguably the "wrong" problem