Distance functions and IE -2 William W. Cohen CALD.
![Page 1: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/1.jpg)
Distance functions and IE -2
William W. Cohen
CALD
![Page 2: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/2.jpg)
Announcements
• March 25 (Thurs): talk from Carlos Guestrin (Assistant Prof in CALD as of fall 2004) on max-margin Markov nets
– 9:30 am in NSH 1507
– open to public, so tell your friends!
• Datasets:
– some public extraction data is (I hope) readable on /afs/cs/project/extract-learn/repository
• Writeups:
– nothing today
– “distance metrics for text” (three papers) due next Monday, 3/22
![Page 3: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/3.jpg)
Record linkage: definition
• Record linkage: determine if pairs of data records describe the same entity
– I.e., find record pairs that are co-referent
– Entities: usually people (or organizations or…)
– Data records: names, addresses, job titles, birth dates, …
• Main applications:
– Joining two heterogeneous relations
– Removing duplicates from a single relation
![Page 4: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/4.jpg)
The data integration problem
• Control flow (modulo details about querying):
– Extract (author, department) pairs from DB1
– Extract (department, wwwServer) pairs from DB2
– Execute the two-step plan to get paper:
• author -> department -> wwwServer
– two steps means matching (linking, integrating, deduping, ....) department names in DB1/DB2
– issues are completely different if user is executing a one-step plan:
• one-step plan is retrieval
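The two-step plan can be sketched as a "soft" join on department names. This is only an illustration: the rows in `DB1` and `DB2` are made up, and the similarity function (`difflib.SequenceMatcher` from the Python standard library) and the 0.6 threshold are assumptions standing in for whatever distance function is actually used.

```python
from difflib import SequenceMatcher

# Hypothetical data: (author, department) pairs and (department, wwwServer) pairs.
DB1 = [("A. Author", "Computer Science Department"),
       ("B. Writer", "Electrical Engineering")]
DB2 = [("Computer Science Dept", "www.cs.example.edu"),
       ("Electrical Engineering Dept.", "www.ee.example.edu")]

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (a stand-in for a real string metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def soft_join(db1, db2, threshold=0.6):
    """author -> department -> wwwServer, linking department names approximately."""
    results = []
    for author, dept1 in db1:
        # Pick the most similar department name in DB2 for this DB1 row.
        best = max(db2, key=lambda row: similarity(dept1, row[0]), default=None)
        if best is not None and similarity(dept1, best[0]) >= threshold:
            results.append((author, dept1, best[0], best[1]))
    return results

for row in soft_join(DB1, DB2):
    print(row)
```

The join succeeds or fails entirely on how department names are matched, which is why the choice of distance function matters for the two-step plan.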
![Page 5: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/5.jpg)
String distance metrics: Levenshtein
• Edit-distance metrics
– Distance is the cost of the cheapest sequence of edit commands that transforms s to t.
– Simplest set of operations:
• Copy character from s over to t (cost 0)
• Delete a character in s (cost 1)
• Insert a character in t (cost 1)
• Substitute one character for another (cost 1)
– This is “Levenshtein distance”
![Page 6: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/6.jpg)
Computing Levenshtein distance – 4
$$D(i,j) = \min \begin{cases} D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\ D(i-1,j) + 1 & \text{// insert} \\ D(i,j-1) + 1 & \text{// delete} \end{cases}$$
C O H E N
M 1 2 3 4 5
C 1 2 3 4 5
C 2 2 3 4 5
O 3 2 3 4 5
H 4 3 2 3 4
N 5 4 3 3 3
A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1)
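A minimal sketch of the dynamic program, assuming the usual boundary conditions D(i,0) = i and D(0,j) = j (the slide does not state them) and d(si,tj) = 0 for a copy, 1 for a substitution:

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance between s and t via the D(i,j) recurrence."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundaries: turning a prefix into the empty string costs its length.
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1   # copy costs 0, substitution 1
            D[i][j] = min(D[i - 1][j - 1] + d,     # subst/copy
                          D[i - 1][j] + 1,         # insert (slide's label)
                          D[i][j - 1] + 1)         # delete (slide's label)
    return D[n][m]

print(levenshtein("MCCOHN", "COHEN"))  # 3
```

Keeping the whole table, rather than only the last two rows, is what makes it possible to follow a trace back from D(n,m) and read off an alignment.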
![Page 7: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/7.jpg)
Smith-Waterman distance - 2
$$D(i,j) = \max \begin{cases} 0 & \text{// start over} \\ D(i-1,j-1) - d(s_i,t_j) & \text{// subst/copy} \\ D(i-1,j) - G & \text{// insert} \\ D(i,j-1) - G & \text{// delete} \end{cases}$$
G = 1
d(c,c) = -2
d(c,d) = +1
C O H E N
M -1 -2 -3 -4 -5
C 0 0 -1 -2 -3
C +1 0 -1 -2 -3
O -1 +2 +1 0 -1
H -2 +1 +4 +3 +2
N -3 0 +3 +3 +5
![Page 8: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/8.jpg)
Smith-Waterman distance - 3
$$D(i,j) = \max \begin{cases} 0 & \text{// start over} \\ D(i-1,j-1) - d(s_i,t_j) & \text{// subst/copy} \\ D(i-1,j) - G & \text{// insert} \\ D(i,j-1) - G & \text{// delete} \end{cases}$$
G = 1
d(c,c) = -2
d(c,d) = +1
C O H E N
M 0 0 0 0 0
C 0 0 0 0 0
C +1 0 0 0 0
O 0 +2 +1 0 0
H 0 +1 +4 +3 +2
N 0 0 +3 +3 +5
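A minimal sketch of the Smith-Waterman recurrence with the slide's parameters (G = 1, d(c,c) = -2, d(c,d) = +1). The boundary row and column are assumed to be 0 and the score is taken as the maximum over the whole table. Those conventions are assumptions, so the numbers need not coincide cell for cell with the table above.

```python
def smith_waterman(s: str, t: str, G: float = 1,
                   d_match: float = -2, d_mismatch: float = 1) -> float:
    """Best local-alignment score; higher means more similar."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]   # assumed boundary: all zeros
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = d_match if s[i - 1] == t[j - 1] else d_mismatch
            D[i][j] = max(0,                        # start over
                          D[i - 1][j - 1] - d,      # subst/copy
                          D[i - 1][j] - G,          # insert
                          D[i][j - 1] - G)          # delete
            best = max(best, D[i][j])
    return best

print(smith_waterman("MCCOHN", "COHEN"))
```

The "start over" term is what makes the score local: a poorly matching prefix cannot drag down a well-matching substring that comes later.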
![Page 9: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/9.jpg)
Smith-Waterman distance - 5
c o h e n d o r f
m 0 0 0 0 0 0 0 0 0
c 1 0 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0 0 0
o 0 2 1 0 0 0 2 1 0
h 0 1 4 3 2 1 1 1 0
n 0 0 3 3 5 4 3 2 1
s 0 0 2 2 4 4 3 2 1
k 0 0 1 1 3 3 3 2 1
i 0 0 0 0 2 2 2 2 1
dist=5
![Page 10: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/10.jpg)
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
• String s=A1 A2 ... AK, string t=B1 B2 ... BL
• sim’ is editDistance scaled to [0,1]
• Monge-Elkan’s “recursive matching scheme” is average maximal similarity of Ai to Bj:
$$\mathrm{sim}(s,t) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1}^{L} \mathrm{sim}'(A_i, B_j)$$
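The average-of-maxima scheme can be sketched as follows. Two choices here are assumptions for illustration: substrings are whitespace-separated tokens, and sim’ is Levenshtein distance rescaled as 1 - dist / max(len), which is one simple way to map an edit distance into [0,1] and not necessarily the scaling WEBFIND used.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, keeping only the previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def sim_prime(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def monge_elkan(s: str, t: str) -> float:
    """Average over tokens A_i of s of the best sim' against any token B_j of t."""
    A, B = s.lower().split(), t.lower().split()
    if not A or not B:
        return 0.0
    return sum(max(sim_prime(a, b) for b in B) for a in A) / len(A)

print(monge_elkan("William Cohen", "Cohen William"))   # 1.0: token order is ignored
print(monge_elkan("William W. Cohen", "Will Cohn"))
```

Note that the score is asymmetric: it averages over the tokens of s only, so monge_elkan(s, t) and monge_elkan(t, s) can differ.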
![Page 11: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/11.jpg)
Results: S-W from Monge & Elkan
![Page 12: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/12.jpg)
Affine gap distances
• Smith-Waterman fails on some pairs that seem quite similar:
William W. Cohen
William W. ‘Don’t call me Dubya’ Cohen
Intuitively, a single long insertion is “cheaper” than a lot of short insertions
![Page 13: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/13.jpg)
Affine gap distances - 2
• Idea:
– Current cost of a “gap” of n characters: nG
– Make this cost: A + (n-1)B, where A is cost of “opening” a gap, and B is cost of “continuing” a gap.
![Page 14: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/14.jpg)
Affine gap distances - 3
With d as in Smith-Waterman (d(c,c) = -2, d(c,d) = +1):
$$D(i,j) = \max \begin{cases} D(i-1,j-1) - d(s_i,t_j) \\ IS(i-1,j-1) - d(s_i,t_j) \\ IT(i-1,j-1) - d(s_i,t_j) \end{cases}$$
$$IS(i,j) = \max \begin{cases} D(i-1,j) - A \\ IS(i-1,j) - B \end{cases}$$
Best score in which si is aligned with a ‘gap’
$$IT(i,j) = \max \begin{cases} D(i,j-1) - A \\ IT(i,j-1) - B \end{cases}$$
Best score in which tj is aligned with a ‘gap’
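A minimal sketch of the three-table recurrence, using the Smith-Waterman convention that d(c,c) = -2 and d(c,d) = +1 are subtracted from the score. Several things are assumptions here: the values A = 2 and B = 0.5, the boundary conditions (a leading gap of n characters costs A + (n-1)B), and scoring the full strings with no "start over" term.

```python
import math

def affine_gap_score(s: str, t: str, A: float = 2.0, B: float = 0.5,
                     d_match: float = -2.0, d_mismatch: float = 1.0) -> float:
    """Best alignment score of s and t when a gap of n characters costs A + (n-1)B."""
    NEG = -math.inf
    n, m = len(s), len(t)
    D = [[NEG] * (m + 1) for _ in range(n + 1)]    # s_i aligned with t_j
    IS = [[NEG] * (m + 1) for _ in range(n + 1)]   # s_i aligned with a gap
    IT = [[NEG] * (m + 1) for _ in range(n + 1)]   # t_j aligned with a gap
    D[0][0] = 0.0
    for i in range(1, n + 1):                      # assumed boundary: leading gap
        IS[i][0] = -A - (i - 1) * B
    for j in range(1, m + 1):
        IT[0][j] = -A - (j - 1) * B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = d_match if s[i - 1] == t[j - 1] else d_mismatch
            D[i][j] = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) - d
            IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)   # open or continue
            IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)   # open or continue
    return max(D[n][m], IS[n][m], IT[n][m])

# Four matches (+8) minus one gap of three characters (2 + 2 * 0.5 = 3).
print(affine_gap_score("abcd", "abXXXcd"))  # 5.0
```

With a linear cost of G = 1 per character the same gap would also cost 3, but as the insertion grows the affine cost rises by only B per extra character, which is the effect the "Dubya" example calls for.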
![Page 15: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/15.jpg)
Affine gap distances - 4
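[State diagram with three states D, IS, and IT. Edges into D (from D, IS, and IT) are labeled -d(si,tj); edges from D to IS and from D to IT are labeled -A; self-loops on IS and IT are labeled -B.]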
![Page 16: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/16.jpg)
Affine gap distances – experiments (from McCallum, Nigam, Ungar, KDD 2000)
• Goal is to match data like this:
![Page 17: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/17.jpg)
Affine gap distances – experiments (from McCallum, Nigam, Ungar, KDD 2000)
• Hand-tuned edit distance
• Lower costs for affine gaps
• Even lower cost for affine gaps near a “.”
• HMM-based normalization to group title, author, booktitle, etc into fields (as in Borkar et al)
![Page 18: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/18.jpg)
Affine gap distances – experiments
TFIDF Edit Distance
Cora 0.751 0.839
0.721
OrgName1 0.925 0.633
0.366 0.950
OrgName2 0.958 0.571
0.778 0.912
Restaurant 0.981 0.827
0.967 0.867
Parks 0.976 0.967
0.967 0.967
![Page 19: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/19.jpg)
TFIDF distance for data integration
Experiments with WHIRL
![Page 20: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/20.jpg)
![Page 21: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/21.jpg)
Three ways to deal with output of IE systems
• Method 1.
– Do the best you can at mapping the output into a conventional database (or KR system) with a natural schema (info about people, events, etc.)
– Answer any questions with the existing DB
• Method 2.
– Given a query, try and see how much the answer can be constrained by information derived from IE (somehow or other)
– Probably requires some sort of uncertain reasoning.
![Page 22: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/22.jpg)
![Page 23: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/23.jpg)
![Page 24: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/24.jpg)
![Page 25: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/25.jpg)
![Page 26: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/26.jpg)
![Page 27: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/27.jpg)
![Page 28: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/28.jpg)
![Page 29: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/29.jpg)
![Page 30: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/30.jpg)
![Page 31: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/31.jpg)
![Page 32: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/32.jpg)
![Page 33: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/33.jpg)
• Birds: r(birdName,soundDescription) and 5 short descriptions of sounds (“an owl hooting”)
• Movies: r(movieName,review) and 5 long, 5 short plot descriptions (“sci-fi comedy”, “serious Czech movie”, ...)
![Page 34: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/34.jpg)
![Page 35: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/35.jpg)
Soft joins with “incompatible schemas”
![Page 36: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/36.jpg)
![Page 37: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/37.jpg)
WHIRL as a classification-learner
![Page 38: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/38.jpg)
![Page 39: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/39.jpg)
![Page 40: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/40.jpg)
Classification with unlabeled “Background” instances
Example: instances are paper titles, background instances are paper abstracts
![Page 41: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/41.jpg)
![Page 42: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/42.jpg)
Very very short examples
Very short examples
Classifying short newswire headlines
![Page 43: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/43.jpg)
Inference in WHIRL
• “Best-first” search: pick state s that is “best” according to f(s)
• Suppose graph is a tree, and for all s, s’, if s’ is reachable from s then f(s)>=f(s’). Then A* outputs the globally best goal state s* first, and then next best, ...
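A minimal sketch of best-first search over a tree with a score that never increases along a path. The toy search space is made up: a state is a tuple of choices, one per level, and f is the product of the chosen scores, each at most 1, so f(s) >= f(s’) holds for every descendant s’. The names `LEVELS`, `children`, `f`, and `is_goal` are illustrative only.

```python
import heapq
import math
from itertools import count

# Hypothetical search space: choose one score per level.
LEVELS = [[0.9, 0.5], [0.8, 0.3]]

def children(state):
    """Extend a partial state with each option at the next level."""
    return [] if len(state) == len(LEVELS) else [state + (x,) for x in LEVELS[len(state)]]

def f(state):
    """Product of scores so far; extending a state can only lower it."""
    return math.prod(state)

def is_goal(state):
    return len(state) == len(LEVELS)

def best_first_goals(root):
    """Yield goal states in order of decreasing f."""
    tie = count()                                  # tie-breaker so states are never compared
    heap = [(-f(root), next(tie), root)]           # heapq is a min-heap, so negate f
    while heap:
        _, _, state = heapq.heappop(heap)
        if is_goal(state):
            yield state                            # no unexpanded state can beat it
            continue
        for child in children(state):
            heapq.heappush(heap, (-f(child), next(tie), child))

print(list(best_first_goals(())))
```

Because the generator is lazy, asking for only the first k goals stops the search early, which is what a top-k query needs.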
![Page 44: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/44.jpg)
Inference in WHIRL
• Explode p(X1,X2,X3): find all DB tuples <p,a1,a2,a3> for p and bind Xi to ai.
• Constrain X~Y: if X is bound to a and Y is unbound,
– find DB column C to which Y should be bound
– pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index
• Keep track of t’s used previously, and don’t allow Y to contain one.
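The constrain step can be sketched with a plain inverted index. This is an illustration under assumptions, not WHIRL's implementation: terms are lowercased whitespace tokens, the column contents are made up, and `constrain` is a hypothetical helper that lists, for each unused term t of X, the rows of column C that contain t but none of the previously used terms.

```python
from collections import defaultdict

def build_inverted_index(column):
    """Map each term to the set of row ids in the column whose text contains it."""
    index = defaultdict(set)
    for row_id, text in enumerate(column):
        for term in set(text.lower().split()):
            index[term].add(row_id)
    return index

def constrain(x_text, index, used_terms=()):
    """Candidate bindings for Y: {term t in X: rows containing t and no used term}."""
    excluded = set()
    for t in used_terms:                       # rows already reachable via an earlier t
        excluded |= index.get(t, set())
    candidates = {}
    for term in sorted(set(x_text.lower().split()) - set(used_terms)):
        rows = sorted(index.get(term, set()) - excluded)
        if rows:
            candidates[term] = rows
    return candidates

# Hypothetical column C.
C = ["carnegie mellon university", "university of pittsburgh", "mellon bank"]
index = build_inverted_index(C)
print(constrain("Carnegie Mellon", index))
print(constrain("Carnegie Mellon", index, used_terms=("carnegie",)))
```

Excluding rows that contain a previously used term keeps the same binding from being generated twice along different branches of the search.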
![Page 45: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/45.jpg)
Inference in WHIRL
![Page 46: Distance functions and IE -2 William W. Cohen CALD.](https://reader030.fdocuments.us/reader030/viewer/2022032802/56649e005503460f94ae8bb8/html5/thumbnails/46.jpg)
Summary
• WHIRL finds the top k answers to a query
• Queries tend to be easy because either they’re
– unconstrained (e.g. 2-way similarity join) => easy to find 100 or so “good” answers
– highly constrained (e.g. restricted sim join, multi-way join, classification query, ...) => easy to present all the “reasonable” answers to a user
• Data integration usually considers matching two lists of entity descriptions in the abstract
– unconstrained, sometimes underconstrained (what is a match to the end user?), i.e., we don’t know what the final query, and hence final constraints, will turn out to be.
– this is evaluated a lot in experiments, but in an ideal world it would not be: it is the “wrong” problem