Matching and Clustering Product Descriptions using Learned Similarity Metrics. William W. Cohen, Google & CMU. 2009 IJCAI Workshop on Information Integration and the Web. Joint work with Frank Lin, John Wong, Natalie Glance, Charles Schafer, Roy Tromble


Matching and Clustering Product Descriptions using Learned Similarity Metrics

William W. Cohen, Google & CMU

2009 IJCAI Workshop on Information Integration and the Web

Joint work with Frank Lin, John Wong, Natalie Glance, Charles Schafer, Roy Tromble


Scaling up Information Integration

• Small scale integration
– Few relations, attributes, information sources, …
– Integrate using knowledge-based approaches

• Medium scale integration
– More relations, attributes, information sources, …
– Statistical approaches work for entity matching (e.g., TFIDF) …

• Large scale integration
– Many relations, attributes, information sources, …
– Statistical approaches appropriate for more tasks
– Scalability issues are crucial


Scaling up Information Integration

• Outline:
– Product search as a large-scale II task
– Issue: determining identity of products with context-sensitive similarity metrics
– Scalable clustering techniques
– Conclusions


Google Product Search: A Large-Scale II Task


The Data

• Feeds from merchants
– Attribute/value data, where attribute & data can be any strings

• The web
– Merchant sites, their content and organization
– Review sites, their content and organization
– Images, videos, blogs, links, …

• User behavior
– Searches & clicks


Challenges: Identifying Bad Data

• Spam detection
• Duplicate merchants
• Porn detection
• Bad merchant names
• Policy violations


Challenges: Structured data from the web

• Offers from merchants
• Merchant reviews
• Product reviews
• Manufacturer specs
• ...


Challenges: Understanding Products

• Catalog construction
– Canonical description, feature values, price ranges, …

• Taxonomy construction
– Nerf gun is a kind of toy, not a kind of gun

• Opinion and mentions of products on the web

• Relationships between products
– Accessories, compatible replacements
– Identity


Challenges: Understanding Offers

• Identity
• Category
• Brand name
• Model number
• Price
• Condition
• ...

Plausible baseline for determining if two products are identical:
1) pick a feature set
2) measure similarity with cosine/IDF, ...
3) threshold appropriately


Advantages of cosine/IDF:
• Robust: works well for many types of entities
• Very fast to compute sim(x,y)
• Very fast to find y: sim(x,y) > θ using inverted indices
• Extensive prior work on similarity joins
• Setting IDF weights
– requires no labeled data
– requires only one pass over the unlabeled data
– is easily parallelized
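The three-step baseline above can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: the whitespace tokenizer, the `log(N/df)` form of IDF, the threshold `theta=0.5`, and the three toy offers are all assumptions made for the example.

```python
import math
from collections import Counter


def idf_weights(docs):
    """One pass over tokenized descriptions: IDF(f) = log(N / df(f))."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    n = len(docs)
    return {f: math.log(n / df) for f, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Unit-length TF-IDF vector stored as a dict (feature -> weight)."""
    counts = Counter(tokens)
    vec = {f: c * idf.get(f, 0.0) for f, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm > 0 else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(f, 0.0) for f, w in u.items())


def same_product(x, y, idf, theta=0.5):
    """Step 3: threshold the similarity (theta is an illustrative value)."""
    return cosine(tfidf_vector(x, idf), tfidf_vector(y, idf)) > theta


# Toy offers, loosely modeled on the examples in this talk.
offers = [
    "canon angle finder c 2882a002".split(),
    "canon 2882a002 angle finder c for eos rebel".split(),
    "aero tgx-series work table model 1tgx-4296".split(),
]
idf = idf_weights(offers)
v0, v1, v2 = (tfidf_vector(o, idf) for o in offers)
sim_01 = cosine(v0, v1)  # two descriptions of the same Canon item
sim_02 = cosine(v0, v2)  # unrelated items share no tokens
```

Because the vectors are sparse dicts, `cosine` only touches shared features, which is what makes inverted-index lookups of candidates with sim(x,y) > θ cheap.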


Product similarity: challenges

• Similarity can be high for descriptions of distinct items:

o AERO TGX-Series Work Table -42'' x 96'' Model 1TGX-4296 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop edge ...

o AERO TGX-Series Work Table -42'' x 48'' Model 1TGX-4248 All tables shipped KD AEROSPEC- 1TGX Tables are Aerospec Designed. In addition to above specifications; - All four sides have a V countertop .. 

• Similarity can be low for descriptions of identical items:

o Canon Angle Finder C 2882A002 Film Camera Angle Finders Right Angle Finder C (Includes ED-C & ED-D Adapters for All SLR Cameras) Film Camera Angle Finders & Magnifiers The Angle Finder C lets you adjust  ...

o  CANON 2882A002 ANGLE FINDER C FOR EOS REBEL® SERIES PROVIDES A FULL SCREEN IMAGE SHOWS EXPOSURE DATA BUILT-IN DIOPTRIC ADJUSTMENT COMPATIBLE WITH THE CANON® REBEL, EOS & REBEL EOS SERIES.


Product similarity: challenges

• Linguistic diversity and domain-dependent technical specs:
o "Camera angle finder" vs "right angle finder", "Dioptric adjustment"; "Aerospec Designed", "V countertop edge", ...

• Labeled training data is not easy to produce for subdomains

• Imperfect and/or poorly adopted standards for identifiers

• Different levels of granularity in descriptions
o Brands, manufacturer, …
o Product vs. product series
o Reviews of products vs. offers to sell products

• Each merchant is different: intra-merchant regularities can dominate the intra-product regularities


Clustering objects from many sources

• Possible approaches
– 1) Model the inter- and intra-source variability directly (e.g., Bagnell, Blei, McCallum UAI 2002; Bhattacharya & Getoor SDM 2006); latent variable for source-specific effects
– Problem: model is larger and harder to evaluate


– 2) Exploit background knowledge and use constrained clustering:
• Each merchant's catalog is duplicate-free
• If x and y are from the same merchant, constrain the clustering so that CANNOT-LINK(x,y)
– More realistically: locally dedup each catalog and use a soft constraint on clustering
• E.g., Oyama & Tanaka, 2008: distance metric learned from cannot-link constraints only, using quadratic programming
• Problem: expensive for very large datasets


Scaling up Information Integration

• Outline:
– Product search as a large-scale II task
– Issue: determining identity of products
• Merging many catalogs to construct a larger catalog
• Issues arising from having many source catalogs
• Possible approaches based on prior work
• A simple scalable approach to exploiting many sources: learning a distance metric
• Experiments with the new distance metric
– Scalable clustering techniques
– Conclusions


Clustering objects from many sources

Here: adjust the IDF importance weights for f using an easily-computed statistic CX(f).

• c_i is the source (“context”) of item x_i (the selling merchant)
• D_f is the set of items with feature f
• x_i ~ D_f is a uniform draw
• n_{c,f} is the number of items from c with feature f

plus smoothing


$\mathrm{CX.IDF}(f) = \mathrm{CX}(f) \cdot \mathrm{IDF}(f)$
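A rough sketch of how such weights might be computed in one counting pass follows. The definition of CX(f) did not survive in this transcript, so the form used here is an assumption: CX(f) is taken to be the negative log of the (smoothed) probability that two distinct items drawn uniformly from D_f come from the same source. The function name `cx_idf_weights`, the pair-counting estimate, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cx_idf_weights(items, alpha=0.5, beta=0.5):
    """items: list of (source, features). Returns feature -> CX(f) * IDF(f).

    ASSUMPTION: CX(f) = -log Pr(c_i = c_j | x_i, x_j ~ D_f), estimated from
    counts of same-source pairs with a Beta(alpha, beta) prior. This is an
    illustrative reading of the slide, not a confirmed formula.
    """
    n_items = len(items)
    n_cf = defaultdict(Counter)  # feature -> {source: #items with feature}
    for source, features in items:
        for f in set(features):
            n_cf[f][source] += 1

    weights = {}
    for f, per_source in n_cf.items():
        d_f = sum(per_source.values())                       # |D_f|
        same = sum(n * (n - 1) // 2 for n in per_source.values())
        total = d_f * (d_f - 1) // 2                         # all pairs in D_f
        p_same = (same + alpha) / (total + alpha + beta)     # smoothed
        cx = -math.log(p_same)
        idf = math.log(n_items / d_f)
        weights[f] = cx * idf
    return weights


# Toy data: "shipped-kd" is boilerplate from one merchant, while the
# model number "2882a002" is shared across two different merchants.
items = [
    ("m1", {"aero", "shipped-kd", "4296"}),
    ("m1", {"aero", "shipped-kd", "4248"}),
    ("m2", {"canon", "2882a002"}),
    ("m3", {"canon", "2882a002", "rebel"}),
]
weights = cx_idf_weights(items)
```

Under these assumptions, a feature that only ever co-occurs within one merchant's catalog is down-weighted relative to plain IDF, while a feature shared across merchants keeps a high weight.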


Motivations

• Theoretical: CX(f) is related to naïve Bayes weights for a classifier of pairs of items (x,y):
– Classification task: is the pair intra- or inter-source?
– Eliminating intra-source pairs enforces CANNOT-LINK constraints; using a naïve Bayes classifier approximates this
– Features of pair (x,y) are all common features of item x and item y
– Training data: all intra- and inter-source pairs
• Don’t need to enumerate them explicitly

• Experimental: coming up!


Smoothing the CX(f) weights

1. When estimating Pr( _ | xi,xj ), use a Beta distribution with (α,β)=(½,½).

2. When estimating Pr( _ | xi,xj ) for f use a Beta distribution with (α,β) computed from (μ,σ)

– Derived empirically using variant (1) on features “like f”—i.e., from the same dataset, same type, …

3. When computing cosine distance, add “correction” γ
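Variants 1 and 2 can be illustrated as follows. The sketch assumes that "use a Beta distribution" means taking the posterior mean of a binomial proportion under a Beta(α,β) prior, and that (α,β) is obtained from (μ,σ) by the usual method of moments; neither detail is spelled out on the slide, and the function names are hypothetical. The correction γ of variant 3 is not covered because its form is not given here.

```python
import math


def smoothed_estimate(k, n, alpha=0.5, beta=0.5):
    """Posterior mean of a proportion after k successes in n trials
    under a Beta(alpha, beta) prior (variant 1 uses alpha = beta = 1/2)."""
    return (k + alpha) / (n + alpha + beta)


def beta_from_moments(mu, sigma):
    """Method-of-moments Beta parameters with mean mu and std. dev. sigma
    (variant 2: an "informed" prior fit to features like f)."""
    var = sigma * sigma
    if not 0.0 < mu < 1.0 or var <= 0.0 or var >= mu * (1.0 - mu):
        raise ValueError("no Beta distribution has these moments")
    nu = mu * (1.0 - mu) / var - 1.0
    return mu * nu, (1.0 - mu) * nu


# With no data the estimate falls back to the prior mean.
prior_only = smoothed_estimate(0, 0)
# The uniform distribution on [0, 1] has mean 1/2 and variance 1/12,
# so the fitted parameters should be (1, 1).
a, b = beta_from_moments(0.5, math.sqrt(1.0 / 12.0))
informed = smoothed_estimate(3, 4, a, b)
```

The informed prior needs a first pass to collect the unsmoothed estimates that define (μ,σ), which is consistent with the extra pass over the data mentioned for that variant.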


Efficiency of setting CX.IDF

• Traditional IDF:
– One pass over the dataset to derive weights

• Estimation with (α,β)=(½,½):
– One pass over the dataset to derive weights
– Map-reduce can be used
– Correcting with fixed γ adds no learning overhead

• Smoothing with “informed priors”:
– Two passes over the dataset to derive weights
– Map-reduce can be used



Warmup: Experiments with k-NN classification

• Classification vs matching:
– better-understood problem with fewer “moving parts”

• Nine small classification datasets
– from Cohen & Hirsh, KDD 1998
– instances are short, name-like strings

• Use class label as context (metric learning)
– equivalent to MUST-LINK constraints
– stretch same-context features in “other” direction
• Heavier weight for features that co-occur in same-context pairs
• CX^{-1}.IDF weighting (aka IDF/CX).


Experiments with k-NN classification

Procedure:
• learn similarity metric from (labeled) training data
• for test instances, find the closest k=30 items in the training data and predict the distance-weighted majority class
• predict the majority class in the training data if there are no neighbors with similarity > 0
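The prediction step of this procedure might look like the sketch below. It assumes each neighbor's vote is weighted by its similarity to the query, which is one plausible reading of "distance-weighted"; the Jaccard similarity and the toy training set are stand-ins for the learned metric and the real datasets.

```python
import heapq
from collections import Counter, defaultdict


def knn_predict(query, training, sim, k=30):
    """training: list of (item, label); sim: similarity function.

    Votes are weighted by similarity (an assumption). Falls back to the
    majority class when no neighbor has similarity > 0.
    """
    scored = [(sim(query, item), label) for item, label in training]
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    neighbors = [(s, label) for s, label in top if s > 0]
    if not neighbors:
        return Counter(label for _, label in training).most_common(1)[0][0]
    votes = defaultdict(float)
    for s, label in neighbors:
        votes[label] += s
    return max(votes, key=votes.get)


def jaccard(a, b):
    """Token-set overlap; a placeholder for the learned similarity metric."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


training = [
    ("american robin", "bird"),
    ("european robin", "bird"),
    ("barn owl", "bird"),
    ("gray wolf", "mammal"),
    ("red fox", "mammal"),
]
pred_robin = knn_predict("robin redbreast", training, jaccard)
pred_fox = knn_predict("arctic fox", training, jaccard)
pred_unseen = knn_predict("trout", training, jaccard)  # no overlap at all
```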


Experiments with k-NN classification

[Figure: ratio of k-NN error to baseline k-NN error (lower is better) for (α,β)=(½,½) and for (α,β) from (μ,σ). * = statistically significantly better than baseline.]


Experiments Matching Bibliography Data

• Scraped LaTeX *.bib files from the web:
– 400+ files with 100,000+ bibentries
– All contain the phrase “machine learning”
– Generated 3,000,000 “weakly similar” pairs of bibentries
– Scored and ranked the pairs with IDF, CX.IDF, …

• Used paper URLs and/or DOIs to assess precision
– About 3% have useful identifiers
– Pairings between these 3% can be assessed as right/wrong

• Smoothing done using informed priors
– Unsmoothed weights averaged over all tokens in a specific bibliography entry field (e.g., author)

• Data is publicly available


Matching performance for bibliography entries

[Figure: interpolated precision versus rank (γ=10, R<10k) for (α,β)=(½,½), (α,β) from (μ,σ), and baseline IDF.]
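For reference, interpolated precision over a ranked list of pairs can be computed as in the sketch below. It assumes the standard IR definition (the interpolated value at a rank is the best precision at that rank or any later one) and assumes that pairs without usable identifiers are simply skipped; the exact evaluation protocol behind the plot is not stated here.

```python
def interpolated_precision(judgments):
    """judgments: ranked list of True / False / None, where None marks a
    pair that cannot be assessed (no usable identifiers).

    Returns [(rank, interpolated precision)] for the assessed pairs only.
    """
    raw = []
    correct = assessed = 0
    for rank, judged in enumerate(judgments, start=1):
        if judged is None:
            continue
        assessed += 1
        correct += 1 if judged else 0
        raw.append((rank, correct / assessed))

    # Sweep from the bottom of the ranking, carrying the best precision
    # seen so far back up to earlier ranks.
    best = 0.0
    result = []
    for rank, precision in reversed(raw):
        best = max(best, precision)
        result.append((rank, best))
    result.reverse()
    return result


curve = interpolated_precision([True, None, False, True, True])
```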


Known errors versus rank (γ=10, R<10k)

[Figure: curves for (α,β)=(½,½), (α,β) from (μ,σ), and baseline IDF.]


Matching performance for bibliography entries - at higher recall

Errors versus rank (γ=10, R>10k)


Experiments Matching Product Data

• Data from >700 web sites, merchants, hand-built catalogs
• Larger number of instances: > 40M
• Scored and ranked > 50M weakly similar pairs
• Hand-tuned feature set
– But tuned on an earlier version of the data

• Used hard identifiers (ISBN, GTIN, UPC) to assess accuracy
– More than half have useful hard identifiers
– Most hard identifiers appear only once or twice


Experiments Matching Product Data

[Figure: results on product data for (α,β)=(½,½), (α,β) from (μ,σ), and baseline IDF.]


Experiments with product data

[Figure: results on product data for (α,β)=(½,½), (α,β) from (μ,σ), and baseline IDF.]


Scaling up Information Integration

• Outline:
– Product search as a large-scale II task
– Issue: determining identity of products with context-sensitive similarity metrics
– Scalable clustering techniques (w/ Frank Lin)
• Background on spectral clustering techniques
• A fast approximate spectral technique
• Theoretical justification
• Experimental results
– Conclusions


Spectral Clustering: Graph = Matrix

[Figure: a graph on nodes A–J next to its adjacency matrix, with a 1 in each cell where two nodes are linked.]


Spectral Clustering: Graph = Matrix
Transitively Closed Components = “Blocks”

[Figure: the graph on nodes A–J next to its adjacency matrix, with rows and columns ordered A through J.]

Of course we can’t see the “blocks” unless the nodes are sorted by cluster…


Spectral Clustering: Graph = Matrix
Vector = Node Weight

[Figure: the graph on nodes A–J, its adjacency matrix M, and a vector v that assigns a weight to each node (shown: A = 3, B = 2, C = 3).]


Spectral Clustering: Graph = Matrix
M*v1 = v2 “propagates weights from neighbors”

[Figure: the adjacency matrix M times the node-weight vector v1 (A = 3, B = 2, C = 3) gives v2, whose entries add up the weights of each node's neighbors: A = 2*1+3*1+0*1, B = 3*1+3*1, C = 3*1+2*1.]


Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”

[Figure: the normalized matrix W times v1 (A = 3, B = 2, C = 3) gives v2: A = 2*.5+3*.5+0*.3, B = 3*.3+3*.5, C = 3*.33+2*.5.]

W: normalized so columns sum to 1
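The propagation step is just a matrix-vector product, as the small sketch below shows. The four-node graph (a triangle A, B, C with a pendant node D attached to C) and its weights are made up for illustration, since the entries of the slide's ten-node matrix are not recoverable here.

```python
def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def column_normalize(m):
    """Scale each column so it sums to 1 (columns of zeros are left alone)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [
        [m[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical graph: triangle A-B-C plus D linked only to C.
#    A  B  C  D
M = [
    [0, 1, 1, 0],  # A
    [1, 0, 1, 0],  # B
    [1, 1, 0, 1],  # C
    [0, 0, 1, 0],  # D
]
v1 = [3, 2, 3, 1]

v2 = matvec(M, v1)            # each node receives its neighbors' weights
W = column_normalize(M)
v2_normalized = matvec(W, v1)  # each node splits its weight among neighbors
```

With column normalization every node hands out exactly its own weight, so the total weight in the graph is unchanged by the multiplication.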


Spectral Clustering: Graph = Matrix
W*v1 = v2 “propagates weights from neighbors”

$W v = \lambda v$: $v$ is an eigenvector with eigenvalue $\lambda$

[Shi & Meila, 2002]

[Figure: the eigenvalues λ1, λ2, λ3, λ4, λ5,6,7,… with the eigenvectors e1, e2, e3; the “eigengap” is marked.]

[Figure, after Shi & Meila, 2002: points labeled x, y, and z plotted against eigenvector axes (e1, e2, e3), with tick marks running from -0.4 to 0.4.]


If W is connected but roughly block diagonal with k blocks then
• the “top” eigenvector is a constant vector
• the next k eigenvectors are roughly piecewise constant with “pieces” corresponding to blocks

Spectral clustering:
• Find the top k+1 eigenvectors v1,…,vk+1
• Discard the “top” one
• Replace every node a with k-dimensional vector xa = <v2(a),…,vk+1(a)>
• Cluster with k-means


Spectral Clustering: Pros and Cons

• Elegant, and well-founded mathematically
• Works quite well when relations are approximately transitive (like similarity)
• Very noisy datasets cause problems
– “Informative” eigenvectors need not be in top few
– Performance can drop suddenly from good to terrible

• Expensive for very large datasets
– Computing eigenvectors is the bottleneck

• There is a very scalable way to compute the top eigenvector


Aside: power iteration to compute the top eigenvector

• Let v_0 be almost any vector

• Repeat until convergence (c is a normalizer):
– v_t = c W*v_{t-1}

• This is how PageRank is computed
– For a different W

• This converges to the top eigenvector
– Which in this case is constant
– But …
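A minimal power-iteration sketch follows. It assumes W is obtained by normalizing each row of the affinity matrix to sum to 1, so that the constant vector is exactly an eigenvector with eigenvalue 1; the L1 normalizer, the tolerance, and the four-node example graph are illustrative choices rather than details from the talk.

```python
def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def row_normalize(a):
    """W = D^-1 A: divide each row by its sum (assumes no all-zero rows)."""
    return [[a_ij / sum(row) for a_ij in row] for row in a]


def power_iteration(w, v0, tol=1e-10, max_iter=10000):
    """Repeat v_t = c * W * v_{t-1}, with c chosen so the entries sum to 1
    in absolute value, until successive vectors differ by less than tol."""
    v = list(v0)
    for _ in range(max_iter):
        nxt = matvec(w, v)
        c = 1.0 / sum(abs(x) for x in nxt)
        nxt = [c * x for x in nxt]
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v


# Hypothetical connected graph: triangle A-B-C plus D linked only to C.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
W = row_normalize(A)
top = power_iteration(W, [1.0, 0.0, 0.0, 0.0])
```

On this graph the result is the constant vector (every entry 1/4), which carries no cluster information; what matters for clustering is what the iterates look like before they get there.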


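Convergence of PI for a clustering problem
* each box is rescaled to same vertical range

[Figure: boxes showing the power-iteration vector at successive stages; the scale annotations read “smaller” and “larger”.]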


Explanation: ???

ei is i-th eigenvector


(converges to zero even more quickly)

(converges to zero quickly)


eigenvectors are piecewise constant across the clusters and some pair of constants is really different for each pair of clusters
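One standard way to write this argument down is the eigen-expansion of the power-iteration vector. The block below is a sketch under stated assumptions, not a transcription of the slides: W is taken to be diagonalizable with eigenvalues ordered so that $1 = \lambda_1 > |\lambda_2| \ge \dots \ge |\lambda_n|$, the start vector is written as $v^0 = \sum_i c_i e_i$ with $c_1 \neq 0$, the coefficients $c_i$ are notation introduced here, and the normalizer applied at each step is ignored.

```latex
\begin{aligned}
v^t = W^t v^0
  &= \sum_{i=1}^{n} c_i \lambda_i^t e_i \\
  &= c_1 \lambda_1^t \left[ e_1
     + \sum_{i=2}^{n} \frac{c_i}{c_1}
       \left( \frac{\lambda_i}{\lambda_1} \right)^{t} e_i \right]
\end{aligned}
```

Terms with a small ratio $|\lambda_i / \lambda_1|$ die out after a few iterations, while terms whose ratio is close to 1 linger. When there is an eigengap after the first few eigenvalues, the lingering terms are the piecewise-constant eigenvectors, so an intermediate $v^t$ is itself close to piecewise constant across the clusters.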


Explanation: the signal approximates spectral clustering’s distance

spec(a,b): the distance between a and b in the k-means space

…but all the pic(a,b) distances are in a small radius:


PIC: Power Iteration Clustering

Details:
• run k-means 10 times and pick best output (by intra-cluster similarity)
• stopping condition: acceleration < 10^-5/n
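Putting the pieces together, PIC might be sketched as below. Several details are assumptions made for the example: W is row-normalized, the start vector is random, "acceleration" is read as the change in the max-norm step size between consecutive iterations, and the best k-means run is chosen by lowest within-cluster squared error on the one-dimensional embedding (a stand-in for the intra-cluster similarity criterion above).

```python
import random


def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def row_normalize(a):
    """W = D^-1 A (assumes every node has at least one neighbor)."""
    return [[a_ij / sum(row) for a_ij in row] for row in a]


def kmeans_1d(values, k, rng, iters=100):
    """Plain Lloyd's algorithm on scalars; returns (labels, squared error)."""
    centers = rng.sample(values, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c]))
                  for x in values]
        updated = []
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members
                           else centers[c])
        if updated == centers:
            break
        centers = updated
    sse = sum((x - centers[lab]) ** 2 for x, lab in zip(values, labels))
    return labels, sse


def pic(affinity, k, seed=0, max_iter=1000, restarts=10):
    """Power iteration with early stopping, then k-means on the result."""
    n = len(affinity)
    w = row_normalize(affinity)
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n)]
    total = sum(v)
    v = [x / total for x in v]
    prev_delta = None
    for _ in range(max_iter):
        nxt = matvec(w, v)
        total = sum(abs(x) for x in nxt)
        nxt = [x / total for x in nxt]
        delta = max(abs(a - b) for a, b in zip(nxt, v))
        v = nxt
        # Stop when the step size has stopped changing ("acceleration").
        if prev_delta is not None and abs(prev_delta - delta) < 1e-5 / n:
            break
        prev_delta = delta
    runs = [kmeans_1d(v, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])[0]


# Hypothetical affinity matrix: two 5-node cliques joined by one weak edge.
n = 10
affinity = [[0.0] * n for _ in range(n)]
for block in (range(0, 5), range(5, 10)):
    for i in block:
        for j in block:
            if i != j:
                affinity[i][j] = 1.0
affinity[4][5] = affinity[5][4] = 0.1
labels = pic(affinity, k=2)
```

The only expensive operation is the repeated matrix-vector product, which costs time proportional to the number of nonzero affinities per iteration when W is stored sparsely.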


Experimental Results


Experimental results: best-case assignment of class labels to clusters


Experiments: run time and scalability

Time in millisec



Summary

• Large-scale integration:
– New statistical approaches: assuming huge numbers of objects, relations, and sources of information
– Simplicity and scalability are crucial

• CX.IDF is an extension of IDF weighting
– Exploits statistics in data merged from many locally-deduped sources, a very common integration scenario
– Weights can be “learned” without labeling
– Weight “learning” requires 2-3 passes over the data
– Errors are reduced significantly relative to IDF
• 20% lower error on average for classification
• Up to 65% lower error in matching tasks at high recall levels
• Very high precision possible at lower recall levels


• CX.IDF is an extension of IDF weighting
– Simple, scalable, parallelizable

• PIC is a very scalable clustering method
– Formally, works when spectral techniques work
– Experimentally, often better than traditional spectral methods
– Based on power iteration on a normalized matrix with early stopping
– Experimentally, linear time
– Easy to implement and efficient
– Very easily parallelized


Questions...?