Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… ·...

21
Introduction Background Methodology Summary Entity Resolution, Clustering Author References Vlad Shchogolev [email protected] May 1, 2007 Vlad Shchogolev [email protected] Entity Resolution

Transcript of Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… ·...

Page 1: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Entity Resolution, Clustering AuthorReferences

Vlad [email protected]

May 1, 2007

Vlad Shchogolev [email protected] Entity Resolution

Page 2: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Outline

1 IntroductionWhat is Entity Resolution?Motivation

2 BackgroundFormal DefinitionEfficieny ConsiderationsMeasuring Text SimilarityOther approaches

3 MethodologyClustering author referencesLearning distance functionClustering experiment

Vlad Shchogolev [email protected] Entity Resolution

Page 3: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

What is Entity Resolution?Motivation

Set-Up

Given: a table of records, each record has a number of“fields”.Problem: decide which pairs of records refer to the same“entity”.Variation: given two tables, pair up matching records

Vlad Shchogolev [email protected] Entity Resolution

Page 4: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

What is Entity Resolution?Motivation

A new name for an old research area

Record linkageOriginally studied by Dunn, 1946Formalized by Fellegi and Sunter, 1969

Merge/purge problemData matching, object identity problemCoreference resolution, reference reconciliation, etc.

Vlad Shchogolev [email protected] Entity Resolution

Page 5: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

What is Entity Resolution?Motivation

Why is this useful?

Historical researchMedical researchGovernment record-keepingTracking terroristsWal-MartYahoo Local, Google LocalBibliographic citations

Vlad Shchogolev [email protected] Entity Resolution

Page 6: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Formal Definition

Two datasets A and B, and let (a, b) ∈ A × B.Also define M as the set of matched pairs (a = b) and U asthe set of unmatched pairs.For a given pair, there is a list of comparison values for(c1, . . . cn) where ck is the comparison value for the k th

field of the item.

Vlad Shchogolev [email protected] Entity Resolution

Page 7: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Probabilistic Model, Fellegi and Sunter

Define quantities mk = P{ck = 0|(a, b) ∈ M} anduk = P{ck = 0|(a, b) ∈ U}Assume independence assumption (!)Weight assigned to each component is:

wk =

{log(mk/uk ) if ck = 0,log(1 − mk )/(1 − uk )) if ck = 1.

Equivalent to Naive Bayes

Vlad Shchogolev [email protected] Entity Resolution

Page 8: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Canopies

Need methods to find candidate pairsMcCallum, Nigam, Ungar introduced “canopies” approach:clustering is performed in two stages, starting with a roughstage that divides data into overlapping subsetsIf clustering measures distance to cluster using clustercentroid, and canopy is larger than true cluster, nothing islost.

Vlad Shchogolev [email protected] Entity Resolution

Page 9: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Clustering bibliographic references

Goal was to compute the citation graph for research papersFirst pass: used a fast TF-IDF approachSecond pass: used an expensive string edit distancecomputation, combined with a HMM for field extractionResults in equally good accuracy, but orders of magnitudefaster

Vlad Shchogolev [email protected] Entity Resolution

Page 10: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Measuring Text Similarity

Several methods exist to construct a similarity measure for text:edit distance (customizable costs)Jaro’s algorithm (transpositions)character N-gramsTF-IDFstring kernelsterm-vector dot productsoundex

Research typically finds that no single method is best

Vlad Shchogolev [email protected] Entity Resolution

Page 11: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Refinements

Minton, Nanjo et al. introduced “transformation graphs” tohandle higher level concepts:

synonymsmisspellingabbreviationacronymconcatenation

Again, a Naive Bayes approach is used to learn weights fortransformations.

Vlad Shchogolev [email protected] Entity Resolution

Page 12: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Formal DefinitionEfficiency ConsiderationsMeasuring Text SimilarityOther approaches

Domain-independent approach

Monge and Elkan suggest a “domain-independent”approachEach record is treat as a single long stringSimilarity is measured using edit distance

Vlad Shchogolev [email protected] Entity Resolution

Page 13: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Clustering author references

A related problem to bibliographic referencesExperiment run on a large biology research corpusGoal is to determine when two matching names (e.g.Smith J.) refer to the same personExtra structure: social network consisting of co-authorshipedges

Vlad Shchogolev [email protected] Entity Resolution

Page 14: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Learning distance function

Similar to distance metric learning paper, but not inEuclidean spaceNot domain-independentDomain knowledge used to pick from one of 3 comparisonfunctions for each field:

equalityset intersectioncharacter N-gram similarity

Most important feature: number of common co-authors

Vlad Shchogolev [email protected] Entity Resolution

Page 15: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Clustering experiment

Clustering name references can be useful for judgingco-authorship importanceTried experiments with simple Greedy AgglomerativeClustering.Measured within-cluster dispersion at each clustering step:

Data is clustered into k clusters C1, . . . CkSum of pairwise distances: Dr =

∑i,j∈Cr

di,j

Dispersion measure: Wk =∑k

r=11

2nrDr

Vlad Shchogolev [email protected] Entity Resolution

Page 16: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Gap Statistic

Can we estimate the true number of clusters?Define Gap Statistic (Tibshirani et al. 2000)

Gapn(k) = E∗n (log(Wk ))− log(Wk )

Expectation is over a sample from reference distributionThe quantity log(Wk ) can be thought of as log-likelihood

Vlad Shchogolev [email protected] Entity Resolution

Page 17: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Reference distribution

Gapn(k) = E∗n (log(Wk ))− log(Wk )

For uniform distribution, expectation should decrease atthe rate (2/p) log k .Better: a uniform distribution over a box align with theprincipal components.Goal is to produce evidence against the null model (singlecluster)

Vlad Shchogolev [email protected] Entity Resolution

Page 18: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Estimation Procedure

Sample B Monte Carlo reference datasets from referencedistributionCluster each sampled datasetEstimate the gap:

Gap(k) = (1/B)∑

b

log(W ∗kb)− log(Wk )

Pick smallest k such that Gap(k) > Gap(k + 1)− sk+1

Vlad Shchogolev [email protected] Entity Resolution

Page 19: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Estimation Procedure

Estimation performs poorly; reference distribution notappropriate?Euclidean distance metric not appropriate for highdimensionsHowever: have evidence that Wk graph has some signal

Vlad Shchogolev [email protected] Entity Resolution

Page 20: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Clustering authorsLearning distance functionClustering experiment

Two within-cluster dispersion graphs

Vlad Shchogolev [email protected] Entity Resolution

Page 21: Entity Resolution, Clustering Author Referencesjebara/6998/presentations/VladPresentation.… · Introduction Background Methodology Summary Clustering authors Learning distance function

IntroductionBackground

MethodologySummary

Summary

Rethink cluster estimation for this class of problemsDistance metric learning works well, but...Should take advantage of additional structure

Vlad Shchogolev [email protected] Entity Resolution