Name Phylogeny - A Generative Model of String Variation

53
Name Phylogeny A Generative Model of String Variation Nicholas Andrews, Jason Eisner and Mark Dredze Department of Computer Science, Johns Hopkins University EMNLP 2012 – Thursday, July 12

Transcript of Name Phylogeny - A Generative Model of String Variation

Page 1: Name Phylogeny - A Generative Model of String Variation

Name PhylogenyA Generative Model of String Variation

Nicholas Andrews, Jason Eisner and Mark Dredze

Department of Computer Science,Johns Hopkins University

EMNLP 2012 – Thursday, July 12

Page 2: Name Phylogeny - A Generative Model of String Variation

Outline

Introduction

Generative Model

Mutation Model

Inference

Experiments

Future Work

Page 3: Name Phylogeny - A Generative Model of String Variation

What’s a name phylogeny?

A fragment of a “name phylogeny” learned by our model

Khawaja Gharibnawaz Muinuddin Hasan Chisty

Khwaja Gharib Nawaz

Khwaja Muin al-Din Chishti

Ghareeb Nawaz

Khwaja Moinuddin Chishti

Khwaja gharibnawazMuinuddin Chishti

Thomas Ruggles Pynchon, Jr.

Thomas Ruggles Pynchon Jr.

Thomas R. Pynchon, Jr.

Thomas R. Pynchon Jr.

Thomas R. Pynchon

Thomas Pynchon, Jr.

Thomas Pynchon Jr.

I Each edge corresponds to a “mutation”

Page 4: Name Phylogeny - A Generative Model of String Variation

Problem: organizing disorganized collections of strings

Barack Obama

Obama

President Barack Obama

Barack

Barrackbarack obama

Hillary Clinton

Clinton

Bill Clinton

billBill

Barry

Vice President Clinton

Billy

Hillary

will clinton

Hillary Rodham Clinton

Mitt RomneyBarack Obama Sr

Romney

Willard M. Romney

Governor Mitt Romney

Mr. Romney

mittMitt rommey

clinton

William Clinton

barak

President Bill Clinton

President

Barack H. Obama

Ms. Clinton

Page 5: Name Phylogeny - A Generative Model of String Variation

Problem: organizing disorganized collections of strings

Barack Obama

Obama

President Barack Obama

Barack

Barrackbarack obama

Hillary ClintonClinton

Bill Clinton

bill Bill

Barry

Vice President Clinton

Billy

Hillary

will clinton

Hillary Rodham Clinton

Mitt RomneyBarack Obama Sr

Romney

Willard M. Romney

Governor Mitt Romney

Mr. RomneymittMitt rommey

clinton

William Clinton

barak

President Bill Clinton

President

Barack H. Obama

Ms. Clinton

Page 6: Name Phylogeny - A Generative Model of String Variation

Challenges

I Name variation: the same entity may have different names,and a good measure of “similarity” between strings may notbe available (This work)

I Disambiguation: different entities may have names incommon, requiring the use of context to disambiguatebetween them

Barack Obama

Obama

President Barack Obama

Barack

Barrackbarack obama

Hillary ClintonClinton

Bill Clinton

bill Bill

Barry

Vice President Clinton

Billy

Hillary

will clinton

Hillary Rodham Clinton

Mitt RomneyBarack Obama Sr

Romney

Willard M. Romney

Governor Mitt Romney

Mr. RomneymittMitt rommey

clinton

William Clinton

barak

President Bill Clinton

President

Barack H. Obama

Ms. Clinton

Page 7: Name Phylogeny - A Generative Model of String Variation

How does a name phylogeny help?

1. Organizes name variants into connected components (clusters)

Khawaja Gharibnawaz Muinuddin Hasan Chisty

Khwaja Gharib Nawaz

Khwaja Muin al-Din Chishti

Ghareeb Nawaz

Khwaja Moinuddin Chishti

Khwaja gharibnawazMuinuddin Chishti

Thomas Ruggles Pynchon, Jr.

Thomas Ruggles Pynchon Jr.

Thomas R. Pynchon, Jr.

Thomas R. Pynchon Jr.

Thomas R. Pynchon

Thomas Pynchon, Jr.

Thomas Pynchon Jr.

2. Align names as “mutations” of one another

Khawaja Gharibnawaz Muinuddin Hasan Chisty

Khwaja Gharib Nawaz

Khwaja Muin al-Din Chishti

Ghareeb Nawaz

Khwaja Moinuddin Chishti

Khwaja gharibnawazMuinuddin Chishti

Thomas Ruggles Pynchon, Jr.

Thomas Ruggles Pynchon Jr.

Thomas R. Pynchon, Jr.

Thomas R. Pynchon Jr.

Thomas R. Pynchon

Thomas Pynchon, Jr.

Thomas Pynchon Jr.

3. We can estimate a mutation model given a phylogeny, and amutation model gives a distribution over phylogenies (→ EM)

Page 8: Name Phylogeny - A Generative Model of String Variation

Outline

Introduction

Generative Model

Mutation Model

Inference

Experiments

Future Work

Page 9: Name Phylogeny - A Generative Model of String Variation

Generative Model

We propose a generative model for string variation explaining thereasons for name variation.

...x10001 = Mitt Romneyx10002 = President Barack Obamax10003 = Barack Obamax10004 = Secretary of State Hillary Clintonx10005 = Hillary Clintonx10006 = Barack Obamax10007 = Clintonx10008 = Obama...

What are the sources of variation for names?

Page 10: Name Phylogeny - A Generative Model of String Variation

Copying a previous mention

We can copy a name seen before....x10001 = Mitt Romneyx10002 = President Barack Obamax10003 = Barack Obamax10004 = Secretary of State Hillary Clintonx10005 = Hillary Clintonx10006 = Barack Obamax10007 = Clintonx10008 = Obama...x100001 = Barack Obama

Procedure:

I Select a previous name mention uniformly at random

I Decide to copy it with probability 1− µ

Page 11: Name Phylogeny - A Generative Model of String Variation

Mutating a previous mention

We can mutate a name seen before....x10001 = Mitt Romneyx10002 = President Barack Obamax10003 = Barack Obamax10004 = Secretary of State Hillary Clintonx10005 = Hillary Clintonx10006 = Barack Obamax10007 = Clintonx10008 = Obama...x100001 = Mitt

Procedure:

I Select a previous name mention uniformly at random

I Decide to mutate it with probability µ

I Sample a mutation from p(· | Mitt Romney)

Page 12: Name Phylogeny - A Generative Model of String Variation

Generating a new name

We can generate a new name.

...x10001 = Mitt Romneyx10002 = President Barack Obamax10003 = Barack Obamax10004 = Secretary of State Hillary Clintonx10005 = Hillary Clintonx10006 = Barack Obamax10007 = Clintonx10008 = Obama...x100001 = Joe Biden

Procedure:

I Select ♦ with probability proportional to α (a “pseudocount”)I Sample a new name from p(· | ♦)

I A character language model

Page 13: Name Phylogeny - A Generative Model of String Variation

Generative model summary

To generate the next name mention:

1. Pick an existing name mention w with probability 1/(α + k)

1.1 Copy w verbatim with probability 1− µ1.2 Mutate w with probability µ

2. Decide to talk about a new entity with probability α/(α + k)

2.1 Generate a name for it

Page 14: Name Phylogeny - A Generative Model of String Variation

Generative model in action

x10001 = Mitt Romneyx10002 = President Barack Obamax10003 = Barack Obamax10004 = Secretary of State Hillary Clintonx10005 = Hillary Clintonx10006 = Barack Obamax10007 = Clinton

Mitt Romney President Barack Obama Secretary of State Hillary Clinton

Barack Obama Hillary Clinton

Barack Obama Clinton

Obama

x10008 = Obama

...

Page 15: Name Phylogeny - A Generative Model of String Variation

Generative model in action

Mitt Romney President Barack Obama Secretary of State Hillary Clinton

Barack Obama Hillary Clinton

Barack Obama Clinton

Obama

x10008 = Obamax10009 = Mitt

Mitt

x10001 = Mitt Romneyx10002 = President Barack Obamax10003 = Barack Obamax10004 = Secretary of State Hillary Clintonx10005 = Hillary Clintonx10006 = Barack Obamax10007 = Clinton

...

Page 16: Name Phylogeny - A Generative Model of String Variation

Generative model in action

Mitt Romney President Barack Obama Secretary of State Hillary Clinton

Barack Obama Hillary Clinton

Barack Obama Clinton

Obama

x10008 = Obamax10009 = Mittx10010 = Barack

Mitt

Barack

x10001 = Mitt Romneyx10002 = President Barack Obamax10003 = Barack Obamax10004 = Secretary of State Hillary Clintonx10005 = Hillary Clintonx10006 = Barack Obamax10007 = Clinton

...

Page 17: Name Phylogeny - A Generative Model of String Variation

Generative model in action

Mitt Romney President Barack Obama Secretary of State Hillary Clinton

Barack Obama Hillary Clinton

Barack Obama Clinton

Obama

x10008 = Obamax10009 = Mittx10010 = Barackx10011 = Barry

Mitt

Barack

Barry

x10001 = Mitt Romneyx10002 = President Barack Obamax10003 = Barack Obamax10004 = Secretary of State Hillary Clintonx10005 = Hillary Clintonx10006 = Barack Obamax10007 = Clinton

...

Page 18: Name Phylogeny - A Generative Model of String Variation

Generative model in action

Mitt Romney President Barack Obama Secretary of State Hillary Clinton

Barack Obama Hillary Clinton

Barack Obama Clinton

Obama

x10008 = Obamax10009 = Mittx10010 = Barackx10011 = Barryx10012 = Hillary Clinton

Mitt

Barack

BarryHillary Clinton

...

x10001 = Mitt Romneyx10002 = President Barack Obamax10003 = Barack Obamax10004 = Secretary of State Hillary Clintonx10005 = Hillary Clintonx10006 = Barack Obamax10007 = Clinton

Page 19: Name Phylogeny - A Generative Model of String Variation

A few observations

I The proposed generative model is clearly naiveI No model of discourse or of name structure

I The pseudocount α controls the likelihood of new names

I We assume a low mutation probability µ, so that most namesare copied from earlier frequent names

Page 20: Name Phylogeny - A Generative Model of String Variation

Outline

Introduction

Generative Model

Mutation Model

Inference

Experiments

Future Work

Page 21: Name Phylogeny - A Generative Model of String Variation

Name variation as mutations

“Mutations” capture different types of name variation:

1. Transcription errors: Barack → barack

2. Misspellings: Barack → Barrack

3. Abbreviations: Barack Obama → Barack O.

4. Nicknames: Barack → Barry

5. Dropping words: Barack Obama → Barack

Page 22: Name Phylogeny - A Generative Model of String Variation

Mutation via probabilistic finite-state transducers

The mutation model is a probabilistic finite-state transducerwith four character operations: copy, substitute, delete,insert

I Character operations are conditioned on the right inputcharacter

I Latent regions of contiguous edits

I Back-off smoothing

Transducer parameters θ determine the probability of being indifferent regions, and of the different character operations

Page 23: Name Phylogeny - A Generative Model of String Variation

Example: Mutating a name

Mr. Robert Kennedy

Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $M r . _[

Beginning of edit region

Example mutation

Page 24: Name Phylogeny - A Generative Model of String Variation

Example: Mutating a name

Mr. Robert Kennedy

Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $M r . _[B

1 substitution operation: (R, B)

Example mutation

Page 25: Name Phylogeny - A Generative Model of String Variation

Example: Mutating a name

Mr. Robert Kennedy

Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $M r . _[B o b

2 copy operations: (ε, o), (ε, b)

Example mutation

Page 26: Name Phylogeny - A Generative Model of String Variation

Example: Mutating a name

Mr. Robert Kennedy

Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $M r . _[B o b

3 deletion operations: (e,ε), (r,ε), (t, ε)

Example mutation

Page 27: Name Phylogeny - A Generative Model of String Variation

Example: Mutating a name

Mr. Robert Kennedy

Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y$M r . _[B o b b y

2 insertion operations: (ε,b), (ε,y)

Example mutation

Page 28: Name Phylogeny - A Generative Model of String Variation

Example: Mutating a name

Mr. Robert Kennedy

Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $M r . _[B o b b y]

End of edit region

Example mutation

Page 29: Name Phylogeny - A Generative Model of String Variation

Example: Mutating a name

Mr. Robert Kennedy

Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $M r . _[B o b b y]_ K e n n e d y $

Example mutation

Page 30: Name Phylogeny - A Generative Model of String Variation

Outline

Introduction

Generative Model

Mutation Model

Inference

Experiments

Future Work

Page 31: Name Phylogeny - A Generative Model of String Variation

Inference

Input: An unaligned corpus of names (“bag-of-words”)

I The order in which the tokens were generated is unknown

I No “inputs” or “outputs” are known for the mutation model

Barack Obama

Obama

President Barack Obama

Barack

Barrackbarack obama

Hillary Clinton

Clinton

Bill Clinton

billBill

Barry

Vice President Clinton

Billy

Hillary

will clinton

Hillary Rodham Clinton

Mitt RomneyBarack Obama Sr

Romney

Willard M. Romney

Governor Mitt Romney

Mr. Romney

mittMitt rommey

clinton

William Clinton

barak

President Bill Clinton

President

Barack H. Obama

Ms. Clinton

Output: A distribution over name phylogenies parametrized bytransducer parameters θ

Page 32: Name Phylogeny - A Generative Model of String Variation

Observed vs unobserved names

Could there be latent forms in the phylogeny?

?

Khwaja Gharib Nawaz

Khwaja Muin al-Din Chishti

Ghareeb Nawaz Khwaja gharibnawazMuinuddin Chishti

?

Page 33: Name Phylogeny - A Generative Model of String Variation

Observed vs unobserved names

Khawaja Gharibnawaz Muinuddin Hasan Chisty

Khwaja Gharib Nawaz

Khwaja Muin al-Din Chishti

Ghareeb Nawaz

Khwaja Moinuddin Chishti

Khwaja gharibnawazMuinuddin Chishti

What we'd like to do:

Khwaja Gharib Nawaz

Khwaja Muin al-Din Chishti

Ghareeb Nawaz Khwaja gharibnawazMuinuddin Chishti

What we actually do:

Page 34: Name Phylogeny - A Generative Model of String Variation

Type phylogeny vs token phylogeny

The generative model is over tokens (name mentions)

Ehud Barak President Barack Obama Secretary of State Hillary Clinton

Barack Obama Hillary Clinton

Barack Obama Clinton

Obama

Barak

Barack

Barry

Hillary Clinton

Barry

But we do type-level inference for the following reasons:

1. Allows faster inference

2. Allows type-level supervision

Page 35: Name Phylogeny - A Generative Model of String Variation

Type phylogeny vs token phylogeny

We collapse all copy edges into a single vertex

President Barack Obama Secretary of State Hillary Clinton

BARACK OBAMA (2) HILLARY CLINTON (2)

Clinton

Obama

Barack

BARRY (2)

Ehud Barak

Barak

Barry

I The first token in each collapsed vertex is a mutation, andthe rest are copies

I Every edge in the phylogeny now corresponds to a mutation

I Approximation: disallow multiple tokens of the same type tobe derived from mutations

Page 36: Name Phylogeny - A Generative Model of String Variation

Scoring phylogenies

The weight of a single phylogeny is the product of the weight of itsedges ∏

y∈Yδ(y | pa(y))

What should the edge weights be?

Page 37: Name Phylogeny - A Generative Model of String Variation

Edge weights

I New names: edges from ♦ to a name x :

δ(x | ♦) = α · p(x | ♦)

I Mutations: edges from a name x to a name y :

δ(y | x) = µ · p(y | x) · nxny + 1

Approximation: Edges weights are not quite edge factored. We aremaking an approximation of the form

E∏y

δ(y | pa(y)) ≈∏y

Eδ(y | pa)

Page 38: Name Phylogeny - A Generative Model of String Variation

Inference via EM

Iterate until convergence:

1. E-step: Given θ, compute a distribution over namephylogenies

2. M-step: Re-estimate transducer parameters θ given marginaledge probabilities.

I This step sums over alignments for each (x , y) string pairusing forward-backward

I Each (x , y) pair may be viewed as a training example weightedby the marginal probability of the edge from x to y

Page 39: Name Phylogeny - A Generative Model of String Variation

E-step: marginalizing over latent variables

The latent variables in the model are:

1. Name phylogeny (spanning tree) relating names as inputsand/or outputs

2. Character alignments from potential input names x to outputnames y

We use the Matrix-Tree theorem for directed graphs (Tutte, 1984)to efficiently evaluate marginal probabilities:

1. Partition function (sum over phylogenies)

2. Edge marginals

Page 40: Name Phylogeny - A Generative Model of String Variation

Speed of inference

Two main slowdowns:

I The complexity of the E-step is dominated by the O(n3) (forn names) matrix inversion required to compute the edgemarginals cxy .

I The M-step sums over alignments for O(n2) input-outputpairs

Approximation: To speed up inference, we prune edges (setδ(y | x) = 0) for names with no trigrams in common

Page 41: Name Phylogeny - A Generative Model of String Variation

Outline

Introduction

Generative Model

Mutation Model

Inference

Experiments

Future Work

Page 42: Name Phylogeny - A Generative Model of String Variation

Data preparation

We used English Wikipedia (2011) to create lists of name variants

1. Wikipedia redirects are human-curated pages to resolvecommon name variants to the correct page (unambiguously)

2. We use Freebase to restrict to redirects for Person entities

3. We applied some further filters to remove redirects that wereclearly not names (e.g. numbers)

4. We use LDC Gigaword to obtain a frequency for each namevariant

Page 43: Name Phylogeny - A Generative Model of String Variation

Sample Wikipedia redirects

Ho Chi Minh, Ho chi mihn, Ho-Chi Minh, Ho Chih-minh

Guy Fawkes, Guy fawkes, Guy faux, Guy Falks, Guy Faukes, Guy Fawks, Guy foxe, Guy Falkes

Nicholas II of Russia, Nikolai Aleksandrovich Romanov, Nicholas Alexandrovich of Russia, Nicolas II

Bill Gates, Lord Billy, Bill Gates, BillGates, Billy Gates, William Gates III, William H. Gates

William Shakespeare, William shekspere, William shakspeare, Bill Shakespear

Bill Clinton, Billll Clinton, William Jefferson Blythe IV, Bill J. Clinton, William J Clinton

Page 44: Name Phylogeny - A Generative Model of String Variation

Wikipedia as supervision

We use Wikipedia name lists for supervision and evaluation

I Treat page redirects as “gold” mutations of the page title:Ho Chi Minh → Ho chi mihnHo Chi Minh → Ho-Chi MinhHo Chi Minh → Ho Chih-minh

I Each list of redirects is cluster of names belonging to thesame entity

I No ambiguous names (by construction)

Page 45: Name Phylogeny - A Generative Model of String Variation

Experiment 1: Transducer log-likelihood

Data:

I 1500 entities (roughly 6000 names) for train

I 1500 different entities (roughly 6000 names) for test

Procedure:I At train time

1. Initialize transducer parameters θ using different amounts ofsupervision (up to 250 entities)

2. Run EM for 10 iterations to re-estimate θ3. α = 1.0, µ = 0.1

I At test time

1. Evaluate log-likelihood of the transducer on all “gold” pairsfrom the test set

Page 46: Name Phylogeny - A Generative Model of String Variation

Experiment 1: Mutation model log-likelihood

,0 1 2 3 4 5 6 7 8 9

EM iteration240000

230000

220000

210000

200000

190000

180000

170000

160000

150000He

ld o

ut lo

g-lik

elih

ood

sup=0sup=5sup=25sup=100sup=250

Page 47: Name Phylogeny - A Generative Model of String Variation

Experiment 2: Ranking

Data: same as beforeProcedure:

I At train time

1. Estimate transducer parameters θ2. α = 1.0, µ = 0.1

I At test time

1. For each Wikipedia person page in the test set, produce aranking of all test aliases

2. Compute mean reciprocal rank (MRR) over all such rankings

Page 48: Name Phylogeny - A Generative Model of String Variation

Experiment 2: Ranking

15000.60

0.65

0.70

0.75

0.80

0.85

MRR

jwinklevsup10semi10unsupsup

I For each article name in the test corpus, produce a ranking ofredirects

I The rankings are evaluated using mean reciprocal rank

Page 49: Name Phylogeny - A Generative Model of String Variation

Outline

Introduction

Generative Model

Mutation Model

Inference

Experiments

Future Work

Page 50: Name Phylogeny - A Generative Model of String Variation

Future Work

I More sophisticated mutation modelsI Incorporate internal name structure

I Incorporate context in the generative storyI Cross-lingual experiments

I Each vertex labeled with a language, allowing systematicrelationships between languages

I Other potential applicationsI Derivational morphologyI ParaphraseI TransliterationI Historical linguisticsI Bibliographic entry variation

Page 51: Name Phylogeny - A Generative Model of String Variation

Experiment 3 (preliminary): Precision/Recall

Procedure:I At train time

1. Estimate transducer parameters θ using EM2. Find the best spanning tree given θ

I At test time

1. Attach held-out names to the most likely vertex in the inferredspanning tree

2. Evaluate precision and recall for the connected component

Page 52: Name Phylogeny - A Generative Model of String Variation

Experiment 3 (preliminary): Example attachment

Khawaja Gharibnawaz Muinuddin Hasan Chisty

Khwaja Gharib Nawaz

Khwaja Muin al-Din Chishti

Ghareeb Nawaz

Khwaja Moinuddin Chishti

Khwaja gharibnawazMuinuddin Chishti

Thomas Ruggles Pynchon, Jr.

Thomas Ruggles Pynchon Jr.

Thomas R. Pynchon, Jr.

Thomas R. Pynchon Jr.

Thomas R. Pynchon

Thomas Pynchon, Jr.

Thomas Pynchon Jr.

Thomas Ruggles

? ?

I Held-out names can attach to any vertex in the treeI Including ♦

I Attachment weights given by edge weights δ(y |x)

Page 53: Name Phylogeny - A Generative Model of String Variation

Experiment 3 (preliminary): Results

0.0 0.2 0.4 0.6 0.8 1.0Recall

0.0

0.2

0.4

0.6

0.8

1.0

Prec

isio

n

0% supervised1% supervised8% supervised24% supervised100% supervised