Discriminative Methods with Structure Simon Lacoste-Julien UC Berkeley joint work with: March 21, 2008 Fei Sha Ben Taskar Dan Klein Mike Jordan

Page 1:

Discriminative Methods with Structure

Simon Lacoste-Julien

UC Berkeley

March 21, 2008

Joint work with: Fei Sha, Ben Taskar, Dan Klein, Mike Jordan

Page 2:

« Discriminative method »

Decision-theoretic framework:

Loss:

Decision function:

Risk:

Contrast function:
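
The formulas on this slide did not survive extraction. As a hedged reminder of the standard setup, the framework can be written as below. The notation ($\ell$, $h_w$, $R$, $\hat{R}$, $\phi$) is an assumption and may differ from what the slide showed.

```latex
\begin{align*}
\text{Loss:} \quad & \ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{+} \\
\text{Decision function:} \quad & h_w : \mathcal{X} \to \mathcal{Y} \\
\text{Risk:} \quad & R(h_w) = \mathbb{E}_{(x,y)}\left[ \ell\big(y, h_w(x)\big) \right] \\
\text{Contrast function:} \quad & \hat{R}(w) = \frac{1}{n} \sum_{i=1}^{n} \phi(w; x_i, y_i)
\end{align*}
```

Here $\phi$ stands for whatever surrogate of the loss is actually optimized on the training sample; a discriminative method chooses $w$ by minimizing this contrast function directly.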

Page 3:

« with structure » on outputs:

Handwriting recognition

Input: image of a handwritten word; Output: brace

huge!

Machine translation

Input: ‘Ce n'est pas un autre problème de classification.’

Output: ‘This is not another classification problem.’

Page 4:

« with structure » on inputs:

text documents -> latent variable model -> new representation -> classification

Page 5:

Structure on outputs: Discriminative Word Alignment project

(joint work with Ben Taskar, Dan Klein and Mike Jordan)

Page 6:

Word Alignment

What is the anticipated cost of collecting fees under the new proposal?

En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?

[Figure: alignment grid, labeled x and y, between the English words (What is the anticipated cost of collecting fees under the new proposal ?) and the tokenized French words (En vertu de les nouvelles propositions , quel est le coût prévu de perception de les droits ?)]

Key step in most machine translation systems

Page 7:

Overview

Review of large-margin word alignment [Taskar et al. EMNLP 05]

Two new extensions to the basic model:

Fertility features

First-order interactions using quadratic assignment

Results on Hansards dataset

Page 8:

Feature-Based Alignment

Features:

Association: MI = 3.2, Dice = 4.1

Lexical pair: ID(proposal, proposition) = 1

Position in sentence: AbsDist = 5, RelDist = 0.3

Orthography: ExactMatch = 0, Similarity = 0.8

Resources: PairInDictionary

Other Models (IBM2, IBM4)

[Figure: alignment grid for the English and French example sentences, with word indices j and k marked]

Page 9:

Scoring Whole Alignments

[Figure: alignment grid for the English and French example sentences, with word indices j and k marked]

Page 10:

Prediction as a Linear Program

Annotations on the program: degree constraint, relaxation

Still guaranteed to have integral solutions y

[Figure: alignment grid for the English and French example sentences, with word indices j and k marked]
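
The program itself was lost in extraction. A hedged sketch of a matching-style LP that fits the slide's annotations is given below; the symbols $s_{jk}$, $f$ and the exact form of the degree constraints are assumptions, not a transcription of the slide.

```latex
\begin{align*}
\max_{y} \quad & \sum_{j,k} s_{jk}\, y_{jk},
  \qquad s_{jk} = w^{\top} f(x_{jk}) \\
\text{s.t.} \quad & \sum_{k} y_{jk} \le 1 \quad \forall j
  && \text{(degree constraint on word } j\text{)} \\
& \sum_{j} y_{jk} \le 1 \quad \forall k
  && \text{(degree constraint on word } k\text{)} \\
& 0 \le y_{jk} \le 1
  && \text{(relaxation of } y_{jk} \in \{0,1\}\text{)}
\end{align*}
```

For bipartite matching constraints of this kind the constraint matrix is totally unimodular, which is why the relaxed program still has integral optimal solutions.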

Page 11:

Learning w

Supervised training data

Training methods:

Maximum likelihood/entropy

Perceptron

Maximum margin

Page 12:

Maximum Likelihood/Entropy

Probabilistic approach:

Problem: computing the denominator is #P-complete [Valiant 79, Jerrum & Sinclair 93]

Can’t find maximum likelihood parameters

Page 13:

(Averaged) Perceptron

Perceptron for structured output [Collins 2002]:

For each example:

Predict:

Update:

Output averaged parameters:
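
The predict/update/average loop above can be sketched as follows. This is a minimal illustration on a toy sequence-labeling task, not the alignment model from the talk: the label set `LABELS`, the feature map `feats`, and the brute-force `predict` are all hypothetical stand-ins (the talk's predictor is an LP over alignments).

```python
from collections import defaultdict
from itertools import product

LABELS = ("A", "B")  # hypothetical label set, for illustration only


def feats(x, y):
    """Hypothetical joint feature map f(x, y): emission and transition counts."""
    f = defaultdict(float)
    for tok, lab in zip(x, y):
        f[("emit", tok, lab)] += 1.0
    for a, b in zip(y, y[1:]):
        f[("trans", a, b)] += 1.0
    return f


def score(w, x, y):
    """Linear score w . f(x, y)."""
    return sum(w[k] * v for k, v in feats(x, y).items())


def predict(w, x):
    """Brute-force argmax over all label sequences (only viable for tiny inputs)."""
    return max(product(LABELS, repeat=len(x)), key=lambda y: score(w, x, y))


def averaged_perceptron(data, epochs=5):
    """Structured perceptron; returns the average of w over all steps."""
    w = defaultdict(float)
    total = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for x, y in data:
            y_hat = predict(w, x)
            if y_hat != tuple(y):
                # Update: move toward the true features, away from the predicted ones.
                for k, v in feats(x, y).items():
                    w[k] += v
                for k, v in feats(x, y_hat).items():
                    w[k] -= v
            # Accumulate the current weights for averaging.
            for k, v in w.items():
                total[k] += v
            steps += 1
    return defaultdict(float, {k: v / steps for k, v in total.items()})
```

Calling `averaged_perceptron(data)` on a list of `(x, y)` pairs returns weights that can be passed straight to `predict`. Averaging is used because the final weight vector alone tends to depend heavily on the last few examples seen.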

Page 14:

Large Margin Estimation

Equivalent min-max formulation [Taskar et al 04,05]

Simple LP

Labels on the margin constraint: true score, other score, loss
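
The constraint and its min-max form were images on the slide. A hedged reconstruction in the usual structured large-margin notation is shown below; the symbols $f$, $\ell$, $\xi_i$ and $C$ are assumptions chosen to match the three labels (true score, other score, loss).

```latex
\begin{align*}
\min_{w,\,\xi} \quad & \tfrac{1}{2}\|w\|^{2} + C \sum_{i} \xi_{i} \\
\text{s.t.} \quad &
  \underbrace{w^{\top} f(x_i, y_i)}_{\text{true score}}
  \;\ge\;
  \underbrace{w^{\top} f(x_i, y)}_{\text{other score}}
  + \underbrace{\ell(y_i, y)}_{\text{loss}}
  - \xi_{i}
  \qquad \forall i,\ \forall y \\[6pt]
\Longleftrightarrow \quad
\min_{w} \quad & \tfrac{1}{2}\|w\|^{2}
  + C \sum_{i} \Big[ \max_{y} \big( w^{\top} f(x_i, y) + \ell(y_i, y) \big)
  - w^{\top} f(x_i, y_i) \Big]
\end{align*}
```

When the loss decomposes over edges, the inner maximization has the same form as prediction, which is presumably what the slide's "Simple LP" note refers to.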

Page 15:

Min-max formulation - QP

LP duality

QP of polynomial size!

=> Mosek

Page 16:

Experimental Setup

French Canadian Hansards Corpus

Word-level aligned:
200 sentence pairs (training data)
37 sentence pairs (validation data)
247 sentence pairs (test data)

Sentence-level aligned:
1M sentence pairs
Generate association-based features
Learn unsupervised IBM Models

Learn using Large Margin

Evaluate alignment quality using standard AER (Alignment Error Rate) [similar to F1]
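
For reference, the standard AER definition, together with the precision and recall reported in the results tables, can be sketched as below. The function names and the representation of links as `(j, k)` index pairs are assumptions for illustration.

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted, sure, possible are collections of (j, k) links.
    Sure links are treated as possible links as well.
    """
    a, s = set(predicted), set(sure)
    p = set(possible) | s
    if not a and not s:
        return 0.0
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


def precision_recall(predicted, sure, possible):
    """Precision is measured against possible links, recall against sure links."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s
    precision = len(a & p) / len(a) if a else 1.0
    recall = len(a & s) / len(s) if s else 1.0
    return precision, recall
```

Because precision is computed against the possible links and recall against the sure links, a prediction that adds only possible links is not penalized, which is what makes AER similar to, but not identical to, an F1 score.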

Page 17:

Old Results (200 train / 247 test split)

Model                        AER    Prec / Rec
IBM model 4 (intersected)    6.5    98 / 88%
Basic                        8.2    93 / 90%
+ model 4                    5.1    98 / 92%

Page 18:

Improving basic model

We would like to model:

Fertility: Alignments are not necessarily 1-to-1

First-order interactions: Alignments are mostly locally diagonal; we would like to score each alignment edge depending on its neighbors

Strategy: extensions keeping prediction model as an LP

Page 19:

Modeling Fertility

Example of node feature: for word w, fraction of time it had fertility > k on the training set

fertility penalty
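
The example node feature above (for a word, the fraction of training occurrences in which its fertility exceeded k) could be computed along these lines. The input format, a list of `(words, links)` pairs where each link `(j, k)` attaches word `j` of `words` to position `k` on the other side, is an assumption for illustration.

```python
from collections import Counter


def fertility_fractions(training_alignments, k):
    """Map each word to the fraction of its occurrences with fertility > k.

    training_alignments: iterable of (words, links), where links is a set of
    (j, other_index) pairs and j indexes into words.
    """
    seen = Counter()
    exceeded = Counter()
    for words, links in training_alignments:
        # Fertility of position j = number of links attached to it.
        degree = Counter(j for j, _ in links)
        for j, word in enumerate(words):
            seen[word] += 1
            if degree[j] > k:
                exceeded[word] += 1
    return {word: exceeded[word] / seen[word] for word in seen}
```

One such fraction per threshold k gives a small family of features that lets the model learn how heavily to penalize extra links for each word.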

Page 21:

Fertility Results (200 train / 247 test split)

Model                        AER    Prec / Rec
IBM model 4 (intersected)    6.5    98 / 88%
Basic                        8.2    93 / 90%
+ model 4                    5.1    98 / 92%
+ model 4 + fertility        4.9    96 / 94%

Page 22:

Fertility example

[Figure: alignment grid; legend: sure alignments, possible alignments, predicted alignments]

Page 23:

Modeling First Order Effects

Restrict:

monotonicity

local inversion

local fertility

want:

relaxation:
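
The "want" and "relaxation" formulas are missing from the transcript. A common way to express such a pairwise term and relax it linearly is sketched below; the variable name $z_{jklm}$ and the use of $y$ for edge variables are assumptions, so this may not match the slide exactly.

```latex
\begin{align*}
\text{want:} \quad & z_{jklm} = y_{jk}\, y_{lm} \\
\text{relaxation:} \quad & z_{jklm} \le y_{jk}, \qquad
  z_{jklm} \le y_{lm}, \qquad
  z_{jklm} \ge y_{jk} + y_{lm} - 1, \qquad
  z_{jklm} \ge 0
\end{align*}
```

For binary $y$ these inequalities force $z_{jklm}$ to equal the product, so the relaxation only loses exactness when the edge variables themselves become fractional.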

Page 24:

Integer program

Quadratic assignment is NP-complete; on real-world sentences (2 to 30 words) it takes a few seconds using Mosek (~1k variables)

Interestingly, in our dataset 80% of examples yield an integer solution when solved via linear relaxation

Same AER when using relaxation!

Page 28:

New Results (200 train / 247 test split)

Model                                  AER    Prec / Rec
IBM model 4 (intersected)              6.5    98 / 88%
Basic                                  8.2    93 / 90%
+ model 4                              5.1    98 / 92%
Basic + fertility + qap                6.1    94 / 93%
+ fertility + qap + model 4            4.3    96 / 95%
+ fertility + qap + model 4 + liang    3.8    97 / 96%

Page 29:

Fert + qap example

Page 30:

Fert + qap example

Page 31:

Conclusions

Feature-based word alignment

Efficient algorithms for supervised learning

Exploit unsupervised data via features, other models

Surprisingly accurate with simple features

Include fertility model and first order interactions

38% AER reduction over intersected Model 4

Lowest published AER on this data set

High recall alignments -> promising for MT

Page 32:

Structure on inputs: discLDA project

(work in progress)

(joint work with Fei Sha and Mike Jordan)

Page 33:

Unsupervised dimensionality reduction

text documents -> latent variables model -> new representation -> classification

Page 34:

Analogy: PCA vs. FDA

[Figure: scatter plot of two classes of points (x and o), showing the PCA direction and the FDA direction]

Page 35:

Goal: supervised dim. reduction

text documents -> latent variables model with supervised information -> new representation -> classification

Page 36:

Review: LDA model

Page 37:

Discriminative version of LDA

Ultimately, want to learn discriminatively -> but high-dimensional non-convex objective, hard to optimize!

Instead, propose to learn class-dependent linear transformation of common ‘s:

New generative model:

Equivalently, transformation on :
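
The symbols and the generative model on this slide were lost in extraction. One hedged reading, written in the usual LDA notation, is given below. The symbols $\theta_d$, $\Phi$, $\alpha$ and $T^{y}$ are assumptions: the transcript only confirms that a class-dependent linear transformation T is applied to a shared quantity.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{shared topic proportions for document } d \\
z_{dn} \mid \theta_d, y_d &\sim \mathrm{Mult}\big(T^{y_d}\,\theta_d\big)
  && \text{class-dependent transformation } T^{y_d} \\
w_{dn} \mid z_{dn} &\sim \mathrm{Mult}\big(\Phi_{\cdot\, z_{dn}}\big)
  && \text{word drawn from topic } z_{dn} \\[4pt]
\text{equivalently:} \quad
w_{dn} \mid \theta_d, y_d &\sim \mathrm{Mult}\big(\Phi\, T^{y_d}\,\theta_d\big)
  && \text{transformation absorbed into the topic matrix}
\end{align*}
```

For the transformed proportions to remain a distribution, each $T^{y}$ would need nonnegative entries with columns summing to one.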

Page 38:

Simplex Geometry

[Figure: two views of the word simplex (vertices w1, w2, w3) and the topic simplex, with documents of two classes plotted as x and o]

Page 39:

Interpretation 1

Shared topic vs. class-specific topic:

shared topics

class-specific topics

Page 40:

Interpretation 2

Generative model from T, add a new latent variable u:

Page 41:

Compare with AT model

Author-Topic model [Rosen-Zvi et al. 2004]

discLDA

Page 42:

Inference and learning

Page 43:

Learning

For fixed T, learn by sampling (z,u) [Rao-Blackwellized Gibbs sampling]

For fixed , update T using stochastic gradient ascent on conditional log-likelihood:

in an online fashion

get approximate gradient using Monte Carlo EM

use Harmonic Mean estimator to estimate

Currently, results are noisy…

Page 44:

Inference (dimensionality reduction)

Given learned T and : estimate using Harmonic Mean estimator

compute by marginalizing over y to get new representation of document
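
The Harmonic Mean estimator mentioned above approximates a marginal likelihood from posterior samples as the harmonic mean of the per-sample likelihoods. A minimal sketch in log space follows; the function name and the assumption that per-sample log-likelihoods are already available (for example from Gibbs samples) are illustrative, not taken from the talk.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logs.

    Computes log( S / sum_s exp(-ll_s) ) using a log-sum-exp shift so that
    very small likelihoods do not underflow.
    """
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(neg)) - log_sum
```

The estimator is simple to compute but is known to have high variance, since it is dominated by the samples with the smallest likelihood, which is consistent with the noisy results reported in the talk.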

Page 45:

Preliminary Experiments

Page 46:

20 Newsgroup dataset

Used fixed T:

Get reduced representation -> train linear SVM on it

hence 110 topics for

11k train, 7.5k test

vocabulary: 50k

Page 47:

Classification results

discLDA + SVM: 20% error

LDA + SVM: 25% error

discLDA predictions: 20% error

Page 48:

Newsgroup embedding (LDA)

Page 49:

Newsgroup embedding (discLDA)

Page 50:

using tSNE (on discLDA)

thanks to Laurens van der Maaten for figure! [Hinton’s group]

Page 51:

using tSNE (on LDA)

thanks to Laurens van der Maaten for figure! [Hinton’s group]

Page 52:

Learned topics

Page 53:

Another embedding

NIPS papers vs. Psychology abstracts

[Figure: two embeddings, one from LDA and one from discLDA]

Page 54:

13 scenes dataset [Fei-Fei 2005]

train: 100 per category

test: 2558

Page 55:

Vocabulary (visual words)

Page 56:

Topics

Page 57:

Conclusion

fixed transformation T enables topic sharing & exploration

get reduced representation which preserves predictive power

noisy gradient estimates

still work in progress

will probably try variational approach instead
