Discriminative Methods with Structure
Simon Lacoste-Julien
UC Berkeley
joint work with: Fei Sha, Ben Taskar, Dan Klein, Mike Jordan
March 21, 2008
« Discriminative method »
Decision-theoretic framework:
Loss: $\ell(y, \hat{y})$
Decision function: $h_{\mathbf{w}} : \mathcal{X} \to \mathcal{Y}$
Risk: $R(h) = \mathbb{E}[\ell(Y, h(X))]$
Contrast function: a tractable surrogate of the risk, minimized during training
« with structure » on outputs:
Handwriting recognition
Input: image of a handwritten word -> Output: 'brace'
The output space is huge!
Machine translation
Input: 'Ce n'est pas un autre problème de classification.'
Output: 'This is not another classification problem.'
« with structure » on inputs:
text documents -> latent variable model -> new representation -> classification
Structure on outputs: Discriminative Word Alignment project
(joint work with Ben Taskar, Dan Klein and Mike Jordan)
Word Alignment
What is the anticipated cost of collecting fees under the new proposal?
En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?
[Figure: word-by-word alignment links between the English sentence x and the French sentence y]
Key step in most machine translation systems
Overview
Review of large-margin word alignment [Taskar et al. EMNLP 05]
Two new extensions to the basic model:
fertility features
first-order interactions using quadratic assignment
Results on Hansards dataset
Feature-Based Alignment
Features:
Association: MI = 3.2, Dice = 4.1
Lexical pair: ID(proposal, proposition) = 1
Position in sentence: AbsDist = 5, RelDist = 0.3
Orthography: ExactMatch = 0, Similarity = 0.8
Resources: PairInDictionary
Other models: IBM2, IBM4
[Figure: alignment matrix between English word positions j and French word positions k, with a feature vector on each edge (j, k)]
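To make the association features concrete, here is a minimal sketch (not the project's code; `dice_scores` and the data layout are assumptions) that computes Dice scores from sentence-aligned data, the kind of statistic the large unlabeled corpus provides:

```python
from collections import Counter
from itertools import product

def dice_scores(bitext):
    """Dice association score for each (English, French) word pair:
    Dice(e, f) = 2 * C(e, f) / (C(e) + C(f)), where counts are numbers
    of sentence pairs in which the words (co-)occur."""
    count_e, count_f, count_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in bitext:
        e_types, f_types = set(e_sent), set(f_sent)
        count_e.update(e_types)
        count_f.update(f_types)
        count_ef.update(product(e_types, f_types))
    return {(e, f): 2.0 * c / (count_e[e] + count_f[f])
            for (e, f), c in count_ef.items()}

bitext = [("the cost".split(), "le coût".split()),
          ("the fees".split(), "les droits".split())]
print(dice_scores(bitext)[("cost", "coût")])  # 1.0: they always co-occur
```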
Scoring Whole Alignments
The score decomposes over edges: $s_{\mathbf{w}}(x, y) = \sum_{j,k} \mathbf{w}^\top \mathbf{f}(x, j, k)\, y_{jk}$, where $y_{jk} \in \{0, 1\}$ indicates whether English word j is aligned to French word k.
[Figure: the alignment matrix again, each edge (j, k) contributing its score when selected]
Prediction as a Linear Program
Relaxation: $\max_{y} \sum_{j,k} s_{jk}\, y_{jk}$ subject to the degree constraints $\sum_k y_{jk} \le 1$ and $\sum_j y_{jk} \le 1$, with $0 \le y_{jk} \le 1$.
Still guaranteed to have integral solutions y: the degree-constraint matrix of the bipartite matching polytope is totally unimodular.
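A minimal sketch of this prediction LP with scipy's solver (the talk used Mosek; `predict_alignment` is a hypothetical name). Because the degree-constraint matrix is totally unimodular, the optimum is integral and the final rounding only cleans up floating-point noise:

```python
import numpy as np
from scipy.optimize import linprog

def predict_alignment(score):
    """Maximize sum_jk score[j, k] * y[j, k] subject to degree
    constraints sum_k y[j, k] <= 1 and sum_j y[j, k] <= 1, with the
    relaxation 0 <= y[j, k] <= 1."""
    m, n = score.shape
    # One row per English word j: picks out y[j, 0..n-1] in y.ravel()
    deg_rows = np.kron(np.eye(m), np.ones((1, n)))
    # One row per French word k: picks out y[0..m-1, k]
    deg_cols = np.kron(np.ones((1, m)), np.eye(n))
    res = linprog(c=-score.ravel(),                 # linprog minimizes
                  A_ub=np.vstack([deg_rows, deg_cols]),
                  b_ub=np.ones(m + n), bounds=(0, 1))
    return res.x.reshape(m, n).round().astype(int)

score = np.array([[2.0, 0.1], [0.3, 1.5], [0.2, 0.4]])
print(predict_alignment(score))  # aligns 0 -> 0 and 1 -> 1; word 2 unaligned
```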
Learning w
Supervised training data
Training methods:
maximum likelihood/entropy
perceptron
maximum margin
Maximum Likelihood/Entropy
Probabilistic approach: $P_{\mathbf{w}}(y \mid x) \propto \exp(s_{\mathbf{w}}(x, y))$
Problem: the denominator sums over all alignments, which amounts to computing a matrix permanent; this is #P-complete [Valiant 79, Jerrum & Sinclair 93]
Can't find maximum likelihood parameters
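To see where the intractability comes from: restricted to one-to-one alignments of length-n sentences, the normalizer is the permanent of the matrix exp(s), and brute force sums n! terms. A toy sketch (hypothetical function; real alignments also allow unaligned words, which only makes the sum larger):

```python
import numpy as np
from itertools import permutations

def log_partition_bruteforce(score):
    """Normalizer of P(y|x) ∝ exp(sum_jk s_jk y_jk) over one-to-one
    alignments: the permanent of exp(score), summed over all n!
    permutations; this is the quantity that is #P-complete in general."""
    n = score.shape[0]
    total = sum(np.exp(sum(score[j, perm[j]] for j in range(n)))
                for perm in permutations(range(n)))
    return float(np.log(total))

print(log_partition_bruteforce(np.random.rand(8, 8)))  # already 8! = 40320 terms
```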
(Averaged) Perceptron
Perceptron for structured output [Collins 2002]:
For each example $(x_i, y_i)$:
Predict: $\hat{y} = \arg\max_{y} \mathbf{w}^\top \mathbf{f}(x_i, y)$
Update: $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{f}(x_i, y_i) - \mathbf{f}(x_i, \hat{y})$
Output averaged parameters: $\bar{\mathbf{w}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)}$
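A minimal sketch of the averaged structured perceptron, assuming two hypothetical helpers: `feat(x, y)` returns the joint feature vector and `predict(w, x)` solves the argmax (e.g. with the alignment LP above):

```python
import numpy as np

def averaged_perceptron(examples, feat, predict, epochs=10):
    """Structured perceptron [Collins 2002]: predict with the current
    weights, move toward the gold features on mistakes, and return the
    average of the weight vectors seen after every example."""
    dim = feat(*examples[0]).shape[0]
    w, w_sum = np.zeros(dim), np.zeros(dim)
    for _ in range(epochs):
        for x, y in examples:
            y_hat = predict(w, x)
            if not np.array_equal(y_hat, y):
                w = w + feat(x, y) - feat(x, y_hat)
            w_sum += w                      # averaging reduces variance
    return w_sum / (epochs * len(examples))
```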
Large Margin Estimation
$\min_{\mathbf{w}} \tfrac{1}{2}\|\mathbf{w}\|^2$ s.t. $\mathbf{w}^\top \mathbf{f}(x_i, y_i) \ge \max_{y} \left[\, \mathbf{w}^\top \mathbf{f}(x_i, y) + \ell(y_i, y) \,\right]$ for all i
(the true score must beat any other score by a margin that grows with the loss)
Equivalent min-max formulation [Taskar et al 04, 05]: the inner max is a simple LP, so applying LP duality turns the whole problem into a QP of polynomial size!
=> solved with Mosek
Experimental Setup
French-English Canadian Hansards corpus
Word-level aligned: 200 sentence pairs (training), 37 sentence pairs (validation), 247 sentence pairs (test)
Sentence-level aligned: 1M sentence pairs, used to generate association-based features and to learn unsupervised IBM models
Learn w using large margin
Evaluate alignment quality using the standard AER (Alignment Error Rate) [similar to F1]
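For reference, AER over sets of alignment links, following the standard definition with sure links S and possible links P (S a subset of P); a minimal sketch:

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate: with predicted links A, sure links S and
    possible links P, AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).
    Lower is better; 0 means every predicted link is at least possible
    and every sure link is recovered."""
    a, s, p = set(predicted), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

sure = {(0, 0), (1, 1)}
possible = sure | {(2, 2)}                  # sure links are also possible
print(aer({(0, 0), (1, 1), (2, 2)}, sure, possible))  # 0.0
```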
Old Results (200 train / 247 test split)
Model | AER | Prec / Rec
IBM Model 4 (intersected) | 6.5 | 98 / 88%
Basic | 8.2 | 93 / 90%
Basic + Model 4 | 5.1 | 98 / 92%
Improving the basic model
We would like to model:
Fertility: alignments are not necessarily 1-to-1
First-order interactions: alignments are mostly locally diagonal; we would like to score an edge depending on its neighbors
Strategy: extensions that keep the prediction model an LP
Modeling Fertility
Relax the degree constraints to allow fertility above 1, at the price of a learned fertility penalty.
Example of node feature: for word w, the fraction of times it had fertility > k on the training set
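A sketch of that node feature computed from word-aligned training data (the function name and data layout are assumptions, not the project's code):

```python
from collections import Counter, defaultdict

def fertility_features(alignments, max_k=3):
    """For each English word w, the fraction of its training occurrences
    whose fertility (number of linked French words) exceeds k, for
    k = 0 .. max_k - 1."""
    occurrences, exceed = Counter(), defaultdict(Counter)
    for e_sent, links in alignments:        # links: set of (j, k) pairs
        fert = Counter(j for j, _ in links)
        for j, w in enumerate(e_sent):
            occurrences[w] += 1
            for k in range(max_k):
                exceed[w][k] += fert[j] > k
    return {w: [exceed[w][k] / occurrences[w] for k in range(max_k)]
            for w in occurrences}

data = [("not a problem".split(), {(0, 0), (0, 1), (2, 3)})]
print(fertility_features(data)["not"])      # [1.0, 1.0, 0.0]: fertility 2
```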
Fertility Results (200 train / 247 test split)
Model | AER | Prec / Rec
IBM Model 4 (intersected) | 6.5 | 98 / 88%
Basic | 8.2 | 93 / 90%
Basic + Model 4 | 5.1 | 98 / 92%
Basic + Model 4 + fertility | 4.9 | 96 / 94%
Fertility example
[Figure: example alignment matrix; legend: sure alignments, possible alignments, predicted alignments]
Modeling First Order Effects
Restrict the first-order interactions to local patterns: monotonicity, local inversion, local fertility.
Want: score pairs of neighboring edges jointly. This is a quadratic assignment problem, an integer program that is NP-complete in general.
Relaxation: linearize the quadratic terms so prediction stays a (larger) LP.
On real-world sentences (2 to 30 words) it takes a few seconds using Mosek (~1k variables).
Interestingly, in our dataset 80% of examples yield an integer solution when solved via the linear relaxation, and we get the same AER when using the relaxation!
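A minimal sketch of one such linearized relaxation. It rewards only one of the patterns, monotone diagonal neighbors, with a single shared weight; since that reward is positive, the constraints z <= y_e on each pair variable z suffice to linearize the product of the two edge indicators. All names and the one-pattern restriction are simplifications for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def predict_with_qap(score, diag_bonus=0.5):
    """Alignment LP extended with a variable z_p for each pair of
    monotone neighbor edges (j, k) and (j+1, k+1). Each z_p earns
    diag_bonus and is tied to its two edges by z_p <= y_e, a valid
    linearization of y_e * y_e' because the bonus is positive."""
    m, n = score.shape
    ny = m * n
    pairs = [(j * n + k, (j + 1) * n + k + 1)
             for j in range(m - 1) for k in range(n - 1)]
    nz = len(pairs)
    c = np.concatenate([-score.ravel(), -diag_bonus * np.ones(nz)])
    # Degree constraints on y (z variables do not appear in them).
    A_deg = np.hstack([np.vstack([np.kron(np.eye(m), np.ones((1, n))),
                                  np.kron(np.ones((1, m)), np.eye(n))]),
                       np.zeros((m + n, nz))])
    # Linearization rows: z_p - y_e <= 0 for both edges e of pair p.
    A_lin = np.zeros((2 * nz, ny + nz))
    for p, (e1, e2) in enumerate(pairs):
        A_lin[2 * p, [e1, ny + p]] = [-1, 1]
        A_lin[2 * p + 1, [e2, ny + p]] = [-1, 1]
    res = linprog(c, A_ub=np.vstack([A_deg, A_lin]),
                  b_ub=np.concatenate([np.ones(m + n), np.zeros(2 * nz)]),
                  bounds=(0, 1))
    return res.x[:ny].reshape(m, n)

score = np.array([[1.0, 0.2], [0.2, 1.0]])
print(predict_with_qap(score))   # the diagonal alignment, reinforced by z
```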
New Results (200 train / 247 test split)
Model | AER | Prec / Rec
IBM Model 4 (intersected) | 6.5 | 98 / 88%
Basic | 8.2 | 93 / 90%
Basic + Model 4 | 5.1 | 98 / 92%
Basic + fertility + qap | 6.1 | 94 / 93%
Basic + fertility + qap + Model 4 | 4.3 | 96 / 95%
Basic + fertility + qap + Model 4 + Liang | 3.8 | 97 / 96%
Fertility + QAP example
[Figure: example alignments predicted by the fertility + QAP model]
Conclusions
Feature-based word alignment
Efficient algorithms for supervised learning
Exploit unsupervised data via features and other models
Surprisingly accurate with simple features
Fertility model and first-order interactions included
38% AER reduction over intersected Model 4
Lowest published AER on this data set
High-recall alignments -> promising for MT
Structure on inputs: discLDA project
(work in progress)
(joint work with Fei Sha and Mike Jordan)
Unsupervised dimensionality reduction
text documents -> latent variable model -> new representation -> classification
Analogy: PCA vs. FDA
[Figure: two classes of points; the PCA direction follows the overall variance, while the FDA direction separates the classes]
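The analogy in code: on two elongated, vertically separated Gaussian classes, PCA's top direction tracks the overall variance while FDA's direction tracks class separation (a self-contained sklearn sketch on synthetic data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two classes stretched along x but separated only along y.
X = np.vstack([rng.normal([0, 0], [5.0, 0.3], size=(100, 2)),
               rng.normal([0, 2], [5.0, 0.3], size=(100, 2))])
y = np.repeat([0, 1], 100)

print(PCA(n_components=1).fit(X).components_[0])   # ~ [1, 0]: variance axis
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_[0] / np.linalg.norm(lda.coef_))    # ~ [0, 1]: separating axis
```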
Goal: supervised dimensionality reduction
text documents -> latent variable model with supervised information -> new representation -> classification
Review: LDA model
Latent Dirichlet Allocation [Blei et al. 2003]: for each document, draw topic proportions $\theta \sim \mathrm{Dir}(\alpha)$; for each word, draw a topic $z_n \sim \mathrm{Mult}(\theta)$ and then a word $w_n \sim \mathrm{Mult}(\phi_{z_n})$.
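A minimal sketch of that generative process (hypothetical function name):

```python
import numpy as np

def sample_lda_corpus(n_docs, doc_len, alpha, phi, rng):
    """LDA generative model: per document theta ~ Dir(alpha); per word
    a topic z ~ Mult(theta), then a word id w ~ Mult(phi[z])."""
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)
        topics = rng.choice(len(alpha), size=doc_len, p=theta)
        docs.append([rng.choice(phi.shape[1], p=phi[z]) for z in topics])
    return docs

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(30), size=4)     # 4 topics over 30 word types
print(sample_lda_corpus(2, 10, 0.5 * np.ones(4), phi, rng))
```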
Discriminative version of LDA
Ultimately, we want to learn the topics discriminatively -> but this is a high-dimensional non-convex objective, hard to optimize!
Instead, we propose to learn a class-dependent linear transformation $T^y$ of the common $\theta$'s:
New generative model: $\theta \sim \mathrm{Dir}(\alpha)$ as before, but $z_n \sim \mathrm{Mult}(T^y \theta)$ for a document of class y.
Equivalently, a transformation on $\Phi$: mixing the topics $\Phi$ with proportions $T^y \theta$ is the same as mixing class-transformed topics with $\theta$.
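A toy numeric illustration of such class-dependent transformations, with a tiny block structure chosen by hand for clarity (these values are illustrative, not the project's T). Each column of $T^y$ is a distribution over topics, so $T^y \theta$ stays on the simplex:

```python
import numpy as np

# K = 2 document-level proportions, L = 3 topics: topic 2 is shared,
# topics 0 and 1 are class-specific.
T = {0: np.array([[1.0, 0.0],      # class 0 uses topic 0 ...
                  [0.0, 0.0],
                  [0.0, 1.0]]),    # ... plus the shared topic 2
     1: np.array([[0.0, 0.0],
                  [1.0, 0.0],      # class 1 uses topic 1 ...
                  [0.0, 1.0]])}    # ... plus the shared topic 2

theta = np.array([0.7, 0.3])       # theta ~ Dir(alpha), same for all y
for y, Ty in T.items():
    print(y, Ty @ theta)           # mixing proportions over z per class
```

With this block structure, the second coordinate of theta always feeds the shared topic while the first feeds a topic owned by the class, which is exactly Interpretation 1 below.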
Simplex Geometry
[Figure: documents of the two classes on the word simplex (vertices w1, w2, w3), and the topic simplex embedded inside it]
Interpretation 1
Shared topics vs. class-specific topics: with a block-structured $T^y$, every class mixes a common pool of shared topics with its own class-specific topics.
Interpretation 2
Generative model from T: add a new latent variable u per word, with $u_n \sim \mathrm{Mult}(\theta)$ and $z_n \sim \mathrm{Mult}(T^y_{\cdot, u_n})$, i.e. u selects a column of $T^y$.
Compare with the AT model
[Figure: graphical models of the Author-Topic model [Rosen-Zvi et al. 2004] and of discLDA, side by side]
Inference and learning
Learning:
For fixed T, learn $\Phi$ by sampling (z, u) [Rao-Blackwellized Gibbs sampling]
For fixed $\Phi$, update T using stochastic gradient ascent on the conditional log-likelihood $\sum_d \log p(y_d \mid w_d)$, in an online fashion: get an approximate gradient using Monte Carlo EM, with a harmonic mean estimator for the likelihood terms
Currently, the resulting gradient estimates are noisy...
Inference (dimensionality reduction):
Given the learned T and $\Phi$: estimate $p(w \mid y)$ using the harmonic mean estimator
Compute $\mathbb{E}[\theta \mid w]$ by marginalizing over y to get the new representation of the document
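A sketch of that reduction step, assuming the per-class posterior means $\mathbb{E}[\theta \mid w, y]$ and the log-likelihood estimates are already available (e.g. from the Gibbs sampler and the harmonic mean estimator); all names and numbers are hypothetical:

```python
import numpy as np

def reduced_representation(theta_per_class, log_lik, log_prior):
    """Marginalize over the unknown class: compute p(y | w) from the
    per-class log-likelihood estimates log p(w | y) plus a log-prior,
    then average the per-class posterior means E[theta | w, y]."""
    log_post = log_lik + log_prior
    post = np.exp(log_post - log_post.max())     # stable softmax
    post /= post.sum()
    return post @ theta_per_class                # sum_y p(y|w) E[theta|w,y]

theta_per_class = np.array([[0.7, 0.2, 0.1],     # hypothetical estimates
                            [0.1, 0.3, 0.6]])
log_lik = np.array([-10.0, -12.0])
print(reduced_representation(theta_per_class, log_lik, np.log([0.5, 0.5])))
```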
Preliminary Experiments
20 Newsgroups dataset: 11k train / 7.5k test, vocabulary of 50k words
Used a fixed T with shared and class-specific blocks, hence 110 topics
Get the reduced representation -> train a linear SVM on it
Classification results:
discLDA + SVM: 20% error
LDA + SVM: 25% error
discLDA predictions (using $p(y \mid w)$ directly): 20% error
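The pipeline in sketch form, with random placeholder features standing in for the real reduced representations (so only the shape of the experiment is meaningful here, not the numbers):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_reduced = rng.dirichlet(np.ones(110), size=500)  # placeholder for E[theta|w]
labels = rng.integers(0, 20, size=500)             # placeholder newsgroup ids

clf = LinearSVC().fit(X_reduced[:400], labels[:400])
print("test error:", 1.0 - clf.score(X_reduced[400:], labels[400:]))
```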
Newsgroup embeddings (LDA vs. discLDA)
[Figures: t-SNE embeddings of the 20 Newsgroups documents under LDA and under discLDA; thanks to Laurens van der Maaten (Hinton's group) for the figures!]
Learned topics
[Figure: example learned topics]
Another embedding: NIPS papers vs. Psychology abstracts
[Figure: embeddings under LDA and under discLDA]
13 scenes dataset [Fei-Fei 2005]
train: 100 images per category, test: 2558 images
[Figures: vocabulary of visual words; learned topics]
Conclusion
fixed transformation T enables topic sharing & exploration
get reduced representation which preserves predictive power
noisy gradient estimates: still work in progress; will probably try a variational approach instead