Discriminative Methods with Structure
Simon Lacoste-Julien
UC Berkeley
joint work with: Fei Sha, Ben Taskar, Dan Klein, Mike Jordan
March 21, 2008
« Discriminative method »
Decision-theoretic framework:
Loss: $\ell(y, \hat{y})$
Decision function: $h_{\mathbf{w}} : \mathcal{X} \to \mathcal{Y}$
Risk: $R(h) = \mathbb{E}[\ell(Y, h(X))]$
Contrast function: a tractable surrogate of the risk, minimized during training
« with structure » on outputs:
Handwriting recognition
Input: image of a handwritten word -> Output: 'brace'
The output space is huge!
Machine translation
Input: 'Ce n'est pas un autre problème de classification.'
Output: 'This is not another classification problem.'
« with structure » on inputs:
text documents -> latent variable model -> new representation -> classification
Structure on outputs: Discriminative Word Alignment project
(joint work with Ben Taskar, Dan Klein and Mike Jordan)
Word Alignment
What is the anticipated cost of collecting fees under the new proposal?
En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?
[Figure: word-by-word alignment links between the English sentence x and the French sentence y]
Key step in most machine translation systems
Overview
Review of large-margin word alignment [Taskar et al. EMNLP 05]
Two new extensions to the basic model:
fertility features
first-order interactions using quadratic assignment
Results on Hansards dataset
Feature-Based Alignment
Features:
Association: MI = 3.2, Dice = 4.1
Lexical pair: ID(proposal, proposition) = 1
Position in sentence: AbsDist = 5, RelDist = 0.3
Orthography: ExactMatch = 0, Similarity = 0.8
Resources: PairInDictionary
Other models: IBM2, IBM4
[Figure: alignment matrix between English word positions j and French word positions k, with a feature vector on each edge (j, k)]
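To make the association features concrete, here is a minimal sketch (not the project's code; `dice_scores` and the data layout are assumptions) that computes Dice scores from sentence-aligned data, the kind of statistic the large unlabeled corpus provides:

```python
from collections import Counter
from itertools import product

def dice_scores(bitext):
    """Dice association score for each (English, French) word pair:
    Dice(e, f) = 2 * C(e, f) / (C(e) + C(f)), where counts are numbers
    of sentence pairs in which the words (co-)occur."""
    count_e, count_f, count_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in bitext:
        e_types, f_types = set(e_sent), set(f_sent)
        count_e.update(e_types)
        count_f.update(f_types)
        count_ef.update(product(e_types, f_types))
    return {(e, f): 2.0 * c / (count_e[e] + count_f[f])
            for (e, f), c in count_ef.items()}

bitext = [("the cost".split(), "le coût".split()),
          ("the fees".split(), "les droits".split())]
print(dice_scores(bitext)[("cost", "coût")])  # 1.0: they always co-occur
```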
Scoring Whole Alignments
The score decomposes over edges: $s_{\mathbf{w}}(x, y) = \sum_{j,k} \mathbf{w}^\top \mathbf{f}(x, j, k)\, y_{jk}$, where $y_{jk} \in \{0, 1\}$ indicates whether English word j is aligned to French word k.
[Figure: the alignment matrix again, each edge (j, k) contributing its score when selected]
Prediction as a Linear Program
Relaxation: $\max_{y} \sum_{j,k} s_{jk}\, y_{jk}$ subject to the degree constraints $\sum_k y_{jk} \le 1$ and $\sum_j y_{jk} \le 1$, with $0 \le y_{jk} \le 1$.
Still guaranteed to have integral solutions y: the degree-constraint matrix of the bipartite matching polytope is totally unimodular.
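A minimal sketch of this prediction LP with scipy's solver (the talk used Mosek; `predict_alignment` is a hypothetical name). Because the degree-constraint matrix is totally unimodular, the optimum is integral and the final rounding only cleans up floating-point noise:

```python
import numpy as np
from scipy.optimize import linprog

def predict_alignment(score):
    """Maximize sum_jk score[j, k] * y[j, k] subject to degree
    constraints sum_k y[j, k] <= 1 and sum_j y[j, k] <= 1, with the
    relaxation 0 <= y[j, k] <= 1."""
    m, n = score.shape
    # One row per English word j: picks out y[j, 0..n-1] in y.ravel()
    deg_rows = np.kron(np.eye(m), np.ones((1, n)))
    # One row per French word k: picks out y[0..m-1, k]
    deg_cols = np.kron(np.ones((1, m)), np.eye(n))
    res = linprog(c=-score.ravel(),                 # linprog minimizes
                  A_ub=np.vstack([deg_rows, deg_cols]),
                  b_ub=np.ones(m + n), bounds=(0, 1))
    return res.x.reshape(m, n).round().astype(int)

score = np.array([[2.0, 0.1], [0.3, 1.5], [0.2, 0.4]])
print(predict_alignment(score))  # aligns 0 -> 0 and 1 -> 1; word 2 unaligned
```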
Learning w
Supervised training data
Training methods:
maximum likelihood/entropy
perceptron
maximum margin
Maximum Likelihood/Entropy
Probabilistic approach: $P_{\mathbf{w}}(y \mid x) \propto \exp(s_{\mathbf{w}}(x, y))$
Problem: the denominator sums over all alignments, which amounts to computing a matrix permanent; this is #P-complete [Valiant 79, Jerrum & Sinclair 93]
Can't find maximum likelihood parameters
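To see where the intractability comes from: restricted to one-to-one alignments of length-n sentences, the normalizer is the permanent of the matrix exp(s), and brute force sums n! terms. A toy sketch (hypothetical function; real alignments also allow unaligned words, which only makes the sum larger):

```python
import numpy as np
from itertools import permutations

def log_partition_bruteforce(score):
    """Normalizer of P(y|x) ∝ exp(sum_jk s_jk y_jk) over one-to-one
    alignments: the permanent of exp(score), summed over all n!
    permutations; this is the quantity that is #P-complete in general."""
    n = score.shape[0]
    total = sum(np.exp(sum(score[j, perm[j]] for j in range(n)))
                for perm in permutations(range(n)))
    return float(np.log(total))

print(log_partition_bruteforce(np.random.rand(8, 8)))  # already 8! = 40320 terms
```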
(Averaged) Perceptron
Perceptron for structured output [Collins 2002]:
For each example $(x_i, y_i)$:
Predict: $\hat{y} = \arg\max_{y} \mathbf{w}^\top \mathbf{f}(x_i, y)$
Update: $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{f}(x_i, y_i) - \mathbf{f}(x_i, \hat{y})$
Output averaged parameters: $\bar{\mathbf{w}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)}$
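A minimal sketch of the averaged structured perceptron, assuming two hypothetical helpers: `feat(x, y)` returns the joint feature vector and `predict(w, x)` solves the argmax (e.g. with the alignment LP above):

```python
import numpy as np

def averaged_perceptron(examples, feat, predict, epochs=10):
    """Structured perceptron [Collins 2002]: predict with the current
    weights, move toward the gold features on mistakes, and return the
    average of the weight vectors seen after every example."""
    dim = feat(*examples[0]).shape[0]
    w, w_sum = np.zeros(dim), np.zeros(dim)
    for _ in range(epochs):
        for x, y in examples:
            y_hat = predict(w, x)
            if not np.array_equal(y_hat, y):
                w = w + feat(x, y) - feat(x, y_hat)
            w_sum += w                      # averaging reduces variance
    return w_sum / (epochs * len(examples))
```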
Large Margin Estimation
$\min_{\mathbf{w}} \tfrac{1}{2}\|\mathbf{w}\|^2$ s.t. $\mathbf{w}^\top \mathbf{f}(x_i, y_i) \ge \max_{y} \left[\, \mathbf{w}^\top \mathbf{f}(x_i, y) + \ell(y_i, y) \,\right]$ for all i
(the true score must beat any other score by a margin that grows with the loss)
Equivalent min-max formulation [Taskar et al 04, 05]: the inner max is a simple LP, so applying LP duality turns the whole problem into a QP of polynomial size!
=> solved with Mosek
Experimental Setup
French-English Canadian Hansards corpus
Word-level aligned: 200 sentence pairs (training), 37 sentence pairs (validation), 247 sentence pairs (test)
Sentence-level aligned: 1M sentence pairs, used to generate association-based features and to learn unsupervised IBM models
Learn w using large margin
Evaluate alignment quality using the standard AER (Alignment Error Rate) [similar to F1]
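For reference, AER over sets of alignment links, following the standard definition with sure links S and possible links P (S a subset of P); a minimal sketch:

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate: with predicted links A, sure links S and
    possible links P, AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).
    Lower is better; 0 means every predicted link is at least possible
    and every sure link is recovered."""
    a, s, p = set(predicted), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

sure = {(0, 0), (1, 1)}
possible = sure | {(2, 2)}                  # sure links are also possible
print(aer({(0, 0), (1, 1), (2, 2)}, sure, possible))  # 0.0
```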
Old Results (200 train / 247 test split)
Model | AER | Prec / Rec
IBM Model 4 (intersected) | 6.5 | 98 / 88%
Basic | 8.2 | 93 / 90%
Basic + Model 4 | 5.1 | 98 / 92%
Improving the basic model
We would like to model:
Fertility: alignments are not necessarily 1-to-1
First-order interactions: alignments are mostly locally diagonal; we would like to score an edge depending on its neighbors
Strategy: extensions that keep the prediction model an LP
Modeling Fertility
Relax the degree constraints to allow fertility above 1, at the price of a learned fertility penalty.
Example of node feature: for word w, the fraction of times it had fertility > k on the training set
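A sketch of that node feature computed from word-aligned training data (the function name and data layout are assumptions, not the project's code):

```python
from collections import Counter, defaultdict

def fertility_features(alignments, max_k=3):
    """For each English word w, the fraction of its training occurrences
    whose fertility (number of linked French words) exceeds k, for
    k = 0 .. max_k - 1."""
    occurrences, exceed = Counter(), defaultdict(Counter)
    for e_sent, links in alignments:        # links: set of (j, k) pairs
        fert = Counter(j for j, _ in links)
        for j, w in enumerate(e_sent):
            occurrences[w] += 1
            for k in range(max_k):
                exceed[w][k] += fert[j] > k
    return {w: [exceed[w][k] / occurrences[w] for k in range(max_k)]
            for w in occurrences}

data = [("not a problem".split(), {(0, 0), (0, 1), (2, 3)})]
print(fertility_features(data)["not"])      # [1.0, 1.0, 0.0]: fertility 2
```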
Fertility Results (200 train / 247 test split)
Model | AER | Prec / Rec
IBM Model 4 (intersected) | 6.5 | 98 / 88%
Basic | 8.2 | 93 / 90%
Basic + Model 4 | 5.1 | 98 / 92%
Basic + Model 4 + fertility | 4.9 | 96 / 94%
Fertility example
[Figure: example alignment matrix; legend: sure alignments, possible alignments, predicted alignments]
Modeling First Order Effects
Restrict the first-order interactions to local patterns: monotonicity, local inversion, local fertility.
Want: score pairs of neighboring edges jointly. This is a quadratic assignment problem, an integer program that is NP-complete in general.
Relaxation: linearize the quadratic terms so prediction stays a (larger) LP.
On real-world sentences (2 to 30 words) it takes a few seconds using Mosek (~1k variables).
Interestingly, in our dataset 80% of examples yield an integer solution when solved via the linear relaxation, and we get the same AER when using the relaxation!
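A minimal sketch of one such linearized relaxation. It rewards only one of the patterns, monotone diagonal neighbors, with a single shared weight; since that reward is positive, the constraints z <= y_e on each pair variable z suffice to linearize the product of the two edge indicators. All names and the one-pattern restriction are simplifications for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def predict_with_qap(score, diag_bonus=0.5):
    """Alignment LP extended with a variable z_p for each pair of
    monotone neighbor edges (j, k) and (j+1, k+1). Each z_p earns
    diag_bonus and is tied to its two edges by z_p <= y_e, a valid
    linearization of y_e * y_e' because the bonus is positive."""
    m, n = score.shape
    ny = m * n
    pairs = [(j * n + k, (j + 1) * n + k + 1)
             for j in range(m - 1) for k in range(n - 1)]
    nz = len(pairs)
    c = np.concatenate([-score.ravel(), -diag_bonus * np.ones(nz)])
    # Degree constraints on y (z variables do not appear in them).
    A_deg = np.hstack([np.vstack([np.kron(np.eye(m), np.ones((1, n))),
                                  np.kron(np.ones((1, m)), np.eye(n))]),
                       np.zeros((m + n, nz))])
    # Linearization rows: z_p - y_e <= 0 for both edges e of pair p.
    A_lin = np.zeros((2 * nz, ny + nz))
    for p, (e1, e2) in enumerate(pairs):
        A_lin[2 * p, [e1, ny + p]] = [-1, 1]
        A_lin[2 * p + 1, [e2, ny + p]] = [-1, 1]
    res = linprog(c, A_ub=np.vstack([A_deg, A_lin]),
                  b_ub=np.concatenate([np.ones(m + n), np.zeros(2 * nz)]),
                  bounds=(0, 1))
    return res.x[:ny].reshape(m, n)

score = np.array([[1.0, 0.2], [0.2, 1.0]])
print(predict_with_qap(score))   # the diagonal alignment, reinforced by z
```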
New Results (200 train / 247 test split)
Model | AER | Prec / Rec
IBM Model 4 (intersected) | 6.5 | 98 / 88%
Basic | 8.2 | 93 / 90%
Basic + Model 4 | 5.1 | 98 / 92%
Basic + fertility + qap | 6.1 | 94 / 93%
Basic + fertility + qap + Model 4 | 4.3 | 96 / 95%
Basic + fertility + qap + Model 4 + Liang | 3.8 | 97 / 96%
Fertility + QAP example
[Figure: example alignments predicted by the fertility + QAP model]
Conclusions
Feature-based word alignment
Efficient algorithms for supervised learning
Exploit unsupervised data via features and other models
Surprisingly accurate with simple features
Fertility model and first-order interactions included
38% AER reduction over intersected Model 4
Lowest published AER on this data set
High-recall alignments -> promising for MT
Structure on inputs: discLDA project
(work in progress)
(joint work with Fei Sha and Mike Jordan)
Unsupervised dimensionality reduction
text documents -> latent variable model -> new representation -> classification
Analogy: PCA vs. FDA
[Figure: two classes of points; the PCA direction follows the overall variance, while the FDA direction separates the classes]
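The analogy in code: on two elongated, vertically separated Gaussian classes, PCA's top direction tracks the overall variance while FDA's direction tracks class separation (a self-contained sklearn sketch on synthetic data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two classes stretched along x but separated only along y.
X = np.vstack([rng.normal([0, 0], [5.0, 0.3], size=(100, 2)),
               rng.normal([0, 2], [5.0, 0.3], size=(100, 2))])
y = np.repeat([0, 1], 100)

print(PCA(n_components=1).fit(X).components_[0])   # ~ [1, 0]: variance axis
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_[0] / np.linalg.norm(lda.coef_))    # ~ [0, 1]: separating axis
```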
Goal: supervised dimensionality reduction
text documents -> latent variable model with supervised information -> new representation -> classification
Review: LDA model
Latent Dirichlet Allocation [Blei et al. 2003]: for each document, draw topic proportions $\theta \sim \mathrm{Dir}(\alpha)$; for each word, draw a topic $z_n \sim \mathrm{Mult}(\theta)$ and then a word $w_n \sim \mathrm{Mult}(\phi_{z_n})$.
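A minimal sketch of that generative process (hypothetical function name):

```python
import numpy as np

def sample_lda_corpus(n_docs, doc_len, alpha, phi, rng):
    """LDA generative model: per document theta ~ Dir(alpha); per word
    a topic z ~ Mult(theta), then a word id w ~ Mult(phi[z])."""
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)
        topics = rng.choice(len(alpha), size=doc_len, p=theta)
        docs.append([rng.choice(phi.shape[1], p=phi[z]) for z in topics])
    return docs

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(30), size=4)     # 4 topics over 30 word types
print(sample_lda_corpus(2, 10, 0.5 * np.ones(4), phi, rng))
```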
Discriminative version of LDA
Ultimately, we want to learn the topics discriminatively -> but this is a high-dimensional non-convex objective, hard to optimize!
Instead, we propose to learn a class-dependent linear transformation $T^y$ of the common $\theta$'s:
New generative model: $\theta \sim \mathrm{Dir}(\alpha)$ as before, but $z_n \sim \mathrm{Mult}(T^y \theta)$ for a document of class y.
Equivalently, a transformation on $\Phi$: mixing the topics $\Phi$ with proportions $T^y \theta$ is the same as mixing class-transformed topics with $\theta$.
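A toy numeric illustration of such class-dependent transformations, with a tiny block structure chosen by hand for clarity (these values are illustrative, not the project's T). Each column of $T^y$ is a distribution over topics, so $T^y \theta$ stays on the simplex:

```python
import numpy as np

# K = 2 document-level proportions, L = 3 topics: topic 2 is shared,
# topics 0 and 1 are class-specific.
T = {0: np.array([[1.0, 0.0],      # class 0 uses topic 0 ...
                  [0.0, 0.0],
                  [0.0, 1.0]]),    # ... plus the shared topic 2
     1: np.array([[0.0, 0.0],
                  [1.0, 0.0],      # class 1 uses topic 1 ...
                  [0.0, 1.0]])}    # ... plus the shared topic 2

theta = np.array([0.7, 0.3])       # theta ~ Dir(alpha), same for all y
for y, Ty in T.items():
    print(y, Ty @ theta)           # mixing proportions over z per class
```

With this block structure, the second coordinate of theta always feeds the shared topic while the first feeds a topic owned by the class, which is exactly Interpretation 1 below.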
Simplex Geometry
[Figure: documents of the two classes on the word simplex (vertices w1, w2, w3), and the topic simplex embedded inside it]
Interpretation 1
Shared topics vs. class-specific topics: with a block-structured $T^y$, every class mixes a common pool of shared topics with its own class-specific topics.
Interpretation 2
Generative model from T: add a new latent variable u per word, with $u_n \sim \mathrm{Mult}(\theta)$ and $z_n \sim \mathrm{Mult}(T^y_{\cdot, u_n})$, i.e. u selects a column of $T^y$.
Compare with the AT model
[Figure: graphical models of the Author-Topic model [Rosen-Zvi et al. 2004] and of discLDA, side by side]
Inference and learning
Learning:
For fixed T, learn $\Phi$ by sampling (z, u) [Rao-Blackwellized Gibbs sampling]
For fixed $\Phi$, update T using stochastic gradient ascent on the conditional log-likelihood $\sum_d \log p(y_d \mid w_d)$, in an online fashion: get an approximate gradient using Monte Carlo EM, with a harmonic mean estimator for the likelihood terms
Currently, the resulting gradient estimates are noisy...
Inference (dimensionality reduction):
Given the learned T and $\Phi$: estimate $p(w \mid y)$ using the harmonic mean estimator
Compute $\mathbb{E}[\theta \mid w]$ by marginalizing over y to get the new representation of the document
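A sketch of that reduction step, assuming the per-class posterior means $\mathbb{E}[\theta \mid w, y]$ and the log-likelihood estimates are already available (e.g. from the Gibbs sampler and the harmonic mean estimator); all names and numbers are hypothetical:

```python
import numpy as np

def reduced_representation(theta_per_class, log_lik, log_prior):
    """Marginalize over the unknown class: compute p(y | w) from the
    per-class log-likelihood estimates log p(w | y) plus a log-prior,
    then average the per-class posterior means E[theta | w, y]."""
    log_post = log_lik + log_prior
    post = np.exp(log_post - log_post.max())     # stable softmax
    post /= post.sum()
    return post @ theta_per_class                # sum_y p(y|w) E[theta|w,y]

theta_per_class = np.array([[0.7, 0.2, 0.1],     # hypothetical estimates
                            [0.1, 0.3, 0.6]])
log_lik = np.array([-10.0, -12.0])
print(reduced_representation(theta_per_class, log_lik, np.log([0.5, 0.5])))
```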
Preliminary Experiments
20 Newsgroups dataset: 11k train / 7.5k test, vocabulary of 50k words
Used a fixed T with shared and class-specific blocks, hence 110 topics
Get the reduced representation -> train a linear SVM on it
Classification results:
discLDA + SVM: 20% error
LDA + SVM: 25% error
discLDA predictions (using $p(y \mid w)$ directly): 20% error
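The pipeline in sketch form, with random placeholder features standing in for the real reduced representations (so only the shape of the experiment is meaningful here, not the numbers):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_reduced = rng.dirichlet(np.ones(110), size=500)  # placeholder for E[theta|w]
labels = rng.integers(0, 20, size=500)             # placeholder newsgroup ids

clf = LinearSVC().fit(X_reduced[:400], labels[:400])
print("test error:", 1.0 - clf.score(X_reduced[400:], labels[400:]))
```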
Newsgroup embeddings (LDA vs. discLDA)
[Figures: t-SNE embeddings of the 20 Newsgroups documents under LDA and under discLDA; thanks to Laurens van der Maaten (Hinton's group) for the figures!]
Learned topics
[Figure: example learned topics]
Another embedding: NIPS papers vs. Psychology abstracts
[Figure: embeddings under LDA and under discLDA]
13 scenes dataset [Fei-Fei 2005]
train: 100 images per category, test: 2558 images
[Figures: vocabulary of visual words; learned topics]
Conclusion
fixed transformation T enables topic sharing & exploration
get reduced representation which preserves predictive power
noisy gradient estimates: still work in progress; will probably try a variational approach instead