Section 6: Approximation-aware Training
Outline
• Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years?
• Then this tutorial is extremely practical for you!
1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What’s going on here? Can we trust BP’s estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!
Modern NLP
(Figure: NLP sits at the intersection of Linguistics, Mathematical Modeling, Machine Learning, and Combinatorial Optimization.)
Machine Learning for NLP: Linguistics
Linguistics inspires the structures we want to predict.
Machine Learning for NLP: Mathematical Modeling
Our model defines a score for each structure:
pθ( ) = 0.50
pθ( ) = 0.25
pθ( ) = 0.10
pθ( ) = 0.01
…
It also tells us what to optimize.
Machine Learning for NLP: Machine Learning
Learning tunes the parameters of the model.
Given training instances {(x1, y1), (x2, y2), …, (xn, yn)}, find the best model parameters, θ.
Machine Learning for NLP: Combinatorial Optimization
Inference finds the best structure, y*, for a new sentence, xnew:
• Given a new sentence, xnew
• Search over the set of all possible structures (often exponential in the size of xnew)
• Return the highest-scoring structure, y* (or, more generally, the Minimum Bayes Risk (MBR) structure)
(Inference is usually called as a subroutine in learning.)
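Minimum Bayes Risk decoding returns the structure with the lowest expected loss under the model. For a loss that decomposes over variables (e.g., Hamming loss over tags), this reduces to picking each variable's most probable value under its marginal. Below is a minimal sketch of that special case; the dictionary layout and the toy marginals are illustrative assumptions, not part of the tutorial's software.

```python
def mbr_decode(marginals):
    """MBR decoding under Hamming loss: for each variable, return the value
    with the highest marginal probability (the marginals may be exact, or
    approximate beliefs computed by BP)."""
    return {var: max(dist, key=dist.get) for var, dist in marginals.items()}

# Toy marginals for three tag variables (values are illustrative):
marginals = {
    "y1": {"n": 0.7, "v": 0.3},
    "y2": {"n": 0.4, "v": 0.6},
    "y3": {"p": 0.8, "v": 0.2},
}
print(mbr_decode(marginals))   # {'y1': 'n', 'y2': 'v', 'y3': 'p'}
```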
(Inference problems range along a spectrum from easy, to polynomial time, to NP-hard.)
Modern NLP
• Linguistics inspires the structures we want to predict.
• Our model defines a score for each structure; it also tells us what to optimize.
• Learning tunes the parameters of the model.
• Inference finds the best structure for a new sentence. (Inference is usually called as a subroutine in learning.)
These four pieces (Linguistics, Mathematical Modeling, Machine Learning, and Combinatorial Optimization) come together to make up modern NLP.
An Abstraction for Modeling
The factor graph gives us a single level of abstraction for mathematical modeling: now we can work at this level of abstraction.
Training
Thus far, we’ve seen how to compute (approximate) marginals, given a factor graph…
…but where do the potential tables ψα come from?
– Some have a fixed structure (e.g., Exactly1, CKYTree).
– Others could be trained ahead of time (e.g., TrigramHMM).
– For the rest, we define them parametrically and learn the parameters!
Two ways to learn:
1. Standard CRF training (very simple; often yields state-of-the-art results)
2. ERMA (less simple; but takes the approximations and the loss function into account)
Standard CRF Parameterization
Define each potential function in terms of a fixed set of feature functions of the predicted variables yα and the observed variables x:
ψα(yα, x) = exp(θ · fα(yα, x))
Standard CRF Parameterization
Example: tagging the sentence “time flies like an arrow”. The tag variables take the values n, v, p, d, n; unary factors ψ1, ψ3, ψ5, ψ7, ψ9 attach each tag to its word, and factors ψ2, ψ4, ψ6, ψ8 connect adjacent tags.
Standard CRF Parameterization
The same parameterization extends to richer structure: on top of the tag variables and factors above, we add constituent variables np (factor ψ10), pp (ψ11), vp (ψ12), and s (ψ13), all defined by the same kind of feature functions.
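To make the parameterization concrete, here is a minimal sketch of a single log-linear potential ψα(yα) = exp(θ · fα(yα, x)). The weight map, the feature names, and the emission-style feature function are illustrative assumptions, not the tutorial's actual code.

```python
import math

def potential(assignment, x, weights, feature_fn):
    """Log-linear potential: psi_alpha(y_alpha) = exp(theta . f_alpha(y_alpha, x)).
    `assignment` holds the values of the variables this factor touches, `x` is the
    observed input, `weights` maps feature names to theta values, and `feature_fn`
    returns the active features (all names here are made up for illustration)."""
    features = feature_fn(assignment, x)          # e.g. {"tag=n^word=time": 1.0}
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return math.exp(score)

# Toy emission-style factor for a tagging example:
def emission_features(assignment, x):
    (tag,), word = assignment, x
    return {f"tag={tag}^word={word}": 1.0}

weights = {"tag=n^word=time": 1.2, "tag=v^word=time": -0.3}
print(potential(("n",), "time", weights, emission_features))   # exp(1.2)
```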
What is Training?
That’s easy:
Training = picking good model parameters!
But how do we know if the model parameters are any “good”?
Conditional Log-likelihood Training
1. Choose model (such that the derivative in #3 is easy)
2. Choose objective: assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule
4. Replace exact inference by approximate inference
The gradient in #3 is the observed feature counts minus the expected feature counts, and the expectations decompose over factor marginals. We can approximate the factor marginals by the (normalized) factor beliefs from BP!
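As a sketch of what that gradient looks like once BP's beliefs stand in for the true factor marginals: for one training example, accumulate the observed feature counts and subtract the belief-weighted (expected) feature counts. The container layouts below (factor list, belief tables, feature functions) are assumptions for illustration, not the tutorial's data structures.

```python
from collections import defaultdict

def cll_gradient(factors, gold, beliefs, x):
    """Gradient of log p(y|x) for one example:
         observed feature counts  -  expected feature counts,
    where the expectation sums over each factor's assignments, weighted by that
    factor's marginal.  Passing BP's normalized factor beliefs as `beliefs`
    gives the approximate gradient used in standard CRF training.

    factors: list of (name, domain, feature_fn); domain enumerates the factor's
             joint assignments and feature_fn(y_alpha, x) returns a feature dict.
    gold:    gold assignment for each factor, keyed by factor name.
    beliefs: beliefs[name][y_alpha] = (approximate) marginal of that assignment."""
    grad = defaultdict(float)
    for name, domain, feature_fn in factors:
        for feat, val in feature_fn(gold[name], x).items():
            grad[feat] += val                      # observed counts
        for y_alpha in domain:
            b = beliefs[name][y_alpha]
            for feat, val in feature_fn(y_alpha, x).items():
                grad[feat] -= b * val              # expected counts
    return dict(grad)
```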
What’s wrong with the usual approach?
If you add too many factors, your predictions might get worse!
• The model might be richer, but we replace the true marginals with approximate marginals (e.g., beliefs computed by BP).
• Approximate inference can cause gradients for structured learning to go awry (Kulesza & Pereira, 2008).
What’s wrong with the usual approach?
Mistakes made by standard CRF training:
1. Using BP (approximate)
2. Not taking the loss function into account
3. Should be doing MBR decoding
What we have is a big pile of approximations… which has tunable parameters. Treat it like a neural net, and run backprop!
Empirical Risk Minimization
1. Given training data: {(x1, y1), (x2, y2), …, (xn, yn)}
2. Choose each of these:
– Decision function fθ (examples: linear regression, logistic regression, neural network)
– Loss function ℓ (examples: mean-squared error, cross entropy)
3. Define goal: θ* = argminθ (1/n) Σi ℓ(fθ(xi), yi)
4. Train with SGD (take small steps opposite the gradient): θ ← θ − η ∇θ ℓ(fθ(xi), yi)
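A minimal sketch of this recipe, using a hypothetical one-dimensional linear regression with mean-squared error as the decision/loss pair; the function names and the tiny dataset are illustrative, not from the tutorial.

```python
import random

def train_sgd(data, loss_grad, theta, lr=0.1, epochs=5):
    """Minimize the empirical risk (1/n) * sum_i loss(f_theta(x_i), y_i) with SGD.
    `loss_grad(theta, x, y)` returns the gradient of the per-example loss with
    respect to theta (a placeholder interface, not a fixed API)."""
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            g = loss_grad(theta, x, y)
            theta = [t - lr * gi for t, gi in zip(theta, g)]  # small step opposite the gradient
    return theta

# Example: linear regression f_theta(x) = w*x + b with mean-squared error.
def mse_grad(theta, x, y):
    w, b = theta
    err = (w * x + b) - y
    return [2 * err * x, 2 * err]

theta = train_sgd([(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)], mse_grad, [0.0, 0.0])
```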
Conditional Log-likelihood Training
1. Choose model (such that the derivative in #3 is easy)
2. Choose objective: assign high probability to the things we observe and low probability to everything else
3. Compute derivative by hand using the chain rule
4. Replace true inference by approximate inference
What went wrong?
How did we compute these approximate marginal probabilities anyway? By Belief Propagation, of course!
Error Back-Propagation
(Animated sequence of slides from Stoyanov & Eisner, 2012: the error in a predicted marginal such as P(y3 = noun | x) is back-propagated through the BP message computations, e.g. μ(y1→y2) = μ(y3→y1) · μ(y4→y1), all the way back to the parameters θ.)
Error Back-Propagation
• Apply the chain rule of differentiation over and over.
• Forward pass: regular computation (inference + decoding) in the model (and remember intermediate quantities).
• Backward pass: replay the forward pass in reverse, computing gradients.
Background: Backprop through time
(Robinson & Fallside, 1987; Werbos, 1988; Mozer, 1995)
Recurrent neural network: at each step, the hidden state bt is computed from the parameters a, the previous hidden state, and the input xt; the output y is read off at the end of the sequence.
BPTT:
1. Unroll the computation over time (a, x1 → b1; x2 → b2; x3 → b3; x4 → y4).
2. Run backprop through the resulting feed-forward network.
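A minimal sketch of BPTT on a toy RNN with a scalar hidden state: the forward pass unrolls the recurrence and remembers the intermediate states, and the backward pass replays it in reverse, applying the chain rule. The particular recurrence and weights are illustrative assumptions.

```python
import math

def bptt(x_seq, y_target, W, U, v):
    """Backprop through time for a tiny RNN with scalar hidden state:
         b_t = tanh(W * b_{t-1} + U * x_t),   y = v * b_T,   loss = 0.5*(y - y*)^2."""
    # forward pass: unroll over time, remembering intermediate quantities
    b = [0.0]
    for x in x_seq:
        b.append(math.tanh(W * b[-1] + U * x))
    y = v * b[-1]
    loss = 0.5 * (y - y_target) ** 2

    # backward pass: replay the unrolled network in reverse
    dW = dU = 0.0
    dv = (y - y_target) * b[-1]
    db = (y - y_target) * v                    # gradient w.r.t. b_T
    for t in range(len(x_seq), 0, -1):
        dpre = db * (1.0 - b[t] ** 2)          # through tanh
        dW += dpre * b[t - 1]
        dU += dpre * x_seq[t - 1]
        db = dpre * W                          # gradient w.r.t. b_{t-1}
    return loss, dW, dU, dv

loss, dW, dU, dv = bptt([0.5, -1.0, 0.25, 2.0], 1.0, W=0.8, U=0.3, v=1.1)
```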
ERMA
Empirical Risk Minimization under Approximations (ERMA) (Stoyanov, Ropson, & Eisner, 2011):
• Apply backprop through time to loopy BP.
• Unrolls the BP computation graph.
• Includes inference, decoding, the loss, and all the approximations along the way.
ERMA (Stoyanov, Ropson, & Eisner, 2011)
Key idea: Open up the black box!
1. Choose the model to be the computation with all its approximations.
2. Choose the objective to likewise include the approximations.
3. Compute the derivative by backpropagation (treating the entire computation as if it were a neural network).
4. Make no approximations! (Our gradient is exact.)
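To see what “opening up the black box” might look like in code, here is a minimal sketch on a toy three-variable model: the unrolled loopy BP iterations, the resulting beliefs, and a loss on those beliefs are written as one ordinary differentiable function, and an automatic-differentiation library (JAX here) returns the exact gradient of that approximate computation. The model, the squared-error risk, and the use of JAX are illustrative assumptions; the original ERMA work derives the backward pass through the unrolled BP graph itself rather than relying on a particular autodiff library.

```python
import jax
import jax.numpy as jnp

# Toy model: a loop of 3 binary variables, unary potentials exp(theta[v]),
# and a fixed "prefer to agree" pairwise potential on every edge.
EDGES = [(0, 1), (1, 2), (2, 0)]
PAIR = jnp.array([[1.5, 0.5],
                  [0.5, 1.5]])

def bp_beliefs(theta, n_iters=5):
    """Unroll n_iters of loopy BP and return the variable beliefs.
    Because the unrolled computation is ordinary array code, gradients can
    flow back through every message update."""
    unary = jnp.exp(theta)                                   # shape (3, 2)
    keys = [(a, b) for (i, j) in EDGES for (a, b) in [(i, j), (j, i)]]
    msgs = {k: jnp.ones(2) for k in keys}
    for _ in range(n_iters):
        new = {}
        for (i, j) in keys:
            # product of the unary potential at i and messages into i, except from j
            incoming = unary[i]
            for (k, l) in keys:
                if l == i and k != j:
                    incoming = incoming * msgs[(k, l)]
            m = PAIR.T @ incoming                            # sum out variable i
            new[(i, j)] = m / m.sum()                        # normalize
        msgs = new
    beliefs = []
    for v in range(3):
        b = unary[v]
        for (k, l) in keys:
            if l == v:
                b = b * msgs[(k, l)]
        beliefs.append(b / b.sum())
    return jnp.stack(beliefs)

def risk(theta, gold):
    """Loss measured on the output of the *approximate* pipeline."""
    b = bp_beliefs(theta)
    return jnp.sum((b[:, 1] - gold) ** 2)

gold = jnp.array([1.0, 1.0, 0.0])
theta = jnp.zeros((3, 2))
grad = jax.grad(risk)(theta, gold)   # exact gradient of the approximate computation
```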
ERMA
Key idea: Open up the black box!
(Figure from Stoyanov, Ropson, & Eisner, 2011: the full empirical risk minimization pipeline, from the model through inference and the Minimum Bayes Risk (MBR) decoder to the loss, is treated as a single computation to be differentiated.)
Approximation-aware Learning
What if we’re using Structured BP instead of regular BP?
• No problem, the same approach still applies!
• The only difference is that we embed dynamic programming algorithms inside our computation graph.
(Gormley, Dredze, & Eisner, 2015)
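When a factor hides a dynamic programming algorithm (as in Structured BP), that dynamic program simply becomes more nodes in the computation graph. A minimal sketch with a differentiable forward algorithm for a linear chain; the shapes and the use of JAX are illustrative assumptions, not the tutorial's implementation.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def chain_log_partition(emissions, transitions):
    """Forward algorithm for a linear chain, written as ordinary array code.
    emissions:   (T, K) log-scores of each tag at each position
    transitions: (K, K) log-scores of tag -> tag moves
    Because the dynamic program is just a sequence of logsumexp steps, it can
    sit inside a larger computation graph and be differentiated through."""
    alpha = emissions[0]
    for t in range(1, emissions.shape[0]):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    return logsumexp(alpha)

# The gradient of log Z with respect to the emission log-scores equals the
# model's per-position tag marginals, so this little DP behaves like a factor
# whose outgoing quantities are exact.
emissions = jnp.zeros((5, 3))       # toy 5-word, 3-tag example (illustrative)
transitions = jnp.zeros((3, 3))
marginals = jax.grad(chain_log_partition)(emissions, transitions)
```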
Connection to Deep Learning
(Figure from Gormley, Yu, & Dredze, in submission: an output y is scored as exp(Θy f(x)).)
Empirical Risk Minimization under Approximations (ERMA)
(Figure from Stoyanov & Eisner, 2012, placing training methods on two axes, loss aware vs. approximation aware: MLE is neither, ERMA is both, and SVMstruct [Finley and Joachims, 2008], M3N [Taskar et al., 2003], and Softmax-margin [Gimpel & Smith, 2010] each account for the loss function.)