Edinburgh MT lecture 11: Neural language models
Transcript of Edinburgh MT lecture 11: Neural language models
“Neural” language models
![Page 2: Edinburgh MT lecture 11: Neural language models](https://reader036.fdocuments.us/reader036/viewer/2022062503/58f0b44b1a28ab9e078b45c9/html5/thumbnails/2.jpg)
Probabilistic language models
• Language models play the role of ...
• a judge of grammaticality
• a judge of semantic plausibility
• an enforcer of stylistic consistency
LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)
What is the probability of a sentence?
• Requirements
• Assign a probability to every sentence (i.e., string of words)
• Questions
• How many sentences are there in English?
• Too many :)
The same requirements, stated formally:

∑_{e ∈ Σ*} p_LM(e) = 1

p_LM(e) ≥ 0  ∀ e ∈ Σ*
n-gram LMs

Treat the sentence e as a vector-valued random variable:

p_LM(e) = p(e₁, e₂, e₃, …, e_ℓ)
        = p(e₁) × p(e₂ | e₁) × p(e₃ | e₁, e₂) × p(e₄ | e₁, e₂, e₃) × ⋯ × p(e_ℓ | e₁, …, e_{ℓ−1})
The n-gram approximation truncates each history; for a bigram model:

p_LM(e) ≈ p(e₁ | START) × ∏_{i=2}^{ℓ} p(e_i | e_{i−1}) × p(STOP | e_ℓ)
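This bigram factorization is easy to compute directly. A minimal sketch, where the probability table and the sentence are invented for illustration:

```python
# Hypothetical bigram probability table; the numbers are made up for illustration.
bigram = {
    ("START", "my"): 0.08,
    ("my", "friends"): 0.24,
    ("friends", "call"): 0.10,
    ("call", "me"): 0.83,
    ("me", "Alex"): 0.03,
    ("Alex", "STOP"): 0.26,
}

def bigram_prob(words, table):
    """p_LM(e) under the bigram approximation, with START/STOP padding."""
    padded = ["START"] + words + ["STOP"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= table.get((prev, cur), 0.0)   # an unseen bigram zeroes the product
    return p

p = bigram_prob(["my", "friends", "call", "me", "Alex"], bigram)
```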
| Outcome    | p(· \| in) | p(· \| the) |
|------------|------------|-------------|
| the        | 0.6        | 0.01        |
| and        | 0.04       | 0.01        |
| said       | 0.009      | 0.003       |
| says       | 0.00001    | 0.009       |
| of         | 0.1        | 0.002       |
| why        | 0.1        | 0.003       |
| Why        | 0.00008    | 0.0006      |
| restaurant | 0.0000008  | 0.2         |
| destitute  | 0.00000064 | 0.1         |
As we’ve seen, probability tables like this are the workhorses of language and translation modeling.
Whence parameters?
Estimation.
p(x | y) = p(x, y) / p(y)

p_MLE(x) = count(x) / N

p_MLE(x, y) = count(x, y) / N

p_MLE(x | y) = count(x, y)/N × N/count(y) = count(x, y) / count(y)
p_MLE(call | friends) = count(friends call) / count(friends)
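Relative-frequency estimation is a few lines of counting. A sketch with an invented toy corpus (the `p_mle` and `p_mle_cond` names are mine, not the lecture's):

```python
from collections import Counter

# Invented toy corpus (11 tokens).
corpus = "my friends call me Alex . my friends call me late".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_mle(x):
    """p_MLE(x) = count(x) / N"""
    return unigram[x] / N

def p_mle_cond(x, y):
    """p_MLE(x | y) = count(y x) / count(y)"""
    return bigram[(y, x)] / unigram[y]
```

In this corpus "friends" is always followed by "call", so p_MLE(call | friends) = 2/2 = 1.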
LM Evaluation

• Extrinsic evaluation: build a new language model, use it for some task (MT, ASR, etc.)
• Intrinsic evaluation: measure how well we model language itself

We will use perplexity to evaluate models. Given a test corpus w and a model p_LM:

PPL = 2^{−(1/|w|) log₂ p_LM(w)}

0 ≤ PPL ≤ ∞
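A sketch of the perplexity formula, checked against the uniform-LM case: a uniform model over a vocabulary of size |Σ| should score a perplexity of exactly |Σ|. The vocabulary size and sequence length here are arbitrary:

```python
import math

def perplexity(log2_probs):
    """PPL = 2 ** (-(1/|w|) * log2 p_LM(w)); the argument is a list of
    per-word log2 probabilities, whose sum is log2 p_LM(w)."""
    return 2 ** (-sum(log2_probs) / len(log2_probs))

# A uniform LM over a vocabulary of size V assigns log2 p = -log2 V to
# every word, so its perplexity is V itself.
V = 1000
ppl = perplexity([-math.log2(V)] * 20)
```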
Perplexity

• Generally correlates fairly well with BLEU for n-gram models
• Perplexity is a generalization of the notion of branching factor
• How many choices do I have at each position?
• State-of-the-art English LMs have a PPL of ~100 word choices per position
• A uniform LM has a perplexity of |Σ|
• Humans do much better
• ... and bad models can do even worse than uniform!
MLE & Perplexity
• What is the lowest (best) perplexity possible for your model class?
• Compute the MLE!
• Well, that’s easy...
Scoring two sentences under an MLE bigram model (log₂ probability under each factor):

START my friends call me Alex STOP
p(my | START) × p(friends | my) × p(call | friends) × p(me | call) × p(Alex | me) × p(STOP | Alex)
−3.65172        −2.07101         −3.32231           −0.271271       −4.961         −1.96773

START my friends dub me Alex STOP
p(my | START) × p(friends | my) × p(dub | friends) × p(me | dub) × p(Alex | me) × p(STOP | Alex)
−3.65172        −2.07101         −∞                 −2.54562        −4.961         −1.96773

MLE assigns probability zero to unseen events: the bigram “friends dub” never occurs in the training data, so the second sentence receives probability zero.
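The same computation in log space shows why the zero matters: sums of log-probabilities are fine until a single −∞ term appears. The numbers are the log₂ values from the example above:

```python
import math

# log2 probabilities from the worked example; p(dub | friends) has count zero
# under the MLE, so its probability is 0 and its log is -infinity.
seen_terms = [-3.65172, -2.07101, -3.32231, -0.271271, -4.961, -1.96773]
unseen_terms = [-3.65172, -2.07101, float("-inf"), -2.54562, -4.961, -1.96773]

seen = sum(seen_terms)       # finite log-probability
unseen = sum(unseen_terms)   # -inf: one zero factor kills the whole product

sentence_prob = 2 ** unseen  # back in probability space: exactly 0.0
```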
Zeros

• Two kinds of zero probabilities:
• Sampling zeros: zeros in the MLE due to impoverished observations
• Structural zeros: zeros that should be there. Do these really exist?
• Just because you haven’t seen something doesn’t mean it doesn’t exist.
• In practice, we don’t like probability zero, even if there is an argument that it is a structural zero.
Smoothing

Smoothing refers to a family of estimation techniques that seek to model important general patterns in data while avoiding modeling noise or sampling artifacts. In particular, for language modeling, we seek

p(e) > 0  ∀ e ∈ Σ*

We will assume that Σ is known and finite.
Add-α Smoothing

p ∼ Dirichlet(α)
x_i ∼ Categorical(p)  ∀ 1 ≤ i ≤ |x|

Assuming this model, what is the most probable value of p, having observed training data x?

(bunch of calculus - read about it on Wikipedia)

p*_x = (count(x) + α_x − 1) / (N + ∑_{x′} (α_{x′} − 1))  ∀ α_x > 1
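The MAP estimate above is mechanical to apply. A sketch with a symmetric prior; the toy counts and the choice α = 2 are invented:

```python
from collections import Counter

def dirichlet_map(counts, alpha, vocab):
    """MAP estimate under a symmetric Dirichlet(alpha) prior, alpha > 1:
    p*_x = (count(x) + alpha - 1) / (N + |vocab| * (alpha - 1))."""
    N = sum(counts.values())
    denom = N + len(vocab) * (alpha - 1)
    return {x: (counts[x] + alpha - 1) / denom for x in vocab}

vocab = {"the", "and", "dub"}
counts = Counter({"the": 6, "and": 4})   # "dub" is unseen (count 0)
p = dirichlet_map(counts, alpha=2.0, vocab=vocab)
# every word now has nonzero probability, and the estimates sum to 1
```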
Add-α Smoothing

• Simplest possible smoother
• Surprisingly effective in many models
• Does not work well for language models
• There are procedures for dealing with 0 < α < 1
• When might these be useful?
Interpolation

• “Mixture of MLEs”

p(dub | my friends) = λ₃ p_MLE(dub | my friends) + λ₂ p_MLE(dub | friends) + λ₁ p_MLE(dub) + λ₀ (1 / |Σ|)

Where do the lambdas come from?
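A sketch of the mixture, with invented λ weights and component estimates; note the result is nonzero even when the higher-order MLEs are zero:

```python
def interpolate(lambdas, p_tri, p_bi, p_uni, vocab_size):
    """Mixture of MLEs: l3*p(w|u v) + l2*p(w|v) + l1*p(w) + l0/|Sigma|.
    The mixture weights must be nonnegative and sum to 1."""
    l3, l2, l1, l0 = lambdas
    assert abs(l3 + l2 + l1 + l0 - 1.0) < 1e-9
    return l3 * p_tri + l2 * p_bi + l1 * p_uni + l0 / vocab_size

# Hypothetical component estimates for p(dub | my friends): the trigram and
# bigram MLEs are zero, but the mixture is not.
p = interpolate((0.5, 0.3, 0.15, 0.05),
                p_tri=0.0, p_bi=0.0, p_uni=0.0001, vocab_size=10000)
```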
Discounting

Discounting adjusts the frequencies of observed events downward to reserve probability for the things that have not been observed.

We introduce a discounted frequency:

0 ≤ f*(w₃ | w₁, w₂) ≤ f(w₃ | w₁, w₂)

Note that f(w₃ | w₁, w₂) > 0 only when count(w₁, w₂, w₃) > 0.

The total discount is the zero-frequency probability:

λ(w₁, w₂) = 1 − ∑_{w′} f*(w′ | w₁, w₂)
Back-off

Recursive formulation of probability:

p_BO(w₃ | w₁, w₂) = f*(w₃ | w₁, w₂)                          if f*(w₃ | w₁, w₂) > 0
                    α_{w₁,w₂} × λ(w₁, w₂) × p_BO(w₃ | w₂)    otherwise

where α_{w₁,w₂} is the “back-off weight”.

Question 1: how do we discount?
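A sketch of discounting plus back-off on a toy corpus. The absolute discount D = 0.5 is an invented choice, and the sketch omits the α normalizer, so it is illustrative rather than a proper Katz model:

```python
from collections import Counter

# Invented toy corpus.
corpus = "my friends call me my friends dub me".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
D = 0.5  # absolute discount (a hypothetical choice)

def f_star(w2, w1):
    """Discounted frequency f*(w2 | w1) = (count(w1 w2) - D) / count(w1), floored at 0."""
    c = bigrams[(w1, w2)]
    return max(c - D, 0.0) / unigrams[w1] if c > 0 else 0.0

def zero_freq(w1):
    """lambda(w1) = 1 - sum_w' f*(w' | w1): the mass reserved by discounting."""
    return 1.0 - sum(f_star(w2, w1) for w2 in unigrams)

def p_bo(w2, w1):
    """Use f* when it is nonzero; otherwise back off to the unigram MLE,
    scaled by the reserved mass (alpha normalizer omitted)."""
    fs = f_star(w2, w1)
    return fs if fs > 0 else zero_freq(w1) * unigrams[w2] / len(corpus)
```

Here p_bo("me", "friends") is nonzero even though "friends me" never occurs.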
Kneser-Ney Discounting

• State-of-the-art in language modeling for 15 years
• Two major intuitions
• Some contexts have lots of new words
• Some words appear in lots of contexts
• Procedure
• Only register a lower-order count the first time it is seen in a backoff context
• Example: bigram model
• “San Francisco” is a common bigram
• But, we only count the unigram “Francisco” the first time we see the bigram “San Francisco” - we change its unigram probability
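The continuation-count idea can be shown on a toy corpus: “Francisco” is frequent, but it only ever follows “San”, so its continuation count is 1 (the corpus is invented for illustration):

```python
from collections import Counter

corpus = "I live in San Francisco . we love San Francisco . San Francisco fog".split()

bigrams = Counter(zip(corpus, corpus[1:]))

# Continuation count: in how many distinct left contexts does w appear?
continuation = Counter()
for (w1, w2) in bigrams:
    continuation[w2] += 1

raw = Counter(corpus)["Francisco"]    # raw unigram count: 3
cont = continuation["Francisco"]      # distinct contexts: only "San", so 1
```

Kneser-Ney's lower-order estimate uses `cont`, not `raw`, so "Francisco" gets a small unigram probability despite being a frequent word.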
Back-off

p_BO(w₃ | w₁, w₂) = f*(w₃ | w₁, w₂)                          if f*(w₃ | w₁, w₂) > 0
                    α_{w₁,w₂} × λ(w₁, w₂) × p_BO(w₃ | w₂)    otherwise

Question 1: how do we discount?
Question 2: how many parameters?
“Neurons”

A neuron receives a vector of inputs x = (x₁, x₂, x₃, x₄, x₅).

“While the brain metaphor is intriguing, it is also distracting and cumbersome to manipulate mathematically. We therefore switch to using more concise mathematical notation.” (Goldberg, 2015)
Each input xᵢ is multiplied by a weight wᵢ; the neuron outputs the weighted sum, optionally passed through an activation function g:

y = x · w
y = g(x · w)
“Neural” Networks

A layer of neurons shares the input x, but each output yⱼ has its own weight vector (a column of the matrix W, with entries such as w₄,₁):

y = x⊤W
y = g(x⊤W)

What about probabilities? Let each yᵢ be an outcome, and choose g to be the “soft max”:

g(u)ᵢ = exp(uᵢ) / ∑_{i′} exp(u_{i′})
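A sketch of the soft max, with the standard max-subtraction trick for numerical stability (not shown on the slide):

```python
import math

def softmax(u):
    """g(u)_i = exp(u_i) / sum_i' exp(u_i'); subtracting max(u) first leaves
    the result unchanged but avoids overflow for large scores."""
    m = max(u)
    exps = [math.exp(x - m) for x in u]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# a valid distribution over the outcomes y_i: positive entries summing to 1
```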
“Deep”

Stack layers: the hidden layer y = h(x⊤W) feeds a second weight matrix V to produce the output z:

z = g(y⊤V)
z = g(h(x⊤W)⊤V)
z = g(V h(W x))

Note: if g(x) = h(x) = x, then z = (V W) x = U x, so without non-linearities the stack collapses to a single linear map.
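The collapse to a single linear map can be checked numerically. A sketch with tiny hand-picked matrices and identity activations:

```python
def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

identity = lambda v: v   # g(x) = h(x) = x

W = [[1.0, 2.0], [3.0, 4.0]]
V = [[0.5, -1.0], [1.5, 0.25]]
x = [1.0, -2.0]

deep = matvec(V, identity(matvec(W, x)))   # z = g(V h(W x)) with identity activations
flat = matvec(matmul(V, W), x)             # z = (V W) x = U x
# the two agree: without non-linearities, depth adds no capacity
```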
Design Decisions
• How to represent inputs and outputs?
• Neural architecture?
• How many layers? (Requires non-linearities to improve capacity!)
• How many neurons?
• Recurrent or not?
• What kind of non-linearities?
Representing Language
• “One-hot” vectors
• Each position in a vector corresponds to a word type
• Sequence of words, sequence of vectors
• Bag of words: multiple vectors
• Distributed representations
• Vectors encode “features” of input words (character n-grams, morphological features, etc.)
Training Neural Networks

• Neural networks are supervised models: you need a set of inputs paired with outputs
• Algorithm (repeat for a while):
• Give an input to the network, see what it predicts
• Compute loss(y, y*) and its (sub)gradient with respect to the parameters, using the chain rule, aka “back propagation”
• Update the parameters (SGD, AdaGrad, L-BFGS, etc.)
• The algorithm is automated; you just need to specify the model structure
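The loop above can be sketched end-to-end. The “network” here is a single sigmoid neuron trained with log-loss and SGD; the data, learning rate, and epoch count are all invented:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented, linearly separable toy data: (input vector, target).
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
w = [0.0, 0.0]
lr = 0.5

for epoch in range(500):                                        # "run for a while"
    for x, y_true in data:
        y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))   # forward pass
        # log-loss gradient via the chain rule: d(loss)/d(w_i) = (y_hat - y) * x_i
        grad = [(y_hat - y_true) * xi for xi in x]
        w = [wi - lr * g for wi, g in zip(w, grad)]             # SGD update

preds = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for x, _ in data]
```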
Bengio et al. (2003)

p(e) = ∏_{i=1}^{|e|} p(e_i | e_{i−n+1}, …, e_{i−1})

[Architecture: each context word e_{i−1}, e_{i−2}, e_{i−3} is mapped to an embedding by a shared matrix C; the concatenated embeddings pass through a tanh layer (weights W) and a softmax layer (weights V) to give p(e_i | e_{i−n+1}, …, e_{i−1}).]
Bengio et al. (2003)
Devlin et al. (2014)
• Turn Bengio et al. (2003) into a translation model
• Conditional model; generate the next English word conditioned on
• The previous n English words you generated
• The aligned source word, and its m neighbors
p(e | f, a) = ∏_{i=1}^{|e|} p(e_i | e_{i−2}, e_{i−1}, f_{aᵢ−1}, f_{aᵢ}, f_{aᵢ+1})

[Architecture: as in Bengio et al. (2003), but the input concatenates embeddings of the previous target words e_{i−1}, e_{i−2} (matrix C) with embeddings of the aligned source word f_{aᵢ} and its neighbours f_{aᵢ−1}, f_{aᵢ+1} (matrix D), followed by tanh and softmax layers (W, V).]
Devlin et al. (2014)
Summary of neural LMs
• Two problems in standard statistical models
• We don’t condition on enough stuff
• We don’t know what features to use when we condition on lots of structure
• Punchline: Neural networks let us condition on a lot of stuff without an exponential growth in parameters
LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)
        = ∏_{i=1}^{|e|} p(e_i | e_{i−1}, …, e_1) × ∏_{j=1}^{|f|} p(f_j | f_{j−1}, …, f_1, e)

The second factor is a conditional language model.
WHAT IF I TOLD YOU
YOU CAN ENCODE A SENTENCE IN A FIXED-DIMENSIONAL VECTOR