Speech and Language Processing, Lecture 5
Neural network based acoustic and language models
Information and Communications Engineering Course
Takahiro Shinozaki
2019/11/5
Lecture Plan (Shinozaki’s part)
1. 10/18 (remote) Speech recognition based on GMM, HMM, and N-gram
2. 10/25 (remote) Maximum likelihood estimation and EM algorithm
3. 11/4 (@TAIST) Bayesian network and Bayesian inference
4. 11/4 (@TAIST) Variational inference and sampling
5. 11/5 (@TAIST) Neural network based acoustic and language models
6. 11/5 (@TAIST) Weighted finite state transducer (WFST) and speech decoding
I give the first six lectures, which are about speech recognition. Through these lectures, the backbone of the latest speech recognition techniques is explained.
Today’s Topic
• Answers for the previous exercises
• Neural network based acoustic and language models
Answers for the Previous Exercises
Exercise 4.1
• When p(x) and y = f(x) are given as follows, obtain distribution q(y)
p(x) = 1, x ∈ (0, 1), y = -log(1 - x)
x = 1 - exp(-y), dx/dy = exp(-y)
x = 0 → y = 0, x = 1 → y = ∞
q(y) = p(x) |dx/dy| = exp(-y)
(Figure: histogram of x and histogram of y.)
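As a quick numerical sanity check of this change of variables, here is a small Monte Carlo sketch (assuming NumPy; the sample size and seed are arbitrary):

```python
import numpy as np

# Exercise 4.1: x ~ Uniform(0, 1), y = -log(1 - x)  =>  q(y) = exp(-y) for y > 0.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = -np.log(1.0 - x)

# The empirical CDF at y = 1 should match the analytic value 1 - exp(-1) ≈ 0.632.
print("empirical P(y <= 1):", np.mean(y <= 1.0))
print("analytic  P(y <= 1):", 1.0 - np.exp(-1.0))
```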
Exercise 4.2
• When p(x) and y = f(x) are given as follows, obtain distribution q(y)
p(x) = N(x|0, 1) = (1/√(2π)) exp(-x²/2), x ∈ (-∞, ∞), y = 3x + 4
x = (y - 4)/3, dx/dy = 1/3
q(y) = p(x) |dx/dy| = (1/(3√(2π))) exp(-(1/2)((y - 4)/3)²) = N(y|4, 3²)
(Figure: histogram of x and histogram of y.)
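The same kind of Monte Carlo check works for this affine transform of a Gaussian (again a sketch assuming NumPy; values are illustrative):

```python
import numpy as np

# Exercise 4.2: x ~ N(0, 1), y = 3x + 4  =>  q(y) = N(y | 4, 3^2).
rng = np.random.default_rng(0)
y = 3.0 * rng.standard_normal(1_000_000) + 4.0
print("empirical mean and variance:", y.mean(), y.var())  # should be close to 4 and 9
```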
Exercise 4.3
• Show that N(xA|xB, 1) = N(xB|xA,1), where N(x|m,v) is the Gaussian distribution with mean m and variance v
N(x_A|x_B, 1) = (1/√(2π)) exp(-(1/2)(x_A - x_B)²)
             = (1/√(2π)) exp(-(1/2)(x_B - x_A)²)
             = N(x_B|x_A, 1)
Neural network
Multi Layer Perceptron (MLP)
• Unit of MLP
• MLP consists of multiple layers of the units
y = h(Σ_i w_i x_i + b)
h: activation function, w: weights, b: bias
(Figure: a single unit with inputs x_1, …, x_n and output y, and an MLP with an input layer, multiple hidden layers, and an output layer producing y_1, …, y_m.)
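A minimal sketch of a single unit in code (assuming NumPy; the weight, bias, and input values are arbitrary illustration values):

```python
import numpy as np

def unit(x, w, b, h):
    """A single MLP unit: y = h(sum_i w_i * x_i + b)."""
    return h(np.dot(w, x) + b)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
print(unit(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 0.1, sigmoid))
```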
Activation Functions
Linear function: h(x) = x
Unit step function: h(x) = 1 if x ≥ 0, 0 otherwise
Sigmoid function: h(x) = 1 / (1 + exp(-x))
Hinge (ReLU) function: h(x) = max{0, x}
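The four activation functions above can be written directly in NumPy (a sketch; the function names are mine):

```python
import numpy as np

def step(x):    return np.where(x >= 0, 1.0, 0.0)  # unit step function
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))    # sigmoid function
def linear(x):  return x                            # linear function
def hinge(x):   return np.maximum(0.0, x)           # hinge (ReLU) function

x = np.array([-2.0, 0.0, 2.0])
for h in (step, sigmoid, linear, hinge):
    print(h.__name__, h(x))
```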
Softmax Function
• For N variables z_i, the softmax function is:
  h(z_i) = exp(z_i) / Σ_j exp(z_j)
• Properties of softmax
  • Positive: 0 < h(z_i)
  • Sum is one: Σ_{i=1}^N h(z_i) = 1.0
  → Expresses a probability distribution
• Example
  Z = (z_1, z_2, z_3) = (-1, 2, 1): (h(z_1), h(z_2), h(z_3)) = (0.0351, 0.7054, 0.2595)
  Z = (z_1, z_2, z_3) = (12, 8, 16): (h(z_1), h(z_2), h(z_3)) = (0.0180, 0.0003, 0.9817)
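The two examples above can be reproduced with a few lines of NumPy (a sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([-1.0, 2.0, 1.0])))         # ≈ [0.0351, 0.7054, 0.2595]
print(softmax(np.array([12.0, 8.0, 16.0])))        # ≈ [0.0180, 0.0003, 0.9817]
print(softmax(np.array([12.0, 8.0, 16.0])).sum())  # 1.0
```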
Exercise 5.1
• Let h be a softmax function having inputs z1, z2,…,zN.
• Prove that Σ_{i=1}^N h(z_i) = 1, where h(z_i) = exp(z_i) / Σ_j exp(z_j)

Answer:
Σ_{i=1}^N h(z_i) = Σ_{i=1}^N exp(z_i) / Σ_j exp(z_j) = (Σ_{i=1}^N exp(z_i)) / (Σ_j exp(z_j)) = 1
Forward Propagation
• Compute the output of MLP step by step from the input layer to the output layer
(Figure: an input vector is fed to the network and propagated through, e.g., a sigmoid layer, another sigmoid layer, and a softmax output layer.)
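A layer-by-layer forward pass can be sketched as follows (assuming NumPy; the layer sizes and random weights are only for illustration):

```python
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))
def softmax(a): e = np.exp(a - a.max()); return e / e.sum()

def forward(x, layers):
    """Compute the MLP output step by step from the input layer to the output layer."""
    y = x
    for W, b, h in layers:       # each layer: weight matrix, bias vector, activation
        y = h(W @ y + b)
    return y

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4), sigmoid),   # sigmoid layer
          (rng.normal(size=(4, 4)), np.zeros(4), sigmoid),   # sigmoid layer
          (rng.normal(size=(2, 4)), np.zeros(2), softmax)]   # softmax output layer
print(forward(np.array([0.1, 0.2, 0.3]), layers))
```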
Parameters of Neural Network
• The weights and bias of each unit need to be trained before the network is used
y = h(Σ_i w_i x_i + b) = h(w · x)
h: activation function
w: weight vector, w = (w_1, w_2, …, w_N, b)
x: input vector, x = (x_1, x_2, …, x_N, 1)
(Figure: a single unit with inputs x_1, …, x_N, weights w_1, …, w_N, bias b, and output y.)
The bias b can be regarded as one of the weights whose input takes a constant value 1.0
Principle of NN Training
(Figure: for each sample in the training set, the input vector is fed to the MLP, the output by the MLP is compared with the reference output vector, and the parameters of the MLP are adjusted so as to minimize the error.)
Definitions of Errors
• Sum of squares error
  • Used when the output layer uses linear functions
  E(W) = (1/2) Σ_n ||y(X_n, W) - t_n||²
  W: set of weights in the MLP
  X_n: vector of a training sample (input)
  t_n: vector of a training sample (output)
  n: index of training samples
• Cross-entropy
  • Used when the output layer is a softmax
  E(W) = -Σ_n Σ_k t_nk ln y_k(X_n, W)
  t_nk: reference output (takes 1 if unit k corresponds to the correct output, 0 otherwise)
  k: index of output units
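Both error measures are one-liners in NumPy; here is a sketch where the rows of Y are network outputs and the rows of T are the reference outputs (the values are illustrative):

```python
import numpy as np

def sum_of_squares(Y, T):
    """E(W) = 1/2 * sum_n ||y(X_n, W) - t_n||^2"""
    return 0.5 * np.sum((Y - T) ** 2)

def cross_entropy(Y, T):
    """E(W) = -sum_n sum_k t_nk * ln y_k(X_n, W), with 1-of-K reference outputs."""
    return -np.sum(T * np.log(Y))

Y = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # softmax outputs for two samples
T = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # reference outputs
print(sum_of_squares(Y, T), cross_entropy(Y, T))
```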
Gradient Descent
• An iterative optimization method
• An iterative optimization method
x_{t+1} = x_t - ε ∂f(x)/∂x |_{x = x_t}
ε: learning rate (small positive value)
(Figure: starting from an initial value x_0, the iterates x_1, x_2, …, x_N move step by step toward the minimum of f(x).)
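A minimal gradient-descent loop for a scalar function (a sketch; the example function, learning rate, and step count are illustrative choices):

```python
def gradient_descent(df, x0, lr=0.1, steps=100):
    """x_{t+1} = x_t - lr * df(x_t), with lr a small positive learning rate."""
    x = x0
    for _ in range(steps):
        x = x - lr * df(x)
    return x

# Example: f(x) = (x - 3)^2, so df/dx = 2(x - 3); the minimum is at x = 3.
print(gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0))
```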
MLP Training by Gradient Descent
• Define an error measure E(W) for training samples
• Initialize parameters W = {w_1, w_2, …, w_M}
• Repeatedly update the parameter set using gradient descent

e.g. E(W) = (1/2) Σ_n ||y(X_n, W) - t_n||²
w_i(t+1) = w_i(t) - ε ∂E(W)/∂w_i |_{w_i = w_i(t)}
Chain Rule of Differentiation
z = f(y), y = g(x)
(Figure: x → g → y → f → z)

When x, y, z are scalars:
∂z/∂x = (∂z/∂y)(∂y/∂x)

When x, y, z are vectors, e.g. x = (x_1, x_2, x_3), y = (y_1, y_2), z = (z_1, z_2):
[∂z_i/∂x_j] = [∂z_i/∂y_k] [∂y_k/∂x_j]
i.e. the 2×3 matrix ∂z/∂x is the product of the 2×2 Jacobian matrix ∂z/∂y and the 2×3 Jacobian matrix ∂y/∂x.
The same rule ∂z/∂x = (∂z/∂y)(∂y/∂x) holds using Jacobian matrices.
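The Jacobian form of the chain rule can be checked numerically; the functions g and f below are arbitrary smooth examples with x ∈ R³, y ∈ R², z ∈ R² (a sketch assuming NumPy):

```python
import numpy as np

def g(x):  return np.array([x[0] * x[1], x[1] + x[2]])        # y = g(x): R^3 -> R^2
def f(y):  return np.array([np.sin(y[0]), y[0] * y[1]])       # z = f(y): R^2 -> R^2

def Jg(x): return np.array([[x[1], x[0], 0.0],
                            [0.0,  1.0,  1.0]])               # dy/dx, a 2x3 Jacobian
def Jf(y): return np.array([[np.cos(y[0]), 0.0],
                            [y[1],         y[0]]])            # dz/dy, a 2x2 Jacobian

x = np.array([0.5, -1.0, 2.0])
chain = Jf(g(x)) @ Jg(x)                                      # dz/dx by the chain rule

eps = 1e-6                                                    # finite-difference check
numeric = np.column_stack([(f(g(x + eps * np.eye(3)[j])) - f(g(x))) / eps
                           for j in range(3)])
print(np.allclose(chain, numeric, atol=1e-4))                 # True
```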
When There Are Branches
z = f(y_1, y_2), y_1 = g_1(x), y_2 = g_2(x)
(Figure: x is fed to both g_1 and g_2; their outputs y_1 and y_2 are both fed to f, which outputs z.)

∂z/∂x = (∂z/∂y_1)(∂y_1/∂x) + (∂z/∂y_2)(∂y_2/∂x)

Variations:
• g_1(x) = x:  ∂z/∂x = ∂z/∂y_1 + (∂z/∂y_2)(∂y_2/∂x)
• g_2(x) = C (independent of x):  ∂z/∂x = (∂z/∂y_1)(∂y_1/∂x)
Back Propagation (BP)

Forward computation:
y_1 = f_1(x, w_1), y_2 = f_2(y_1, w_2), y_3 = f_3(y_2, w_3), y_4 = f_4(y_3, w_4), Err = E(y_4, r)
Ex.: y_3 = sigmoid(w_3 · y_2), y_4 = softmax(w_4 · y_3)
(Figure: input x → f_1 → y_1 → f_2 → y_2 → f_3 → y_3 → f_4 → y_4 (output) → Err, with parameters w_1, …, w_4 and reference r.)

Derivatives with respect to the parameters:
∂Err/∂w_4 = (∂Err/∂f_4)(∂f_4/∂w_4)
∂Err/∂w_3 = (∂Err/∂f_3)(∂f_3/∂w_3)
∂Err/∂w_2 = (∂Err/∂f_2)(∂f_2/∂w_2)
∂Err/∂w_1 = (∂Err/∂f_1)(∂f_1/∂w_1)

where ∂Err/∂f_4 is obtained directly from E, and
∂Err/∂f_3 = (∂Err/∂f_4)(∂f_4/∂f_3)
∂Err/∂f_2 = (∂Err/∂f_3)(∂f_3/∂f_2)
∂Err/∂f_1 = (∂Err/∂f_2)(∂f_2/∂f_1)

① Obtain the value of each node by forward propagation
② Obtain the derivatives by backward propagation
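A minimal concrete instance of the two steps above, for a two-layer network with a sigmoid hidden layer, a softmax output layer, and cross-entropy error (a sketch assuming NumPy; all weight values are illustrative):

```python
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))
def softmax(a): e = np.exp(a - a.max()); return e / e.sum()

x  = np.array([0.5, -1.0])                               # input
W1 = np.array([[0.1, 0.4], [-0.3, 0.2], [0.5, -0.1]])    # hidden-layer weights (3x2)
W2 = np.array([[0.2, -0.4, 0.1], [-0.2, 0.3, 0.4]])      # output-layer weights (2x3)
t  = np.array([1.0, 0.0])                                # reference output (1-of-K)

# (1) Forward propagation: keep the value of each node.
h = sigmoid(W1 @ x)
y = softmax(W2 @ h)
Err = -np.sum(t * np.log(y))                             # cross-entropy error

# (2) Backward propagation: pass dErr/d(node) from the output back toward the input.
d_out = y - t                            # dErr/d(pre-softmax) for softmax + cross-entropy
dW2   = np.outer(d_out, h)               # dErr/dW2
d_h   = W2.T @ d_out                     # dErr/dh
dW1   = np.outer(d_h * h * (1.0 - h), x) # dErr/dW1 (sigmoid derivative is h(1-h))
print(Err, dW1.shape, dW2.shape)
```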
Feed-Forward Neural Network
• When the network structure is a DAG, it is called a feed-forward network
• The nodes are ordered in a line so that all connections have the same direction
• The forward/backward propagation can be efficiently applied
(Figure: four nodes ordered 1, 2, 3, 4 with all connections pointing in the same direction.)
Exercise 5.2
When h(y) and y(x) are given as follows, obtain ∂h/∂x
h(y) = 1 / (1 + exp(-y)), y = ax + b

Answer:
∂h/∂x = (∂h/∂y)(∂y/∂x)
      = [exp(-y) / (1 + exp(-y))²] · a
      = a exp(-(ax + b)) / (1 + exp(-(ax + b)))²
      = a h(ax + b) (1 - h(ax + b))
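The result of Exercise 5.2 can be verified against a finite-difference approximation (a sketch; a, b, and x are arbitrary test values):

```python
import numpy as np

def h(y): return 1.0 / (1.0 + np.exp(-y))

a, b, x = 2.0, -1.0, 0.3
analytic = a * h(a * x + b) * (1.0 - h(a * x + b))        # a * h(ax+b) * (1 - h(ax+b))
eps = 1e-6
numeric = (h(a * (x + eps) + b) - h(a * x + b)) / eps     # finite-difference check
print(analytic, numeric)                                  # the two values should agree
```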
Recurrent Neural Network (RNN)
• Neural network having a feedback
• Expected to have more powerful modeling performance than a feed-forward MLP, but the training is more difficult
(Figure: an MLP with input layer, hidden layers, and output layer, where a delayed copy of a hidden-layer output is fed back as an additional input.)
Unfolding of RNN to Time Axis
(Figure: the recurrent network with its delay element D is unfolded through time, so that each step of the input feature sequence has its own copy of the network, with a reference vector sequence as the training target.)
Training of RNN by BP Through Time (BPTT)
Apply BP to the unfolded network:
• Regard the input sequence x_1, x_2, x_3, x_4 as a single input
• Regard the output sequence y_1, y_2, y_3, y_4 as a single output
(Figure: the unfolded network with inputs x_1…x_4, hidden states h_1…h_4, and outputs y_1…y_4; back-propagation runs backward through the unfolded structure.)
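A sketch of the unfolding itself, for a simple recurrent layer (assuming NumPy; the sizes and random weights are illustrative, and only the forward pass is shown since keeping every h_t is what makes BP through the unfolded network possible):

```python
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))

def rnn_unfold(xs, Wx, Wh, Wy, h0):
    """Unfold a simple RNN through time: h_t = sigmoid(Wx x_t + Wh h_{t-1}), y_t = Wy h_t."""
    h, hs, ys = h0, [], []
    for x in xs:
        h = sigmoid(Wx @ x + Wh @ h)
        hs.append(h)                 # stored hidden states allow BP through time
        ys.append(Wy @ h)
    return hs, ys

rng = np.random.default_rng(0)
xs = [rng.normal(size=2) for _ in range(4)]      # input sequence x_1 ... x_4
hs, ys = rnn_unfold(xs,
                    rng.normal(size=(3, 2)),     # Wx: input -> hidden
                    rng.normal(size=(3, 3)),     # Wh: hidden -> hidden (feedback)
                    rng.normal(size=(2, 3)),     # Wy: hidden -> output
                    np.zeros(3))
print(len(hs), ys[-1])
```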
Long Short-Term Memory (LSTM)
A type of RNN addressing the gradient vanishing problem.
(Figure: the LSTM cell takes the input x_t, the previous output y_{t-1}, and the previous cell state c_{t-1}. Sigmoid layers with affine transforms form the input gate, forget gate, and output gate; a tanh layer with affine transform produces the candidate value. The forget gate multiplies (pointwise) the previous cell state, the input gate multiplies the candidate, and their sum gives the new cell state c_t; the output y_t is the output-gated tanh of c_t. Delay elements feed c_t and y_t back to the next time step.)
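One step of an LSTM cell, following the gate structure described above (a sketch assuming NumPy; the layer sizes, random weights, and the packing of the four affine transforms into dictionaries are my own choices):

```python
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, y_prev, c_prev, W, b):
    """One LSTM time step: input, forget, and output gates plus a tanh layer."""
    z = np.concatenate([x_t, y_prev])       # the affine transforms see x_t and y_{t-1}
    i = sigmoid(W["i"] @ z + b["i"])        # input gate
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate
    o = sigmoid(W["o"] @ z + b["o"])        # output gate
    g = np.tanh(W["g"] @ z + b["g"])        # tanh layer (candidate cell value)
    c_t = f * c_prev + i * g                # new cell state
    y_t = o * np.tanh(c_t)                  # new output
    return y_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 2, 3
W = {k: rng.normal(size=(n_hid, n_in + n_hid)) for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}
y_t, c_t = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
print(y_t, c_t)
```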
Convolutional Neural Network (CNN)
A type of feed-forward neural network with parameter sharing and connection constraints.
A filter is shifted and applied at different positions.
(Figure: an example input sequence (1 3 3 4 2 1 …) is processed in a convolution layer by filters (1), (2), …, (N), each producing an activation map; a pooling layer (producing e.g. 5 4 5) summarizes each activation map and feeds the next convolution layer, etc.)
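A one-dimensional convolution plus max pooling, to make the "shift the filter" idea concrete (a sketch; the filter values are not from the slide, and the input is the example sequence shown in the figure):

```python
import numpy as np

def conv1d(x, w):
    """Shift the filter w over the input x and apply it at every position (no padding)."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def max_pool(a, size):
    """Take the maximum over non-overlapping windows of an activation map."""
    return np.array([a[i:i + size].max() for i in range(0, len(a) - size + 1, size)])

x = np.array([1.0, 3.0, 3.0, 4.0, 2.0, 1.0])   # example input from the slide
w = np.array([0.5, 1.0, -0.5])                 # an illustrative filter (not from the slide)
print(conv1d(x, w))
print(max_pool(conv1d(x, w), 2))
```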
Neural network based acoustic model
Frame Level Vowel Recognition Using MLP
(Figure: an MLP with sigmoid hidden layers and a softmax output layer. The input is the speech feature vector of a frame (e.g. MFCC), and the outputs are the posterior probabilities of the five Japanese vowels p(a), p(i), p(u), p(e), p(o), e.g. 0.1, 0.4, 0.2, 0.15, 0.15.)
Exercise 5.3
• Obtain the recognition result (yes or no). You may use a calculator.
(Figure: a small network with two inputs taking the values 2.5 and -4.0, sigmoid hidden units, and a softmax output layer giving P(yes) and P(no); the connection weights and biases shown in the figure are 1.5, -2, 1, -1, 2, -2, -2.5, and 3.)
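The recipe for solving this by hand translates directly into code: apply the sigmoid units, then the softmax, then compare P(yes) and P(no). The weight matrices below are placeholders with an assumed layout; the actual assignment of the values 1.5, -2, 1, -1, 2, -2, -2.5, 3 (and the inputs 2.5 and -4.0) to connections must be read off the slide figure.

```python
import numpy as np

def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))
def softmax(a): e = np.exp(a - a.max()); return e / e.sum()

x = np.array([2.5, -4.0])                                 # inputs from the figure
# Placeholder layout: replace with the weights/biases as drawn on the slide.
W_hid, b_hid = np.array([[1.5, -2.0], [1.0, -1.0]]), np.array([2.0, -2.0])
W_out, b_out = np.array([[-2.5, 3.0], [3.0, -2.5]]), np.zeros(2)

p = softmax(W_out @ sigmoid(W_hid @ x + b_hid) + b_out)
print("P(yes) =", p[0], " P(no) =", p[1], " ->", "yes" if p[0] > p[1] else "no")
```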
Combination of HMM and MLP
(Figure: two HMMs with states s0, s1, s2, s3, s4: a GMM-HMM, where each state has its own GMM, and an MLP-HMM, where the state posteriors come from the softmax layer of an MLP.)
GMM-HMM: p(X|s) = GMM_s(X)
MLP-HMM: p(X|s) ∝ p(s|X) / p(s) = MLP(s|X) / p(s)
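In decoding, the MLP posteriors are turned into (scaled) likelihoods by dividing by the state priors, as in the formula above; a tiny sketch with made-up numbers:

```python
import numpy as np

# p(X|s) ∝ p(s|X) / p(s): MLP posteriors divided by state priors (illustrative numbers).
posterior = np.array([0.70, 0.20, 0.10])   # MLP(s|X) over the HMM states
prior     = np.array([0.50, 0.30, 0.20])   # state priors, e.g. estimated from alignments
scaled_likelihood = posterior / prior       # used in place of the GMM likelihood GMM_s(X)
print(scaled_likelihood)
```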
MLP-HMM based Phone Recognizer
(Figure: an MLP with sigmoid hidden layers and a softmax output layer computes the state posteriors from the input speech features; the phone HMMs, e.g. /a/, /i/, …, /N/, are connected between a start node and an end node to form the recognizer.)
Neural network based language model
Word Vector
• One-of-K representation of a word for a fixed vocabulary
word        ID   1-of-K
Apple        1   <1,0,0,0,0,0,0>
Banana       2   <0,1,0,0,0,0,0>
Cherry       3   <0,0,1,0,0,0,0>
Durian       4   <0,0,0,1,0,0,0>
Orange       5   <0,0,0,0,1,0,0>
Pineapple    6   <0,0,0,0,0,1,0>
Strawberry   7   <0,0,0,0,0,0,1>
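Constructing the 1-of-K vector is straightforward; a sketch with the vocabulary from the table:

```python
import numpy as np

vocab = ["Apple", "Banana", "Cherry", "Durian", "Orange", "Pineapple", "Strawberry"]

def one_of_k(word):
    """1-of-K (one-hot) representation of a word for the fixed vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_of_k("Cherry"))   # [0. 0. 1. 0. 0. 0. 0.]
```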
Word Prediction Using RNN
(Figure: the 1-of-K vector of word_{t-1}, e.g. <0, 0, 0, 1, 0, 0, 0>, is fed to the RNN (with delay element D), which outputs a distribution over word_t, e.g. <0.02, 0.65, 0.14, 0.11, 0.05, 0.01, 0.02>.)
RNN Language Model (Unfolded)
(Figure: the RNN is unfolded over the word sequence <s> Delicious Big Red Apple </s>, and the per-step predictions are combined to give P(<s>, Delicious, Big, Red, Apple, </s>).)
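The sentence probability is the product of the per-step softmax probabilities of each word given its predecessors. The sketch below uses an untrained RNN with random weights purely to show the bookkeeping; a trained model would supply Wx, Wh, Wy:

```python
import numpy as np

def softmax(a): e = np.exp(a - a.max()); return e / e.sum()

vocab = ["<s>", "Delicious", "Big", "Red", "Apple", "</s>"]
rng = np.random.default_rng(0)
Wx, Wh, Wy = (rng.normal(size=(8, len(vocab))),
              rng.normal(size=(8, 8)),
              rng.normal(size=(len(vocab), 8)))

def rnn_step(word_id, h):
    x = np.eye(len(vocab))[word_id]       # 1-of-K input for the previous word
    h = np.tanh(Wx @ x + Wh @ h)
    return softmax(Wy @ h), h             # distribution over the next word, new state

sentence = ["<s>", "Delicious", "Big", "Red", "Apple", "</s>"]
h, log_p = np.zeros(8), 0.0
for prev, nxt in zip(sentence[:-1], sentence[1:]):
    dist, h = rnn_step(vocab.index(prev), h)
    log_p += np.log(dist[vocab.index(nxt)])
print("log P(sentence) =", log_p)
```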
Dialogue System Using Seq2Seq Network
(Figure: the encoder network reads the input word sequence "What is your name", and the decoder network, starting from <s>, generates the output "My name is TS-800 </s>" word by word, sampling each word from the posterior.)
Evolution of Compute Hardware
2002: Earth Simulator, 40.96 TFLOPS
2017: GeForce GTX 1080 Ti, 10.609 TFLOPS, 699 USD
(Pictures are from Wikipedia and Nvidia.com.)