
Page 1: Lecture 5 - 東京工業大学 · Speech and Language Processing Lecture 5: Neural network based acoustic and language models. Information and Communications Engineering Course. Takahiro Shinozaki.

Speech and Language Processing, Lecture 5

Neural network based acoustic and language models

Information and Communications Engineering Course

Takahiro Shinozaki

2019/11/5

Page 2:

Lecture Plan (Shinozaki’s part)

1. 10/18 (remote) Speech recognition based on GMM, HMM, and N-gram

2. 10/25 (remote) Maximum likelihood estimation and EM algorithm

3. 11/4 (@TAIST) Bayesian network and Bayesian inference

4. 11/4 (@TAIST) Variational inference and sampling

5. 11/5 (@TAIST) Neural network based acoustic and language models

6. 11/5 (@TAIST) Weighted finite state transducer (WFST) and speech decoding

I give the first six lectures, about speech recognition. Through these lectures, the backbone of the latest speech recognition techniques is explained.

Page 3:

Today’s Topic

• Answers for the previous exercises
• Neural network based acoustic and language models

Page 4:

Answers for the Previous Exercises


Page 5:

Exercise 4.1

• When p(x) and y = f(x) are given as follows, obtain distribution q(y)

p(x) = 1, x ∈ (0, 1),   y = −log(1 − x)

x = 1 − exp(−y),   dx/dy = exp(−y)

x = 0 → y = 0,   x = 1 → y = ∞

q(y) = p(x) dx/dy = exp(−y)

[Figure: histogram of x (# of occurrences, uniform on (0, 1)) and histogram of y (# of occurrences, exponentially decaying)]

Page 6:

Exercise 4.2

• When p(x) and y = f(x) are given as follows, obtain distribution q(y)

p(x) = (1/√(2π)) exp(−x²/2) = N(x|0, 1),   x ∈ (−∞, ∞),   y = 3x + 4

x = (y − 4)/3,   dx/dy = 1/3

q(y) = p(x) dx/dy = (1/(3√(2π))) exp(−(1/2)((y − 4)/3)²) = N(y|4, 3²)

[Figure: histogram of x (# of occurrences, standard normal) and histogram of y (# of occurrences, normal centered at 4)]

Page 7:

Exercise 4.3

• Show that N(xA|xB, 1) = N(xB|xA,1), where N(x|m,v) is the Gaussian distribution with mean m and variance v

N(x_A|x_B, 1) = (1/√(2π)) exp(−(1/2)(x_A − x_B)²)
              = (1/√(2π)) exp(−(1/2)(x_B − x_A)²)
              = N(x_B|x_A, 1)

Page 8:

Neural network


Page 9:

Multi Layer Perceptron (MLP)

• Unit of MLP

• MLP consists of multiple layers of the units

y = h( Σ_i w_i x_i + b )

h: activation function,  w: weight,  b: bias

[Figure: a single unit with inputs x_1, x_2, …, x_i and output y; an MLP with an input layer (x_1, x_2, …, x_n), multiple hidden layers, and an output layer (y_1, …, y_m)]

Page 10:

Activation Functions

Unit step function:    h(x) = 1 if 0 ≤ x, 0 otherwise

Sigmoid function:      h(x) = 1 / (1 + exp(−x))

Linear function:       h(x) = x

Hinge (ReLU) function: h(x) = max{0, x}
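These activation functions can be written directly in code; a minimal sketch in plain Python (max{0, x} is also widely known as ReLU):

```python
import math

def unit_step(x):
    # 1 if 0 <= x, 0 otherwise
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    # 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    return x

def relu(x):
    # the "hinge" function max{0, x}
    return max(0.0, x)
```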

Page 11:

Softmax Function

• For N variables z_i, the softmax function is:

  h(z_i) = exp(z_i) / Σ_j exp(z_j)

• Properties of softmax
  • Positive: 0 < h(z_i)
  • Sum is one: Σ_{i=1}^{N} h(z_i) = 1.0
  → Expresses a probability distribution

• Example
  Z = (z_1, z_2, z_3) = (−1, 2, 1):   h(Z) = (h(z_1), h(z_2), h(z_3)) = (0.0351, 0.7054, 0.2595)
  Z = (z_1, z_2, z_3) = (12, 8, 16):  h(Z) = (h(z_1), h(z_2), h(z_3)) = (0.0180, 0.0003, 0.9817)
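The softmax values in the example can be checked with a short sketch (subtracting the maximum before exponentiating is a standard trick that avoids overflow without changing the result):

```python
import math

def softmax(zs):
    # h(z_i) = exp(z_i) / sum_j exp(z_j)
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

p = softmax([-1, 2, 1])    # -> (0.0351, 0.7054, 0.2595) as on the slide
q = softmax([12, 8, 16])   # -> (0.0180, 0.0003, 0.9817)
```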

Page 12:

Exercise 5.1

• Let h be a softmax function having inputs z1, z2,…,zN.

• For  h(z_i) = exp(z_i) / Σ_j exp(z_j),  prove that  Σ_{i=1}^{N} h(z_i) = 1.0

Σ_{i=1}^{N} h(z_i) = Σ_{i=1}^{N} exp(z_i) / Σ_j exp(z_j) = ( Σ_{i=1}^{N} exp(z_i) ) / ( Σ_j exp(z_j) ) = 1

Page 13:

Forward Propagation

• Compute the output of MLP step by step from the input layer to the output layer

[Figure: an input vector passes through stacked layers (e.g. sigmoid layer → sigmoid layer → softmax layer) from the input layer to the output layer]
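As an illustration, forward propagation through two sigmoid layers and a softmax layer can be sketched as follows; the tiny 2-2-2 network and all of its weights are made up for the example:

```python
import math

def sigmoid_layer(W, b, x):
    # each row of W holds one unit's weights; output is h(w . x + b) per unit
    return [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + bi)))
            for row, bi in zip(W, b)]

def softmax_layer(W, b, x):
    zs = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# forward propagation, step by step from the input layer to the output layer
x = [1.0, -1.0]
h1 = sigmoid_layer([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], x)
y = softmax_layer([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], h1)
# y is a probability distribution over the output units
```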

Page 14:

Parameters of Neural Network

• The weights and a bias of each unit need training before the network is used

y = h( Σ_i w_i x_i + b ) = h(w · x)

h: activation function
w: weight vector, w = (w_1, w_2, …, w_N, b)
x: input vector, x = (x_1, x_2, …, x_N, 1)

The bias b can be regarded as one of the weights whose input takes a constant value 1.0

[Figure: a unit with inputs x_1, x_2, …, x_N, weights w_1, w_2, …, w_N, bias b, and output y]
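The remark about the bias can be checked numerically: appending the constant input 1.0 to x and the bias b to w gives the same unit output (the weights below are arbitrary example values):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def unit(ws, xs, b):
    # y = h(sum_i w_i x_i + b)
    return sigmoid(sum(w * x for w, x in zip(ws, xs)) + b)

def unit_bias_as_weight(ws, xs):
    # w = (w_1..w_N, b), x = (x_1..x_N, 1): the bias becomes an ordinary weight
    return sigmoid(sum(w * x for w, x in zip(ws, xs)))

ws, b, xs = [0.5, -1.0], 0.3, [2.0, 1.0]
y1 = unit(ws, xs, b)
y2 = unit_bias_as_weight(ws + [b], xs + [1.0])
# y1 and y2 are identical
```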

Page 15:

Principle of NN Training


[Figure: a training set supplies input vectors and reference output vectors; the output by the MLP is compared with the reference, and the parameters of the MLP are adjusted so as to minimize the error]

Page 16:

Definitions of Errors

• Sum of square error
  • Used when the output layer uses linear functions

  E(W) = (1/2) Σ_n || y(X_n, W) − t_n ||²

  W: set of weights in the MLP
  X_n: vector of a training sample (input)
  t_n: vector of a training sample (output)
  n: index of training samples

• Cross-entropy
  • Used when the output layer is a softmax

  E(W) = −Σ_n Σ_k t_nk ln y_k(X_n, W)

  t_nk: reference output (takes 1 if unit k corresponds to the correct output, 0 otherwise)
  k: index of output units
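Both error measures can be sketched in a few lines of Python; the sample output and one-hot reference below are made-up numbers:

```python
import math

def sum_square_error(ys, ts):
    # E(W) = 1/2 * sum_n ||y_n - t_n||^2
    return 0.5 * sum((y - t) ** 2
                     for yn, tn in zip(ys, ts)
                     for y, t in zip(yn, tn))

def cross_entropy(ys, ts):
    # E(W) = -sum_n sum_k t_nk * ln y_nk  (t_n is one-hot)
    return -sum(t * math.log(y)
                for yn, tn in zip(ys, ts)
                for y, t in zip(yn, tn))

ys = [[0.7, 0.2, 0.1]]   # network output for one sample
ts = [[1.0, 0.0, 0.0]]   # one-hot reference
sse = sum_square_error(ys, ts)   # 0.5 * (0.09 + 0.04 + 0.01) = 0.07
ce = cross_entropy(ys, ts)       # -ln 0.7, about 0.357
```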

Page 17:

Gradient Descent

• An iterative optimization method

x_{t+1} = x_t − ε ∂f(x)/∂x |_{x = x_t}

ε: learning rate (small positive value)

[Figure: a curve f(x) with iterates x_0 (initial value), x_1, x_2, …, x_N descending toward the minimum along the gradient ∂f(x)/∂x |_{x = x_0}]
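A minimal sketch of the update rule, applied to the example function f(x) = (x − 3)², whose derivative is 2(x − 3):

```python
def gradient_descent(df, x0, eps, steps):
    # x_{t+1} = x_t - eps * df(x_t), with eps the learning rate
    x = x0
    for _ in range(steps):
        x = x - eps * df(x)
    return x

# f(x) = (x - 3)^2 has its minimum at x = 3; the iterates converge there
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0, eps=0.1, steps=100)
```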

Page 18:

MLP Training by Gradient Descent

• Define an error measure E(W) for training samples

• Initialize parameters W = {w_1, w_2, …, w_M}
• Repeatedly update the parameter set using gradient descent

  e.g.  E(W) = (1/2) Σ_n || y(X_n, W) − t_n ||²

  w_i(t+1) = w_i(t) − ε ∂E(W)/∂w_i |_{w_i = w_i(t)}

Page 19:

Chain Rule of Differentiation

When z = f(y) and y = g(x)  (x → g → y → f → z):

When x, y, z are scalars:

  ∂z/∂x = (∂z/∂y)(∂y/∂x)

When x, y, z are vectors, e.g. x = (x_1, x_2, x_3), y = (y_1, y_2), z = (z_1, z_2):

  [∂z_1/∂x_1  ∂z_1/∂x_2  ∂z_1/∂x_3]   [∂z_1/∂y_1  ∂z_1/∂y_2] [∂y_1/∂x_1  ∂y_1/∂x_2  ∂y_1/∂x_3]
  [∂z_2/∂x_1  ∂z_2/∂x_2  ∂z_2/∂x_3] = [∂z_2/∂y_1  ∂z_2/∂y_2] [∂y_2/∂x_1  ∂y_2/∂x_2  ∂y_2/∂x_3]

Each factor is a Jacobian matrix: the same rule ∂z/∂x = (∂z/∂y)(∂y/∂x) holds using Jacobian matrices.
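The Jacobian form of the chain rule can be verified numerically; f and g below are arbitrary example maps, and the product of their numerical Jacobians is compared with the Jacobian of the composition:

```python
def g(x):
    # y = g(x): R^3 -> R^2, an arbitrary example
    return [x[0] + x[1], x[1] * x[2]]

def f(y):
    # z = f(y): R^2 -> R^2, an arbitrary example
    return [y[0] * y[1], y[0] + y[1]]

def jacobian(func, v, h=1e-6):
    # numerical Jacobian J[i][j] = d func_i / d v_j (forward difference)
    base = func(v)
    J = []
    for i in range(len(base)):
        row = []
        for j in range(len(v)):
            vp = list(v)
            vp[j] += h
            row.append((func(vp)[i] - base[i]) / h)
        J.append(row)
    return J

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x = [1.0, 2.0, 3.0]
Jzx = jacobian(lambda v: f(g(v)), x)                # dz/dx directly
Jprod = matmul(jacobian(f, g(x)), jacobian(g, x))   # (dz/dy)(dy/dx)
# Jzx and Jprod agree up to finite-difference error
```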

Page 20:

When There Are Branches

z = f(y_1, y_2),   y_1 = g_1(x),   y_2 = g_2(x)

  ∂z/∂x = (∂z/∂y_1)(∂y_1/∂x) + (∂z/∂y_2)(∂y_2/∂x)

Variations:

• g_1(x) = x:

  ∂z/∂x = ∂z/∂y_1 + (∂z/∂y_2)(∂y_2/∂x)

• g_2(x) = C (independent of x):

  ∂z/∂x = (∂z/∂y_1)(∂y_1/∂x)
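A quick numerical check of the branch rule, using the example branches y_1 = x², y_2 = sin(x), and z = y_1 y_2 (made up for illustration):

```python
import math

def z(x):
    y1, y2 = x * x, math.sin(x)   # two branches computed from the same x
    return y1 * y2

def dz_dx(x):
    y1, y2 = x * x, math.sin(x)
    # dz/dx = (dz/dy1)(dy1/dx) + (dz/dy2)(dy2/dx)
    return y2 * 2 * x + y1 * math.cos(x)

x0 = 0.7
analytic = dz_dx(x0)
numeric = (z(x0 + 1e-6) - z(x0 - 1e-6)) / 2e-6   # central difference
# analytic and numeric agree
```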

Page 21:

Back Propagation (BP)

A chain of layers with input x, parameters w_1, …, w_4, and reference output r:

  y_1 = f_1(x, w_1)
  y_2 = f_2(y_1, w_2)
  y_3 = f_3(y_2, w_3)      Ex.: y_3 = sigmoid(w_3 · y_2)
  y_4 = f_4(y_3, w_4)      Ex.: y_4 = softmax(w_4 · y_3)
  Err = E(y_4, r)

① Obtain the value of each node by forward propagation
② Obtain the derivatives by backward propagation:

  ∂Err/∂f_3 = (∂Err/∂f_4)(∂f_4/∂f_3),   ∂Err/∂f_2 = (∂Err/∂f_3)(∂f_3/∂f_2),   ∂Err/∂f_1 = (∂Err/∂f_2)(∂f_2/∂f_1)

  ∂Err/∂w_4 = (∂Err/∂f_4)(∂f_4/∂w_4),   ∂Err/∂w_3 = (∂Err/∂f_3)(∂f_3/∂w_3)
  ∂Err/∂w_2 = (∂Err/∂f_2)(∂f_2/∂w_2),   ∂Err/∂w_1 = (∂Err/∂f_1)(∂f_1/∂w_1)
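The two-pass procedure can be sketched for a minimal chain of two sigmoid units with scalar weights (an illustrative toy case, not the lecture's exact network); the hand-derived backward pass is checked against finite differences:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w1, w2, r):
    # step 1: forward propagation, keeping every node value
    y1 = sigmoid(w1 * x)
    y2 = sigmoid(w2 * y1)
    return y1, y2, 0.5 * (y2 - r) ** 2   # squared-error Err

x, w1, w2, r = 1.5, 0.4, -0.8, 1.0
y1, y2, err = forward(x, w1, w2, r)

# step 2: backward propagation of dErr from the output toward the input
dErr_dy2 = y2 - r
dErr_dw2 = dErr_dy2 * y2 * (1 - y2) * y1        # uses sigmoid' = y(1-y)
dErr_dy1 = dErr_dy2 * y2 * (1 - y2) * w2
dErr_dw1 = dErr_dy1 * y1 * (1 - y1) * x

# finite-difference check of both gradients
h = 1e-6
num_w1 = (forward(x, w1 + h, w2, r)[2] - forward(x, w1 - h, w2, r)[2]) / (2 * h)
num_w2 = (forward(x, w1, w2 + h, r)[2] - forward(x, w1, w2 - h, r)[2]) / (2 * h)
```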

Page 22:

Feed-Forward Neural Network

• When the network structure is a DAG, it is called a feed-forward network
• The nodes can be ordered in a line so that all connections have the same direction
• The forward/backward propagation can be efficiently applied

[Figure: four nodes ordered 1, 2, 3, 4 in a line, with all connections pointing in the same direction]

Page 23:

Exercise 5.2

When h(y) and y(x) are given as follows, obtain ∂h/∂x:

  h(y) = 1 / (1 + exp(−y)),   y = ax + b

  ∂h/∂x = (∂h/∂y)(∂y/∂x)
        = [ exp(−y) / (1 + exp(−y))² ] · a
        = a exp(−(ax + b)) / (1 + exp(−(ax + b)))²
        = a h(ax + b)(1 − h(ax + b))
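The closed form a·h(ax + b)(1 − h(ax + b)) can be confirmed against a central finite difference (a, b, x below are arbitrary test values):

```python
import math

def h(y):
    # sigmoid
    return 1.0 / (1.0 + math.exp(-y))

a, b, x = 2.0, -0.5, 0.3
analytic = a * h(a * x + b) * (1 - h(a * x + b))
numeric = (h(a * (x + 1e-6) + b) - h(a * (x - 1e-6) + b)) / 2e-6
# analytic and numeric agree
```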

Page 24:

Recurrent Neural Network (RNN)

• A neural network having a feedback connection
• Expected to have more powerful modeling ability than a feed-forward MLP, but its training is more difficult

[Figure: an MLP with an input layer, hidden layers, and an output layer, where a hidden-layer output is fed back to the input side through a delay element]

Page 25:

Unfolding of RNN to Time Axis

[Figure: the RNN's feedback loop (delay element D) is unfolded through time into one copy of the network per time step; the input feature sequence is fed along the time axis and compared against the reference vector sequence]

Page 26:

Training of RNN by BP Through Time (BPTT)

• Regard the input sequence (x_1, x_2, x_3, x_4) as a single input and the output sequence (y_1, y_2, y_3, y_4) as a single output
• Apply BP to the unfolded network

[Figure: the unfolded RNN with inputs x_1 … x_4, hidden states h_1 … h_4, and outputs y_1 … y_4; back-propagation runs backward through the unfolded connections]

Page 27:

Long Short-Term Memory (LSTM)

A type of RNN addressing the gradient vanishing problem.

[Figure: an LSTM cell with input x_t, output y_t, and cell state c_t; c_{t−1} and y_{t−1} are fed back through delay elements. Inside the cell, a forget gate (σ), an input gate (σ) paired with a tanh layer, and an output gate (σ) paired with tanh control the cell state through pointwise multiplication and sum. σ: sigmoid layer with affine transform; tanh: tanh layer with affine transform]

Page 28:

Convolutional Neural Network (CNN)

A type of feed-forward neural network with parameter sharing and connection constraints.

• Convolution layer: each filter (1), (2), …, (N) is shifted and applied at different positions of the input, producing one activation map per filter
• Pooling layer: each activation map is downsampled, and the result is passed to the next convolution layer, etc.

[Figure: input sequence 1 3 3 4 2 1 3 5 2 1 3 5; filters produce activation maps; pooling yields 5 4 5]

Page 29:

Neural network based acoustic model


Page 30:

Frame Level Vowel Recognition Using MLP

[Figure: an MLP for frame-level vowel recognition. Input: speech feature vector (e.g. MFCC); two sigmoid hidden layers; a softmax output layer giving p(あ), p(い), p(う), p(え), p(お) = 0.1, 0.4, 0.2, 0.15, 0.15]

Page 31:

Exercise 5.3

Obtain the recognition result (yes or no). You may use a calculator.

[Figure: a small MLP with two inputs (2.5 and −4.0), a sigmoid hidden layer, and a softmax output layer giving P(yes) and P(no); the connection weights shown in the figure are 1.5, −2, 1, −1, 2, −2, −2.5, 3]

Page 32:

Combination of HMM and MLP

GMM-HMM: an HMM with states s_0 … s_4 whose state output distributions are GMMs:

  p(X|s) = GMM_s(X)

MLP-HMM: the same HMM topology, but the state output probabilities come from the softmax layer of an MLP:

  p(X|s) ∝ p(s|X) / p(s) = MLP_s(X) / p(s)
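The conversion from MLP posteriors to scaled likelihoods is one division per state; the posterior and prior values below are hypothetical:

```python
# Hypothetical per-frame posteriors from the MLP's softmax layer,
# and hypothetical state priors p(s).
posterior = {"s1": 0.7, "s2": 0.2, "s3": 0.1}   # p(s|X) from the MLP
prior     = {"s1": 0.5, "s2": 0.3, "s3": 0.2}   # p(s)

# scaled likelihood p(X|s) proportional to p(s|X) / p(s),
# used in place of the GMM output probability
scaled = {s: posterior[s] / prior[s] for s in posterior}
```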

Page 33:

MLP-HMM based Phone Recognizer

[Figure: the input speech feature is fed to an MLP (sigmoid, sigmoid, and softmax layers); the softmax outputs provide the state output probabilities of concatenated phone HMMs /a/, /i/, …, /N/, decoded from Start to End]

Page 34:

Neural network based language model


Page 35:

Word Vector

• One-of-K representation of a word for a fixed vocabulary

word       | ID | 1-of-K
-----------|----|----------------
Apple      | 1  | <1,0,0,0,0,0,0>
Banana     | 2  | <0,1,0,0,0,0,0>
Cherry     | 3  | <0,0,1,0,0,0,0>
Durian     | 4  | <0,0,0,1,0,0,0>
Orange     | 5  | <0,0,0,0,1,0,0>
Pineapple  | 6  | <0,0,0,0,0,1,0>
Strawberry | 7  | <0,0,0,0,0,0,1>
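The 1-of-K mapping in the table can be sketched as:

```python
vocab = ["Apple", "Banana", "Cherry", "Durian", "Orange",
         "Pineapple", "Strawberry"]

def one_of_k(word):
    # 1-of-K vector: 1 at the word's position in the vocabulary, 0 elsewhere
    return [1 if w == word else 0 for w in vocab]

one_of_k("Cherry")   # [0, 0, 1, 0, 0, 0, 0]
```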

Page 36:

Word Prediction Using RNN

[Figure: an RNN with delay element D predicting the next word. Input: the 1-of-K vector of Word_{t−1}, e.g. <0, 0, 0, 1, 0, 0, 0>; output: a distribution over Word_t, e.g. <0.02, 0.65, 0.14, 0.11, 0.05, 0.01, 0.02>]

Page 37:

RNN Language Model (Unfolded)

[Figure: the unfolded RNN language model reads <s>, Delicious, Big, Red, Apple and predicts the following word (ending with </s>) at each step, giving P(<s>, Delicious, Big, Red, Apple, </s>)]

Page 38:

Dialogue System Using Seq2Seq Network

[Figure: a sequence-to-sequence dialogue system. The encoder network reads the input "What is your name"; the decoder network, starting from <s>, generates the output "My name is TS-800 </s>" by sampling each word from the posterior]

Page 39:

Evolution of Compute Hardware

• 2002: Earth Simulator, 40.96 TFLOPS
• 2017: GeForce GTX 1080 Ti, 10.609 TFLOPS, 699 USD

(Pictures are from Wikipedia and Nvidia.com)