Speech Recognition
speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR (v12).pdf · 2020-03-10 · 79 pages
HUNG-YI LEE 李宏毅

Transcript of Speech Recognition

Page 1:

Speech Recognition
HUNG-YI LEE 李宏毅

Page 2:

Is Speech Recognition Difficult?

I heard this story from Prof. Haizhou Li.

Page 3:

Speech Recognition

speech → [Speech Recognition] → text

Speech: a sequence of vectors (length T, dimension d)

Text: a sequence of tokens (length N, V different tokens)

Usually T > N

Page 4:

Token

Grapheme: the smallest unit of a writing system

one_punch_man → N=13, V=26+?
(the 26 letters of the English alphabet + { _ } (space) + {punctuation marks})

“一”, “拳”, “超”, “人” → N=4, V≈4000
(Chinese does not need a “space” token)

Grapheme tokens are lexicon free!

Phoneme: a unit of sound

one punch man → W AH N  P AH N CH  M AE N

Phonemes require a lexicon that maps words to phonemes:
one ⟶ W AH N
punch ⟶ P AH N CH
man ⟶ M AE N
good ⟶ G UH D
cat ⟶ K AE T
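Not from the slides: a minimal Python sketch of grapheme tokenization for the English example above, assuming a 26-letter-plus-space inventory (punctuation is left out here).

# Minimal grapheme tokenizer sketch: lower-case English letters plus "_" for
# space, as in the "one_punch_man" example above.
GRAPHEMES = list("abcdefghijklmnopqrstuvwxyz") + ["_"]   # V = 27 here

def grapheme_tokenize(text: str) -> list[str]:
    """Map a transcript to a sequence of grapheme tokens."""
    return [("_" if ch == " " else ch) for ch in text.lower() if ch == " " or ch.isalpha()]

tokens = grapheme_tokenize("one punch man")
print(tokens)          # ['o', 'n', 'e', '_', 'p', 'u', 'n', 'c', 'h', '_', 'm', 'a', 'n']
print(len(tokens))     # N = 13
print(len(GRAPHEMES))  # V = 27; adding punctuation marks would enlarge V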

Page 5:

Token

Word:

one punch man → N=3, usually V > 100K

“一拳” “超人” → N=2, V=???

For some languages, V can be too large!

Page 6:

Token

Turkish is an agglutinative language: a single word can run to 70 characters?!

Source of information: http://tkturkey.com/ (土女時代)

Page 7:

Token

Morpheme: the smallest meaningful unit (smaller than a word, larger than a grapheme)

unbreakable → “un” “break” “able”

rekillable → “re” “kill” “able”

What are the morphemes of a language? They can be obtained from linguistic knowledge or from statistics.

Page 8:

Token

Bytes (!): using UTF-8 bytes as tokens, the system can be language independent!

V is always 256

[Li, et al., ICASSP’19]

Page 9:

Token choices in more than 100 papers from INTERSPEECH’19, ICASSP’19, ASRU’19 (chart in the original slide)

Thanks to the teaching assistants for their hard work.

Page 10:

Speech Recognition

speech → [Speech Recognition] → “one punch man” → word embeddings

speech → [Speech Recognition + Translation] → “一拳超人”

speech → [Speech Recognition + Slot filling] →
one ticket to Taipei on March 2nd
NA  NA  NA  <LOC>  NA  <TIME>  <TIME>

speech → [Speech Recognition + Intent classification] → <buy ticket>
(“one ticket to Taipei on March 2nd”)

Page 11:

Acoustic Feature

A frame is a 25ms window of the waveform: 400 sample points at 16KHz, represented as a 39-dim MFCC or an 80-dim filter bank output.

The window shifts by 10ms each time, so 1s → 100 frames.

The utterance becomes a sequence of length T with dimension d.

Digital Speech Processing, Chapter 7: Speech Signal and Front-end Processing
http://ocw.aca.ntu.edu.tw/ntu-ocw/ocw/cou/104S204/7

Page 12:

Acoustic Feature

Waveform →(DFT)→ spectrogram →(filter bank, log)→ filter bank output →(DCT)→ MFCC
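A sketch of this pipeline with librosa, assuming a 16 kHz mono file named audio.wav; the parameter values (25 ms window, 10 ms hop, 80 mel bins, 13 MFCCs) follow the numbers on the previous slide, everything else is an assumption:

# Feature-extraction sketch (assumed input: 16 kHz mono "audio.wav").
import librosa

y, sr = librosa.load("audio.wav", sr=16000)        # waveform, 16 kHz

# 25 ms window = 400 samples, 10 ms hop = 160 samples  ->  about 100 frames per second
n_fft, hop = 400, 160

# waveform -> DFT -> mel filter bank -> log : the 80-dim filter bank feature
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=80)
fbank = librosa.power_to_db(mel)                   # shape: (80, T)

# one more DCT on the log filter bank -> MFCC (13 coefficients here;
# the 39-dim MFCC on the earlier slide adds delta and delta-delta coefficients)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mfcc=13)

print(fbank.shape, mfcc.shape)                     # (d, T): a length-T, dim-d sequence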

Page 13:

Acoustic features used in more than 100 papers from INTERSPEECH’19, ICASSP’19, ASRU’19 (chart in the original slide)

Thanks to the teaching assistants for their hard work.

Page 14:

How much data do we need? (English corpora)

4 hr, 80 hr, 300 hr, 960 hr, 2000 hr, “4096 hr”
[Chiu, et al., ICASSP’18] [Huang, et al., arXiv’19]

For scale, treating each pixel value of an image dataset as one 16kHz audio sample:

MNIST: 28 x 28 x 1 x 60000 = 47,040,000 values ≈ “49 min” of audio
CIFAR-10: 32 x 32 x 3 x 50000 = 153,600,000 values ≈ “2 hr 40 min” of audio

The commercial systems use more than that ……

Page 15:

Two Points of View

Seq-to-seq vs. HMM

Source of image: Prof. Lin-shan Lee (李琳山), 《數位語音處理概論》 (Introduction to Digital Speech Processing)

Page 16:

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 17:

Models used in more than 100 papers from INTERSPEECH’19, ICASSP’19, ASRU’19 (chart in the original slide)

Thanks to the teaching assistants for their hard work.

Page 18:

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

LAS: “Listen” is the encoder, “Spell” is the decoder.
It is the typical seq2seq model with attention.

Page 19:

Listen

Encoder: input acoustic features 𝑥1, 𝑥2, ⋯, 𝑥𝑇; output high-level representations ℎ1, ℎ2, ⋯, ℎ𝑇.

• Extract content information
• Remove speaker variance, remove noise

Page 20:

Listen

The encoder can be an RNN:

𝑥1 𝑥2 𝑥3 𝑥4 → RNN → ℎ1 ℎ2 ℎ3 ℎ4

Page 21:

Listen

The encoder can also be a 1-D CNN (filters sliding along the time axis):

𝑥1 𝑥2 𝑥3 𝑥4 → 1-D CNN → ℎ1 ℎ2 ℎ3 ℎ4

Page 22:

Listen

Stacking 1-D CNN layers (𝑥1 … 𝑥4 → 𝑏1 … 𝑏4 → ……):

• Filters in higher layers can cover a longer span of the sequence
• CNN + RNN is a common combination

Page 23:

Listen

The encoder can also be built from Self-attention Layers:

𝑥1 𝑥2 𝑥3 𝑥4 → Self-attention Layers → ℎ1 ℎ2 ℎ3 ℎ4

[Zeyer, et al., ASRU’19] [Karita, et al., ASRU’19]

Please refer to the ML video recording: https://www.youtube.com/watch?v=ugWDIIOHtPA

Page 24:

Listen – Down Sampling

Neighbouring acoustic features carry largely redundant information and T is very long, so the encoder usually reduces the time resolution.

• Pyramid RNN: each layer combines pairs of consecutive outputs from the layer below, e.g. 𝑥1 𝑥2 𝑥3 𝑥4 → ℎ1 ℎ2 (sketched in code below). [Chan, et al., ICASSP’16]

• Pooling over time: pool neighbouring time steps between RNN layers, again 𝑥1 𝑥2 𝑥3 𝑥4 → ℎ1 ℎ2. [Bahdanau, et al., ICASSP’16]
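A minimal PyTorch sketch of one pyramid layer, assuming the variant that concatenates pairs of consecutive frames before a bidirectional LSTM (the exact aggregation differs between papers):

# One pyramid-RNN layer: halve the time length by concatenating neighbouring frames.
import torch
import torch.nn as nn

class PyramidLSTMLayer(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(input_dim * 2, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, input_dim)
        B, T, D = x.shape
        if T % 2 == 1:                      # drop the last frame if T is odd
            x = x[:, :-1, :]
            T -= 1
        x = x.reshape(B, T // 2, 2 * D)     # concatenate pairs of neighbouring frames
        out, _ = self.lstm(x)               # (batch, T // 2, 2 * hidden_dim)
        return out

feats = torch.randn(4, 100, 80)             # 4 utterances, 100 frames, 80-dim filter bank
h = PyramidLSTMLayer(80, 256)(feats)
print(h.shape)                               # torch.Size([4, 50, 512]): time halved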

Page 25:

Listen – Down Sampling

• Time-delay DNN (TDNN): each filter looks at a wide window but only uses some of the frames inside it; dilated CNN has the same concept. [Peddinti, et al., INTERSPEECH’15]

• Truncated Self-attention: each position attends only within a limited range, not over the whole utterance. [Yeh, et al., arXiv’19]

Page 26:

Attention

Dot-product Attention [Chan, et al., ICASSP’16]

The decoder state 𝑧0 acts like a query (關鍵字, keyword) over the encoder outputs ℎ1 … ℎ4, which act like a database (資料庫). The match function transforms ℎ by 𝑊ℎ and 𝑧 by 𝑊𝑧 and takes their dot product:

α = (𝑊ℎ ℎ) · (𝑊𝑧 𝑧), e.g. match(ℎ1, 𝑧0) = α0^1

Page 27:

Attention

Additive Attention [Chorowski, et al., NIPS’15]

ℎ and 𝑧 are transformed by 𝑊ℎ and 𝑊𝑧, added, passed through tanh, and then through a linear transform 𝑊 to give the score:

α = 𝑊 · tanh(𝑊ℎ ℎ + 𝑊𝑧 𝑧), e.g. match(ℎ1, 𝑧0) = α0^1

Page 28:

Attention

The scores α0^1, α0^2, α0^3, α0^4 are normalised by a softmax into α̂0^1 … α̂0^4 (e.g. 0.5, 0.5, 0.0, 0.0). The context vector is the weighted sum of the encoder outputs:

c0 = Σ_i α̂0^i ℎi = 0.5 ℎ1 + 0.5 ℎ2

c0 is then used as input to the decoder RNN.
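A small PyTorch sketch of the two match functions and the context vector, with assumed dimensions; variable names follow the slides (h, z0, α, α̂, c0):

# Dot-product and additive attention scores, softmax, and the context vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, d, d_z, d_att = 4, 512, 512, 256
h = torch.randn(T, d)        # encoder outputs h1 ... hT
z0 = torch.randn(d_z)        # decoder state

W_h = nn.Linear(d, d_att, bias=False)
W_z = nn.Linear(d_z, d_att, bias=False)
w = nn.Linear(d_att, 1, bias=False)

# Dot-product attention: alpha_0^i = (W_h h^i) . (W_z z0)
alpha_dot = (W_h(h) * W_z(z0)).sum(dim=-1)               # (T,)

# Additive attention: alpha_0^i = w . tanh(W_h h^i + W_z z0)
alpha_add = w(torch.tanh(W_h(h) + W_z(z0))).squeeze(-1)  # (T,)

# Softmax-normalise and take the weighted sum: c0 = sum_i alpha_hat_0^i h^i
a_hat = F.softmax(alpha_dot, dim=-1)
c0 = (a_hat.unsqueeze(-1) * h).sum(dim=0)                # (d,), fed to the decoder RNN
print(a_hat, c0.shape)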

Page 29:

Spell

With context c0 = 0.5 ℎ1 + 0.5 ℎ2 (as above) and 𝑧0 as input, the decoder RNN produces the hidden state 𝑧1, which is mapped to a distribution over all tokens (size V), e.g.

a: 0.0, b: 0.1, c: 0.6, d: 0.1, ……

Taking the max gives the first output token “c” (the target transcription is “cat”).

Page 30:

Spell

The new hidden state 𝑧1 is then used to attend over ℎ1 … ℎ4 again: match(ℎ1, 𝑧1) = α1^1, and so on.

Page 31:

Spell

The new scores α1^1 … α1^4 are softmax-normalised to α̂1^1 … α̂1^4 (e.g. 0.0, 0.3, 0.6, 0.1), giving

c1 = Σ_i α̂1^i ℎi = 0.3 ℎ2 + 0.6 ℎ3 + 0.1 ℎ4

c1 and 𝑧1 feed the decoder RNN to produce 𝑧2, another size-V distribution, and (taking the max) the second token “a”.

Page 32:

Spell

The process repeats: 𝑧2 with context c2 produces “t”, and 𝑧3 with context c3 produces <EOS>, which ends the decoding of “cat”.

Taking the max at every step is greedy decoding; Beam Search is usually used instead.
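A sketch of this per-step decoding loop in its greedy form (beam search replaces the argmax, next slide). All component names (attend, decoder_cell, output_layer, embed) are placeholders, not a specific LAS implementation:

# Greedy "Spell" loop sketch with placeholder components.
import torch

def greedy_spell(h, z0, attend, decoder_cell, output_layer, embed, sos_id, eos_id, max_len=200):
    """h: encoder outputs (T, d); z0: initial decoder state. Returns a list of token ids."""
    tokens, z, prev = [], z0, embed(torch.tensor(sos_id))
    for _ in range(max_len):
        c = attend(h, z)                                   # context vector from attention
        z = decoder_cell(torch.cat([prev, c], dim=-1), z)  # one decoder RNN step
        dist = torch.softmax(output_layer(z), dim=-1)      # distribution over the V tokens
        tok = int(dist.argmax())                           # greedy: take the max
        if tok == eos_id:                                  # <EOS> ends decoding
            break
        tokens.append(tok)
        prev = embed(torch.tensor(tok))                    # fed back as the next input
    return tokens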

Page 33:

Beam Search

Assume there are only two tokens (V=2), A and B. Decoding builds a tree: at every step each hypothesis can be extended with A or B, each with some probability (0.4 / 0.6, 0.9 / 0.1, …).

The red path is Greedy Decoding: always take the locally best token.
The green path is the best one overall, even though its first step is not the locally best choice.
It is not possible to check all the paths …

Page 34:

Beam Search

Keep the B best paths at each step.

B (beam size) = 2 in this example.
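A hedged sketch of beam search over token prefixes; step_fn is an assumed callable that returns log-probabilities of every token given a prefix:

# Beam search sketch: keep the B best partial paths at each step.
import math

def beam_search(step_fn, vocab, eos, beam_size=2, max_len=30):
    beams = [([], 0.0)]                                   # (token prefix, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)                   # dict: token -> log p(token | prefix)
            for tok in vocab:
                candidates.append((prefix + [tok], score + log_probs[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:      # keep only the B best paths
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy usage with V=2 tokens "A"/"B" and fixed probabilities (an assumption):
toy = lambda prefix: {"A": math.log(0.4), "B": math.log(0.6), "<EOS>": math.log(1e-9)}
# beam_search(toy, ["A", "B", "<EOS>"], "<EOS>", beam_size=2, max_len=3)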

Page 35:

Training

Training pair: (audio, “cat”). The decoder attends as before (α̂0 = 0.5, 0.5, 0.0, 0.0, so c0 = 0.5 ℎ1 + 0.5 ℎ2) and 𝑧1 gives a size-V distribution p over all tokens. The training target for the first step is the one-hot vector for “c”: minimise the cross entropy between the output distribution and the one-hot target, i.e. maximise p(“c”).

Page 36:

Training

At the second step, the decoder input is the ground-truth token “c”, not whatever the model actually produced at the first step. The target is the one-hot vector for “a”, again trained with cross entropy, i.e. maximise p(“a”).

Using the ground truth as the decoder input during training is called Teacher Forcing.
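A sketch of one teacher-forcing training step, reusing the placeholder components from the decoding sketch above; y is the ground-truth token id sequence:

# Teacher-forcing training step sketch: cross entropy against one-hot targets.
import torch
import torch.nn.functional as F

def train_step(h, z0, y, attend, decoder_cell, output_layer, embed, sos_id):
    z, prev, loss = z0, embed(torch.tensor(sos_id)), 0.0
    for target in y:                                       # one decoder step per target token
        c = attend(h, z)
        z = decoder_cell(torch.cat([prev, c], dim=-1), z)
        logits = output_layer(z)                           # size-V scores
        loss = loss + F.cross_entropy(logits.unsqueeze(0), torch.tensor([target]))
        prev = embed(torch.tensor(target))                 # teacher forcing: feed the ground truth,
    return loss / len(y)                                   # not the model's own (possibly wrong) output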

Page 37:

Why Teacher Forcing?

Suppose training used the model’s own previous output instead. Early in training the first output is often incorrect (say “x” instead of “c”), so the decoder learns to output “a” after the wrong token “x”. Later, once the first step learns to output “c” correctly, everything learned conditioned on the wrong token is wasted……

Page 38:

Why Teacher Forcing?

With teacher forcing, the second step always sees the correct previous token “c” as input, so learning to output “a” after “c” stays consistent throughout training, no matter what the model itself would have produced at the first step.

Page 39:

Back to Attention

Two ways to use the context vector 𝑐𝑡 [Bahdanau, et al., ICLR’15] [Luong, et al., EMNLP’15]:

• 𝑧𝑡 attends to obtain 𝑐𝑡, and 𝑐𝑡 is fed as input to the next step to produce 𝑧𝑡+1.
• 𝑧𝑡 attends to obtain 𝑐𝑡, and 𝑐𝑡 is used in the same step when generating the output.

Page 40: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR (v12).pdf · 2020-03-10 · Speech Recognition Speech Recognition speech text Speech: a sequence of vector (length

𝑧𝑡

Attention

𝑐𝑡

𝑧𝑡+1

𝑐𝑡

[Chan, et al., ICASSP’16]

+

我全都要!

Page 41:

Back to Attention

When generating the 1st, 2nd, 3rd, … token, nothing forces the attention over ℎ1 … ℎ4 to follow any particular order. For speech recognition, however, we expect the attention to move monotonically from left to right as the tokens are generated.

Page 42:

Location-aware attention [Chorowski, et al., NIPS’15]

When generating the 1st token we obtain attention weights α̂0^1 … α̂0^4. When generating the 2nd token, the match function takes not only ℎ and 𝑧1 but also the processed attention history from the previous step, so the model can learn to move the attention forward.

(The original paper is hard to follow; thanks to classmate 劉浩然 for the explanation.)
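A PyTorch sketch of the idea: the previous attention weights are run through a 1-D convolution and added into the additive match function. Dimensions and the kernel size are assumptions, not the paper's exact settings:

# Location-aware attention sketch: convolved previous weights enter the match function.
import torch
import torch.nn as nn

class LocationAwareAttention(nn.Module):
    def __init__(self, d_h, d_z, d_att, conv_channels=10, kernel_size=31):
        super().__init__()
        self.W_h = nn.Linear(d_h, d_att, bias=False)
        self.W_z = nn.Linear(d_z, d_att, bias=False)
        self.W_f = nn.Linear(conv_channels, d_att, bias=False)
        self.conv = nn.Conv1d(1, conv_channels, kernel_size, padding=kernel_size // 2)
        self.w = nn.Linear(d_att, 1, bias=False)

    def forward(self, h, z, prev_alpha):
        # h: (T, d_h), z: (d_z,), prev_alpha: (T,) attention weights from the previous step
        f = self.conv(prev_alpha.view(1, 1, -1)).squeeze(0).transpose(0, 1)  # (T, conv_channels)
        scores = self.w(torch.tanh(self.W_h(h) + self.W_z(z) + self.W_f(f))).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)
        context = (alpha.unsqueeze(-1) * h).sum(dim=0)
        return context, alpha

T = 50
att = LocationAwareAttention(d_h=512, d_z=512, d_att=256)
c, a = att(torch.randn(T, 512), torch.randn(512), torch.full((T,), 1.0 / T))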

Page 43:

LAS – Does it work?

Early results were not convincing (tables in the original slide): [Chorowski, et al., NIPS’15] on TIMIT, [Lu, et al., INTERSPEECH’15] trained on 300 hours, while a conventional system already reached 10.4% on SWB … [Soltau, et al., ICASSP’14]

Page 44:

LAS – Yes, it works!

With much more training data it works: [Chan, et al., ICASSP’16] used 2000 hours, [Chiu, et al., ICASSP’18] used 12500 hours.

Page 45:

Attention visualisation from [Chan, et al., ICASSP’16] (figure in the original slide); note that location-aware attention is not used here.

Page 46:

Hokkien (閩南語、台語)

(Taiwanese speech) → Taiwanese speech recognition → “母湯”

(Taiwanese speech) → Taiwanese speech recognition + Mandarin translation → “不行”
(Mandarin readers cannot read “母湯”, so the output is translated into Mandarin as well.)

Training data: Taiwanese TV dramas from YouTube (Taiwanese speech with Mandarin subtitles), about 1500 hours.
Then LAS is simply trained on it directly.

Thanks to classmate 李仲翊 for providing the experimental results.

Page 47:

Hokkien (閩南語、台語)

• Background music and sound effects? → Don’t care…
• Speech and subtitles are not aligned? → Don’t care…
• Tâi-lô romanization? → Don’t know it QQ …

Just “hard-train” it end-to-end with deep learning.

Page 48:

Results (accuracy = 62.1%)

你的身體撐不住
沒事你為什麼要請假
要生了嗎 (ground truth: 不會膩嗎)
我有幫廠長拜託 (ground truth: 我拜託廠長了)

Page 49:

Limitation of LAS

• LAS outputs the first token only after listening to the whole input.
• Users expect on-line speech recognition, e.g. 今 天 的 天 氣 非 常 好 appearing word by word while the user is still speaking.

LAS is not the final solution for ASR!

Page 50:

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 51:

CTC

For on-line streaming speech recognition, use a uni-directional RNN as the encoder.

Each encoder output ℎ𝑖 goes through a linear classifier:

token distribution = Softmax(W ℎ𝑖), over the vocabulary (size V)

Page 52:

CTC

In fact the classifier outputs a distribution of size V + 1: the extra token 𝜙 means “output nothing for this frame”.

Page 53:

CTC

• Input T acoustic features, output T tokens (ignoring down sampling)
• The output tokens include 𝜙; to get the final transcription, merge duplicate tokens, then remove 𝜙:

𝜙 𝜙 d d 𝜙 e 𝜙 e 𝜙 p p → d e e p
𝜙 𝜙 d d 𝜙 e e e 𝜙 p p → d e p

好 好 棒 棒 棒 棒 棒 → 好 棒
好 𝜙 棒 𝜙 棒 𝜙 𝜙 → 好 棒 棒
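The collapsing rule as a few lines of Python (ϕ stands for the blank token):

# Merge consecutive duplicates, then drop the blanks.
from itertools import groupby

PHI = "ϕ"  # blank

def ctc_collapse(frames):
    merged = [tok for tok, _ in groupby(frames)]      # merge duplicate tokens
    return [tok for tok in merged if tok != PHI]      # remove the blanks

print(ctc_collapse(list("ϕϕddϕeϕeϕpp")))              # ['d', 'e', 'e', 'p']
print(ctc_collapse(list("ϕϕddϕeeeϕpp")))              # ['d', 'e', 'p']
print(ctc_collapse(["好", "ϕ", "棒", "ϕ", "棒", "ϕ", "ϕ"]))  # ['好', '棒', '棒']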

Page 54:

CTC

Paired training data: (𝑥1 𝑥2 𝑥3 𝑥4, 好棒)

The classifier at every frame is trained with cross entropy, but the label 好棒 is much shorter than T and contains no 𝜙, so we do not know which frame should be labelled with which token.

Page 55:

CTC – Training

Paired training data: (𝑥1 𝑥2 𝑥3 𝑥4, 好棒)

The label is first expanded into a length-T alignment, e.g.
好棒𝜙𝜙, 好𝜙棒𝜙, 好𝜙𝜙棒, 𝜙好棒𝜙, 𝜙好𝜙棒, 𝜙𝜙好棒, 好好棒𝜙, 𝜙好棒棒, 好棒棒棒, …

All of them are used in training! (How?!)
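In practice this summation over alignments is what a CTC loss implementation does with dynamic programming; a usage sketch with torch.nn.CTCLoss (shapes and token ids are assumptions):

# CTC loss over all valid alignments at once.
import torch
import torch.nn as nn

T, B, V = 4, 1, 3            # 4 frames, 1 utterance, vocabulary {好, 棒} plus blank ϕ (index 0)
log_probs = torch.randn(T, B, V, requires_grad=True).log_softmax(dim=-1)  # per-frame classifier outputs
targets = torch.tensor([[1, 2]])                        # "好 棒" as token ids (no ϕ)
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([2])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()              # gradients flow through every alignment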

Page 56:

Does CTC work?

[Graves, et al., ICML’14]: word-level CTC with V = 7K (example outputs in the original slide include fragments such as “terry?”, “diet?” next to words like dietary, nutritionist).

[Sak, et al., INTERSPEECH’15]: one can increase V to obtain better performance.

Page 57:

Does CTC work?

Results from [Bahdanau, et al., ICASSP’16], trained on 80 hours (table in the original slide).

Page 58:

Issue

The CTC “decoder” is weak:
• It only attends to one vector at a time.
• Each output is decided independently of the others.

Assume the first three frames all belong to “c”. If the first frame outputs “c” and the second outputs 𝜙, the third classifier has no idea what happened earlier, yet it must not output “c” again; deciding independently, it may still do so and produce “c 𝜙 c”.

Page 59:

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 60:

RNA: Recurrent Neural Aligner [Sak, et al., INTERSPEECH’17]

The CTC decoder takes one vector as input and outputs one token, independently per frame. RNA adds dependency: the token produced at each step (“c”, then 𝜙, then “a”, then 𝜙, …) is fed back into the decoder for the next step.

But can one input vector map to multiple tokens, for example “th”?

Page 61:

RNN-T

RNN-T can output multiple tokens for one input vector: given ℎ𝑡 it outputs “t”, then (reading the same ℎ𝑡 again) “h”, and so on, until it outputs 𝜙, which means “give me the next frame”; it then moves on to ℎ𝑡+1.

Page 62: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR (v12).pdf · 2020-03-10 · Speech Recognition Speech Recognition speech text Speech: a sequence of vector (length

t

ℎ𝑡

copy

h 𝜙

ℎ𝑡+1

e 𝜙

ℎ𝑡+3

_ 𝜙

ℎ𝑡+2

𝜙

next frame

next frame

next frame

There are T “𝜙” in the output.

next frame

Page 63: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR (v12).pdf · 2020-03-10 · Speech Recognition Speech Recognition speech text Speech: a sequence of vector (length

t

ℎ𝑡

copy

h 𝜙

ℎ𝑡+1

e 𝜙

ℎ𝑡+3

_ 𝜙

ℎ𝑡+2

𝜙

There are T “𝜙” in the output.

𝑥4𝑥3𝑥2𝑥1 , 好棒

𝜙1 好 𝜙2 𝜙3 𝜙4 𝜙5 棒 𝜙6

𝜙1 𝜙2 𝜙3 𝜙4 𝜙5 好棒 𝜙6

All of them are used in training! (How?!)

Page 64: Speech Recognitionspeech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR (v12).pdf · 2020-03-10 · Speech Recognition Speech Recognition speech text Speech: a sequence of vector (length

t

ℎ𝑡

copy

h 𝜙

ℎ𝑡+1

e 𝜙

ℎ𝑡+3

_ 𝜙

ℎ𝑡+2

𝜙

ignore ignore ignore ignore

Page 65:

This token-only RNN acts as a Language Model: it ignores the speech and only considers the output tokens.

Why?
• A language model can be trained on text alone (easy to collect), and text contains no 𝜙.
• Ignoring 𝜙 is critical for the RNN-T training algorithm.
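A conceptual sketch of these pieces for greedy decoding of one frame: a token-only prediction network (the language-model-like RNN), and a joint network producing a distribution over V tokens plus ϕ. Names and sizes are assumptions, not Graves' exact architecture:

# RNN-T pieces: prediction network over tokens only, joint network over V + 1 outputs.
import torch
import torch.nn as nn

V, d_enc, d_pred, d_joint, PHI = 30, 512, 512, 256, 0    # token id 0 plays the role of ϕ

embed = nn.Embedding(V + 1, d_pred)
prediction_net = nn.GRUCell(d_pred, d_pred)              # the "language model" part
joint = nn.Sequential(nn.Linear(d_enc + d_pred, d_joint), nn.Tanh(), nn.Linear(d_joint, V + 1))

def decode_frame(h_t, pred_state, max_tokens_per_frame=5):
    """Greedy decoding for one encoder frame: emit tokens until ϕ is produced."""
    out = []
    for _ in range(max_tokens_per_frame):
        logits = joint(torch.cat([h_t, pred_state], dim=-1))
        tok = int(logits.argmax())
        if tok == PHI:                                    # ϕ: move on to the next frame;
            break                                         # ϕ is NOT fed to the prediction network
        out.append(tok)
        pred_state = prediction_net(embed(torch.tensor([tok])), pred_state)
    return out, pred_state

h_t = torch.randn(1, d_enc)                               # one encoder output frame (batch of 1)
tokens, state = decode_frame(h_t, torch.zeros(1, d_pred))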

Page 66:

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 67:

Neural Transducer

CTC, RNA and RNN-T read one acoustic feature vector at a time. The Neural Transducer instead reads a window of vectors ℎ𝑡 … ℎ𝑡+𝑤 at a time, applies attention over the window, and generates tokens (“c”, “a”, …) until it outputs 𝜙, then moves on to the next window.

Page 68:

Neural Transducer

Window 1 (ℎ1 … ℎ4): attention over the window produces “c a 𝜙”; the 𝜙 asks for the next chunk.
Window 2 (ℎ5 … ℎ8): attention produces “t 𝜙”, and so on.

Page 69:

Neural Transducer

Results from [Jaitly, et al., NIPS’16] (figure in the original slide).

Page 70:

Models to be introduced

• Listen, Attend, and Spell (LAS) [Chorowski, et al., NIPS’15]
• Connectionist Temporal Classification (CTC) [Graves, et al., ICML’06]
• RNN Transducer (RNN-T) [Graves, ICML workshop’12]
• Neural Transducer [Jaitly, et al., NIPS’16]
• Monotonic Chunkwise Attention (MoChA) [Chiu, et al., ICLR’18]

Page 71:

MoChA: Monotonic Chunkwise Attention

The window is placed dynamically. A module similar to attention takes the decoder state 𝑧0 and an encoder output ℎ𝑖 and answers yes/no: “put the window here?” Yes: place the window here; no: shift the window to the right and ask again.

Page 72:

MoChA

Starting from ℎ1 with 𝑧0: here? no → ℎ2: no → ℎ3: no → ℎ4: yes.

Page 73:

MoChA

Once the window is placed, attention over that chunk produces exactly one token (“c” here); MoChA does not output 𝜙. Then 𝑧1 restarts the yes/no scan for the next window position (no, …, yes).

Page 74:

MoChA

The next window produces the next token “a” from 𝑧1, then 𝑧2 continues, and so on: one token per window, never 𝜙.

Please refer to the original paper for model training [Chiu, et al., ICLR’18].
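A much-simplified sketch of the inference-time behaviour only (the hard yes/no scan plus soft attention inside the chosen chunk); training uses a soft, expected version as described in [Chiu, et al., ICLR'18]. All callables are placeholders:

# Scan encoder outputs left to right, decide "here?" with a sigmoid on a learned
# energy, then attend over a fixed-size chunk ending at the chosen position.
import torch

def mocha_step(h, z, energy_fn, chunk_attend, start, chunk_size=2):
    """h: (T, d) encoder outputs; z: decoder state; start: where to resume scanning."""
    T = h.shape[0]
    for j in range(start, T):
        p_select = torch.sigmoid(energy_fn(h[j], z))      # "here?" probability
        if p_select > 0.5:                                 # yes: place the window ending at j
            lo = max(0, j - chunk_size + 1)
            context = chunk_attend(h[lo:j + 1], z)         # soft attention inside the chunk
            return context, j + 1                          # next scan resumes after j
    return None, T                                         # no selection: input exhausted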

Page 75:

Summary

LAS: just a seq2seq model
CTC: a seq2seq model whose decoder is a linear classifier
RNA: a seq2seq model that outputs exactly one token per input vector
RNN-T: a seq2seq model that can output multiple tokens per input vector
Neural Transducer: an RNN-T that reads one window of input at a time
MoChA: a Neural Transducer whose window is placed dynamically

Page 76:

Reference

• [Li, et al., ICASSP’19] Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, William Chan, Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes, ICASSP 2019

• [Bahdanau. et al., ICLR’15] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR, 2015

• [Bahdanau. et al., ICASSP’16] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, Yoshua Bengio, End-to-End Attention-based Large Vocabulary Speech Recognition, ICASSP, 2016

• [Chan, et al., ICASSP’16] William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, Listen, Attend and Spell, ICASSP, 2016

• [Chiu, et al., ICLR’18] Chung-Cheng Chiu, Colin Raffel, Monotonic Chunkwise Attention, ICLR, 2018

• [Chiu, et al., ICASSP’18] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani, State-of-the-art Speech Recognition With Sequence-to-Sequence Models, ICASSP, 2018

Page 77:

Reference

• [Chorowski, et al., NIPS’15] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio, Attention-Based Models for Speech Recognition, NIPS, 2015

• [Huang, et al., arXiv’19] Hongzhao Huang, Fuchun Peng, An Empirical Study of Efficient ASR Rescoring with Transformers, arXiv, 2019

• [Graves, et al., ICML’06] Alex Graves, Santiago Fernández, Faustino Gomez, Jürgen Schmidhuber, Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, ICML, 2006

• [Graves, ICML workshop’12] Alex Graves, Sequence Transduction with Recurrent Neural Networks, ICML workshop, 2012

• [Graves, et al., ICML’14] Alex Graves, Navdeep Jaitly, Towards end-to-end speech recognition with recurrent neural networks, ICML, 2014

• [Lu, et al., INTERSPEECH’15] Liang Lu, Xingxing Zhang, Kyunghyun Cho, Steve Renals, A Study of the Recurrent Neural Network Encoder-Decoder for Large Vocabulary Speech Recognition, INTERSPEECH, 2015

• [Luong, et al., EMNLP’15] Minh-Thang Luong, Hieu Pham, Christopher D. Manning, Effective Approaches to Attention-based Neural Machine Translation, EMNLP, 2015

Page 78:

Reference

• [Karita, et al., ASRU’19] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang, A Comparative Study on Transformer vs RNN in Speech Applications, ASRU, 2019

• [Soltau, et al., ICASSP’14] Hagen Soltau, George Saon, Tara N. Sainath, Joint training of convolutional and non-convolutional neural networks, ICASSP, 2014

• [Sak, et al., INTERSPEECH’15] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition, INTERSPEECH, 2015

• [Sak, et al., INTERSPEECH’17] Haşim Sak, Matt Shannon, Kanishka Rao, Françoise Beaufays, Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping, INTERSPEECH, 2017

• [Jaitly, et al., NIPS’16] Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, Samy Bengio, An Online Sequence-to-Sequence Model Using Partial Conditioning, NIPS, 2016

Page 79:

Reference

• [Rao, et al., ASRU’17] Kanishka Rao, Haşim Sak, Rohit Prabhavalkar, Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer, ASRU. 2017

• [Peddinti, et al., INTERSPEECH’15] Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts, INTERSPEECH, 2015

• [Yeh, et al., arXiv’19] Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer, Transformer-Transducer: End-to-End Speech Recognition with Self-Attention, arXiv, 2019

• [Zeyer, et al., ASRU’19] Albert Zeyer, Parnia Bahar, Kazuki Irie, Ralf Schlüter, Hermann Ney, A Comparison of Transformer and LSTM Encoder Decoder Models for ASR, ASRU, 2019