
Deep learning for (more than) speech recognition

IndabaX Western Cape, UCT, Apr. 2018

Herman Kamper

E&E Engineering, Stellenbosch University

http://www.kamperh.com/

Success in automatic speech recognition (ASR)

[Xiong et al., arXiv’16]; [Saon et al., arXiv’17]


Talk outline

1. State-of-the-art automatic speech recognition (ASR)

2. Examples of non-ASR speech processing (the first rant)

3. Examples of local work (a second rant)


State-of-the-art speech recognition

Supervised speech recognition

i had to think of some example speech

since speech recognition is really cool


Feature extraction for speech processing

[Figure: the waveform is sliced into 25 ms analysis windows, shifted by 10 ms; each window is mapped to a D-dimensional feature vector x^(t), giving the sequence X = (x^(1), x^(2), x^(3), x^(4), ...).]
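To make the framing step concrete, here is a minimal NumPy sketch (the window and shift values are from the slide; the per-frame mapping to D-dimensional features, e.g. log-mel filterbanks or MFCCs, is the standard follow-up step and is omitted here):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Slice a waveform into overlapping analysis frames.

    Each 25 ms window, shifted by 10 ms, is later mapped to a
    D-dimensional feature vector x^(t) (e.g. log-mel or MFCC features),
    giving the sequence X = (x^(1), ..., x^(N)).
    """
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    return frames * np.hamming(frame_len)  # taper each frame

# Example: one second of (random) audio gives 98 frames of 400 samples.
frames = frame_signal(np.random.randn(16000))
print(frames.shape)  # (98, 400)
```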

Feature extraction for speech processing

[Figure: spectrograms of the example speech, frequency (0-8000 Hz) against time (0.0-1.0 s).]

Name these networks

Image: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Name these networks

[Figure: a network diagram mapping input x^(i) to output y^(i).]

Name these networks

Image: http://deeplearning.net/tutorial/lenet.html

Name these networks

$p(x^{(1)}, x^{(2)}, \ldots, x^{(N)} \mid [\text{ih}])$

Hidden Markov models (HMMs)

$$
\begin{aligned}
W^* &= \arg\max_W P(W = w^{(1)}, w^{(2)}, \ldots, w^{(M)} \mid X = x^{(1)}, x^{(2)}, \ldots, x^{(N)}) \\
    &= \arg\max_W P(W \mid X) \\
    &= \arg\max_W \sum_U P(W, U \mid X) \qquad \left[\ \text{``without''} = \text{/w ih th aw t/}\ \right] \\
    &= \arg\max_W \sum_U \frac{p(W, U, X)}{p(X)} \\
    &= \arg\max_W \sum_U p(X \mid W, U)\, P(U \mid W)\, P(W) \\
    &\approx \arg\max_W \max_U\, p(X \mid W, U)\, P(U \mid W)\, P(W) \\
    &\approx \arg\max_W \max_U\, p(X \mid U)\, P(U \mid W)\, P(W)
\end{aligned}
$$

p(X|U): acoustic model; P(U|W): pronunciation dictionary; P(W): language model

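To make the final approximation concrete, here is a toy sketch of the noisy-channel scoring it implies. All tables and probabilities below are made up for illustration, and a real decoder searches a huge finite-state network rather than an explicit list of candidates:

```python
import numpy as np

# Toy, hypothetical model components: all probabilities are made up.
lm = {("without",): 0.10, ("with", "out"): 0.02}              # P(W)
pron = {                                                      # P(U | W)
    ("without",): {("w", "ih", "th", "aw", "t"): 0.7,
                   ("w", "ih", "dh", "aw", "t"): 0.3},
    ("with", "out"): {("w", "ih", "th", "aw", "t"): 1.0},
}

def log_acoustic(phones):
    """Stand-in for log p(X | U): a real system scores the audio X with
    per-phone HMMs; here we just reward overlap with a fixed target."""
    target = ("w", "ih", "th", "aw", "t")
    return float(sum(a == b for a, b in zip(phones, target)) - len(target))

def decode(candidates):
    """W* = argmax_W max_U [ log p(X|U) + log P(U|W) + log P(W) ]."""
    def score(words):
        return max(log_acoustic(u) + np.log(p_u) + np.log(lm[words])
                   for u, p_u in pron[words].items())
    return max(candidates, key=score)

print(decode([("without",), ("with", "out")]))   # ('without',)
```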

Hidden Markov models (HMMs)

[Figure: a left-to-right HMM for the phone [ih] generating the acoustic frames x^(1), ..., x^(6) of X, i.e. p(X | [ih]).]

Speech recognition was performed by combining the acoustic model (thousands of HMM states) with the pronunciation dictionary and the language model in a (very big) decoder network (a finite-state machine).


Back to today: End-to-end speech recognition

[Chan et al., arXiv’15]

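A minimal PyTorch sketch in the spirit of this attention-based encoder-decoder; the layer sizes are arbitrary assumptions, and the pyramidal encoder, beam-search decoding and other details of the actual model are omitted:

```python
import torch
import torch.nn as nn

class Seq2SeqASR(nn.Module):
    """Attention-based encoder-decoder: acoustic frames in, character
    logits out. A sketch only; a real system adds a pyramidal encoder,
    beam search, scheduled sampling, and much more."""

    def __init__(self, n_feats=40, n_chars=30, hidden=256):
        super().__init__()
        self.hidden = hidden
        self.encoder = nn.LSTM(n_feats, hidden, batch_first=True,
                               bidirectional=True)
        self.embed = nn.Embedding(n_chars, hidden)
        self.query = nn.Linear(hidden, 2 * hidden)     # attention query
        self.decoder = nn.LSTMCell(3 * hidden, hidden)
        self.out = nn.Linear(3 * hidden, n_chars)

    def forward(self, feats, chars):
        enc, _ = self.encoder(feats)                   # (B, T, 2H)
        B = feats.size(0)
        h = feats.new_zeros(B, self.hidden)
        c = feats.new_zeros(B, self.hidden)
        logits = []
        for t in range(chars.size(1)):                 # teacher forcing
            # Dot-product attention over all encoder states.
            scores = torch.bmm(enc, self.query(h).unsqueeze(2)).squeeze(2)
            alpha = torch.softmax(scores, dim=1)       # (B, T)
            context = torch.bmm(alpha.unsqueeze(1), enc).squeeze(1)
            h, c = self.decoder(
                torch.cat([self.embed(chars[:, t]), context], dim=1), (h, c))
            logits.append(self.out(torch.cat([h, context], dim=1)))
        return torch.stack(logits, dim=1)              # (B, L, n_chars)

model = Seq2SeqASR()
x = torch.randn(2, 100, 40)        # batch of 2 utterances, 100 frames each
y = torch.randint(0, 30, (2, 12))  # previous characters (teacher forcing)
print(model(x, y).shape)           # torch.Size([2, 12, 30])
```

The decoder attends over the encoder states at every output step, which is what lets the model learn the alignment between audio and characters without an explicit HMM.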

Why did we talk about HMMs?

• Could we use a standard feedforward deep neural network (DNN) for ASR?

• Idea: Use the HMM to obtain frame alignments for the DNN!

• Hybrid model: DNN-HMM

• Can be seen as representation learning trained jointly with a classifier

[Figure: a DNN mapping input frames x^(i) to output y^(i), a posterior over HMM states s1, s2, ..., s9000.]

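A minimal sketch of the hybrid setup (the layer sizes and 11-frame context window are assumptions; the 9000 states match the slide):

```python
import torch
import torch.nn as nn

# Frame classifier for a hybrid DNN-HMM system: input is a window of
# acoustic frames, output is a posterior over (here) 9000 HMM states.
# The state labels come from a forced alignment with a trained HMM system.
dnn = nn.Sequential(
    nn.Linear(11 * 40, 2048), nn.ReLU(),   # 11-frame context window
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 9000),                 # one logit per HMM state
)

frames = torch.randn(32, 11 * 40)          # a minibatch of spliced frames
states = torch.randint(0, 9000, (32,))     # aligned HMM state labels
loss = nn.CrossEntropyLoss()(dnn(frames), states)
loss.backward()

# At test time the posteriors are converted to scaled likelihoods
# p(x|s) ∝ p(s|x) / p(s) and plugged back into the HMM decoder.
```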

What about convolutional neural networks?

[Figure: spectrograms of the example speech, frequency (0-8000 Hz) against time (s).]

Is end-to-end the best?

• End-to-end models are easier to implement [1]

• But do they give state-of-the-art performance?

• What do you think CLDNN-HMM [2] stands for?

[1] https://github.com/espnet/espnet    [2] [Sainath et al., ICASSP’15]


Summary: Speech recognition is important, but. . .

• A very important engineering endeavour: information access, illiteracy, assistance for the disabled

• But it is more: speech and language make us human

• Engineering decisions can tell us something about how we perceive the world: we saw how structure helps in speech recognition models

• And studies of how we perceive the world can tell us something about better engineering decisions


Rant 1: Do we always need/have ASR?

Examples of non-ASR speech processing

What if we do not have supervision?

• Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)

• Data: 2000 hours of transcribed speech audio; ∼350M/560M words of text

• Can we do this for all 7000 languages spoken in the world?

• Many of these languages are endangered and unwritten


Example 1: Query-by-example search

Spoken query: [audio]

A useful speech system, not requiring any transcribed speech

[Jansen and Van Durme, Interspeech’12]

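One classic way to build such a system is dynamic time warping (DTW) between the spoken query and sliding windows of the search collection. A minimal NumPy sketch of the alignment cost (the feature choice and window sizes are assumptions):

```python
import numpy as np

def dtw_cost(query, segment):
    """Normalised DTW alignment cost between two feature sequences
    (frames x dims); lower means a better match to the spoken query."""
    Q, S = len(query), len(segment)
    dist = np.linalg.norm(query[:, None, :] - segment[None, :, :], axis=2)
    acc = np.full((Q, S), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(Q):
        for j in range(S):
            if i == j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = dist[i, j] + best_prev
    return acc[-1, -1] / (Q + S)

# Rank utterances by sliding the query over each one and keeping the
# minimum cost; the best-scoring utterances are returned to the user.
query = np.random.randn(50, 39)        # e.g. 50 frames of MFCCs
utterance = np.random.randn(300, 39)
costs = [dtw_cost(query, utterance[s:s + 60])
         for s in range(0, 300 - 60, 10)]
print(min(costs))
```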

Example 2: Linguistic and cultural documentation

http://www.stevenbird.net/

Example 3: Teaching robots to understand speech

[Janssens and Renkens, 2014]; [Renkens et al., SLT’14]

Rant 2: Taking inspiration from humans

Examples of local work

Supervised speech recognition

i had to think of some example speech

since speech recognition is really cool

Can we acquire language from audio alone?


Full-coverage segmentation and clustering


Unsupervised segmental Bayesian model

[Figure: the segmental Bayesian pipeline. The speech waveform is mapped by f_a(·) to acoustic frames y_{1:M}; hypothesised word segments are mapped by f_e(·) to fixed-dimensional embeddings x_i = f_e(y_{t1:t2}); a Bayesian Gaussian mixture model scores p(x_i | h^-). Acoustic modelling and word segmentation are iterated.]
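A minimal sketch of the acoustic modelling step under strong simplifying assumptions: a downsampling choice of f_e, and scikit-learn's BayesianGaussianMixture standing in for the model's own collapsed Gibbs sampler, with the word segmentation loop omitted:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def embed(segment, n_keep=10):
    """A simple stand-in for f_e: uniformly resample a variable-length
    segment (frames x dims) to n_keep frames and flatten."""
    idx = np.linspace(0, len(segment) - 1, n_keep).astype(int)
    return segment[idx].ravel()

# Pretend word segments: variable-length sequences of 13-dim features.
rng = np.random.default_rng(0)
segments = [rng.standard_normal((rng.integers(20, 60), 13))
            for _ in range(100)]
X = np.stack([embed(s) for s in segments])

# Bayesian GMM: the Dirichlet prior on the mixture weights lets the
# model switch off unneeded components, so the number of discovered
# "word types" is driven by the data rather than fixed in advance.
bgmm = BayesianGaussianMixture(n_components=20, covariance_type="diag",
                               weight_concentration_prior=0.1,
                               random_state=0).fit(X)
print(bgmm.predict(X)[:10])    # cluster (word type) assignments
```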

Listen to discovered clusters

• Small-vocabulary cluster 45: [audio]

• Large-vocabulary English cluster 1214: [audio]

• Large-vocabulary Xitsonga cluster 629: [audio]

Arrival

Using images for grounding language

Consider images paired with unlabelled spoken captions:

[audio]


Map images and speech into common space

[Figure: an image branch (VGG) produces y_vis; a speech branch (convolution, max pooling and feedforward layers over X) produces y_spch; the two are compared with a distance d(y_vis, y_spch).]

[Harwath et al., NIPS’16]
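A minimal PyTorch sketch of training such a common space with a margin-based ranking loss (the branches are collapsed to toy layers, VGG features are assumed precomputed, and the loss in the cited work may differ in detail):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vis_net = nn.Linear(4096, 512)          # on top of precomputed VGG features
spch_net = nn.Sequential(               # stand-in for the conv/pooling branch
    nn.Linear(40 * 100, 1024), nn.ReLU(), nn.Linear(1024, 512))

def ranking_loss(vgg_feats, speech, margin=1.0):
    """Matched image-caption pairs should be more similar than any
    mismatched pair in the batch, by at least `margin`."""
    y_vis = F.normalize(vis_net(vgg_feats), dim=1)
    y_spch = F.normalize(spch_net(speech.flatten(1)), dim=1)
    sim = y_vis @ y_spch.T                        # (B, B) pair similarities
    pos = sim.diag().unsqueeze(1)                 # matched pairs (diagonal)
    hinge = F.relu(margin + sim - pos)            # mismatched-pair violations
    hinge = hinge * (1 - torch.eye(sim.size(0)))  # ignore the matched pairs
    return hinge.mean()

vgg = torch.randn(16, 4096)                       # 16 image-caption pairs
speech = torch.randn(16, 40, 100)                 # log-mel caption features
ranking_loss(vgg, speech).backward()
```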

Visually grounded keyword spotting

Keyword   Example of matched utterance                           Type
beach     a boy in a yellow shirt is walking on a beach . . .    correct
behind    a surfer does a flip on a wave                         mistake
bike      a dirt biker flies through the air                     variant
boys      two children play soccer in the park                   semantic
large     . . . a rocky cliff overlooking a body of water        semantic
play      children playing in a ball pit                         variant
sitting   two people are seated at a table with drinks           semantic
yellow    a tan dog jumping over a red and blue toy              mistake
young     a little girl on a kid swing                           semantic

[Kamper et al., Interspeech’17]

Summary and conclusion

What did we chat about today?

• Supervised speech recognition: From HMMs all the way to CLDNNs

• Structure is still important in speech recognition

• Saw three examples of models that do not require ASR

• Looked at local work taking inspiration from humans


What’s next (specifically for us)?

• Still many unsolved core machine learning problems in unsupervised and low-resource speech processing

• Building speech search systems for (South) African languages

• Can some of these approaches be used in other machine learning domains? E.g. can vision tell us something about speech?

• What can we learn about language acquisition in humans?

• Language acquisition in robots

• Main take-away: Look at machine learning problems from different perspectives and angles

http://www.kamperh.com/

https://github.com/kamperh

Backup slides

Acoustic word embeddings (AWEs)

[Figure: variable-length segments Y1 and Y2 are mapped by f_e(·) to acoustic word embeddings x1, x2 ∈ R^D.]

[Levin et al., ASRU’13]
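A minimal sketch of one simple choice of f_e from this line of work: uniformly downsample the segment and flatten, so segments of different duration become directly comparable fixed-dimensional vectors:

```python
import numpy as np

def downsample_embed(Y, n_keep=10):
    """Map a variable-length segment Y (frames x dims) to a fixed
    D-dimensional acoustic word embedding by uniform downsampling."""
    idx = np.linspace(0, len(Y) - 1, n_keep).astype(int)
    return Y[idx].ravel()

Y1 = np.random.randn(45, 13)   # two segments of different durations
Y2 = np.random.randn(80, 13)
x1, x2 = downsample_embed(Y1), downsample_embed(Y2)

# Cosine distance between embeddings stands in for comparing the
# original variable-length segments (e.g. instead of running DTW).
cos = 1 - x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))
print(x1.shape, cos)           # (130,) ...
```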

Word similarity Siamese CNN

Use the idea of Siamese networks [Bromley et al., PatRec’93]

[Figure: two segments Y1 and Y2 pass through the same network f, giving x1 = f(Y1) and x2 = f(Y2); training minimises a loss l(x1, x2) based on the distance between the embeddings.]

[Kamper et al., ICASSP’15]
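A minimal PyTorch sketch of the Siamese training signal (the shared encoder is reduced to a toy feedforward network standing in for the CNN, the inputs are fixed-dimensional segment representations such as the 130-dim downsampled embeddings above, and the margin value is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(        # shared weights: the "Siamese" part
    nn.Linear(130, 256), nn.ReLU(), nn.Linear(256, 64))

def contrastive_loss(Y1, Y2, same_word, margin=0.5):
    """Pull embeddings of same-word pairs together; push different-word
    pairs at least `margin` apart (in cosine distance)."""
    x1, x2 = encoder(Y1), encoder(Y2)
    d = 1 - F.cosine_similarity(x1, x2)
    return torch.where(same_word, d, F.relu(margin - d)).mean()

Y1, Y2 = torch.randn(8, 130), torch.randn(8, 130)   # segment representations
same = torch.randint(0, 2, (8,)).bool()             # word-pair labels
loss = contrastive_loss(Y1, Y2, same)
loss.backward()
```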

Retrieval in common (semantic) space

[Figure: image embedding y_vis and speech embedding y_spch as points y ∈ R^D in a common D-dimensional (semantic) space.]

[Harwath et al., NIPS’16]

Word prediction from images and speech

[Figure: a VGG vision network maps the image to soft word probabilities y_vis (e.g. hat: 0.85, man: 0.8, shirt: 0.9). A speech network (convolution, max pooling and feedforward layers) maps the spoken caption X to f(X) ∈ R^W, a vector of word probabilities, trained with a loss L against the vision network's outputs.]

I.e., a spoken bag-of-words (BoW) classifier

[Kamper et al., Interspeech’17]
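A minimal PyTorch sketch of the training setup: the vision network's soft word probabilities act as targets for the speech network, so no transcriptions are needed (the shapes, layer sizes and vocabulary size W are assumptions, and the VGG-side tagger is taken as given):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

W = 1000                                   # vocabulary of visual words

speech_net = nn.Sequential(                # stand-in for conv + max pooling
    nn.Linear(40 * 100, 1024), nn.ReLU(),
    nn.Linear(1024, W),                    # f(X): one logit per word
)

def bow_loss(speech, vgg_word_probs):
    """Cross-entropy of the speech network's word probabilities f(X)
    against soft targets from the image tagger: a spoken bag-of-words
    classifier trained without any transcriptions."""
    logits = speech_net(speech.flatten(1))
    return F.binary_cross_entropy_with_logits(logits, vgg_word_probs)

speech = torch.randn(16, 40, 100)          # spoken captions (log-mels)
targets = torch.rand(16, W)                # VGG soft labels in [0, 1]
bow_loss(speech, targets).backward()

# At test time, sigmoid(f(X)) above a threshold gives the keywords
# predicted to occur in the utterance (or in its semantic context).
```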