"Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс
1
2
Automatic speech recognition for mobile applications in Yandex
Fran Campillo
3
Outline
● Motivation.
● Road map.
● Automatic speech recognition.
● Data collection.
● Experiments.
● Results.
4
Motivation
5
Motivation
6
Road map
7
Road map
● Sep-2011: study of open source tools and data collection.
– HTK, Sphinx, RASR, Kaldi, ...
– Service provided by a 3rd party.
● Jan-2012: development of in-house technology.
● Jan-2013: launch of our own services.
8
Automatic speech recognition
9
ASR: complexity

                     Simpler     Harder
Style                Planned     Spontaneous
Audio quality        CD          Telephone
Vocabulary size      Hundreds    Hundreds of thousands
Number of speakers   One         Many
Recognition rate     Better      Worse
Complexity           Smaller     Bigger
10
Word pronunciations
● ASR: sounds => words.
● How is a word pronounced?
– Line => /'laɪn/.
– Linear => /'lɪnɪɘʳ/.
● Need a mapping from writing to phonemes: G2P.
11
Word pronunciations: dictionary
а              a
аб             a tc p
абад           a dc b a tc t
абаза          a dc b a z a
абакан         a dc b a tc k ax n
абакана        a dc b a tc k a n a
абакане        a dc b a tc k a nj e
абаканская     a dc b a tc k a n s tc k ax j a
абаканский     a dc b a tc k a n s tc kj I j
абакумова      a dc b a tc k u m ax v a
абанский       a dc b a n s tc kj I j
абганеровская  a dc b dc g ax nj I r ax f s tc k ax j a
абдулино       a dc b dc dK& u lj i n a
абельмановская a dc bj e lj m ax n ax f s tc k ax j a
абзаково       a dc b z a tc k o v a
абзелиловский  a dc b zj i lj i l ax f s tc kj I j
12
Speech parametrization
[Figure: phone /a/ vs. phone /i/]
13
ASR: the problem
● We have a sequence of observations: O = {o_1, o_2, …, o_T}
– o_i is a feature vector representing a speech frame.
● Goal: find the likeliest sequence of words w_i for O:
argmax_i P(w_i | O)
14
ASR: the problem (II)
● We cannot compute P(w_i | O) directly.
● Bayes: P(w_i | O) = P(O | w_i) P(w_i) / P(O)
argmax_i P(w_i | O) = argmax_i { P(O | w_i) P(w_i) }
where P(O | w_i) is the acoustic model and P(w_i) is the language model.
15
Language model
● Probability of sequences of words:
– "We will rock you" => P_1.
– "Will will rock you" => P_2.
● Trained on large corpora.
● The closer to the application domain, the better.
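The "We will rock you" vs. "Will will rock you" comparison can be reproduced with a toy n-gram model. A minimal sketch (not Yandex's actual language model; the tiny corpus and the add-alpha smoothing below are illustrative assumptions):

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """P(sentence) under an add-alpha smoothed bigram model."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return p
```

After training on a corpus that contains "we will rock you", the well-formed sentence scores higher than "will will rock you", which is exactly the P_1 > P_2 ranking the slide illustrates.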
16
Acoustic model: Hidden Markov Models
● HMM of first order: sequence of states that depends only on the previous state, and is associated with events we can observe.
● Typical layout for ASR:
[Diagram: three-state left-to-right HMM Q1 -> Q2 -> Q3, with self-loop transitions a11, a22, a33, forward transitions a12, a23, and output distributions b1(o), b2(o), b3(o)]
● a_ij: transition probabilities.
● b_j(o): probability of observation o in state j.
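The probability that such an HMM produced an observation sequence is computed with the forward algorithm. A minimal sketch over a toy three-state left-to-right topology; the transition values and the discrete emission tables are made-up numbers, not the continuous b_j(o) densities a real recognizer uses:

```python
def forward(obs, trans, emit, init):
    """Forward algorithm: returns P(obs | HMM).
    trans[i][j] = a_ij, emit[j][o] = b_j(o), init[j] = P(start in state j)."""
    n = len(trans)
    alpha = [init[j] * emit[j].get(obs[0], 0.0) for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j].get(o, 0.0)
                 for j in range(n)]
    return sum(alpha)

# Toy left-to-right HMM: Q1 -> Q2 -> Q3 with self-loops (illustrative numbers)
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
emit = [{"x": 0.9, "y": 0.1},
        {"x": 0.2, "y": 0.8},
        {"x": 0.5, "y": 0.5}]
init = [1.0, 0.0, 0.0]  # a left-to-right model always starts in the first state
```

Each step sums over all ways of reaching a state and weighs them by the transition probability a_ij and the observation probability b_j(o), exactly the two quantities the slide defines.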
17
Acoustic model: HMM and speech
● Each state models a part of the phoneme:
– 1st: beginning of the phoneme.
– 2nd: stationary part.
– 3rd: end of the phoneme.
● a_ij: duration of each part.
● b_j(o): probability of producing a vector of features o in state j.
18
Modeling probability of observation
● Gaussian mixtures:
b_j(x) = Σ_m c_jm N(x; μ_jm, Σ_jm)
– c_jm: weight of the m-th Gaussian of state j.
– μ_jm: mean (vector) of the m-th Gaussian of state j.
– Σ_jm: covariance matrix of the m-th Gaussian of state j.
● Neural networks.
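The mixture density above can be evaluated directly. A minimal sketch assuming diagonal covariance matrices (a common simplification in GMM-based ASR, not necessarily the exact form used here):

```python
import math

def gmm_likelihood(x, weights, means, variances):
    """b_j(x) = sum_m c_jm * N(x; mu_jm, Sigma_jm), diagonal covariances.

    weights[m] = c_jm, means[m] = mu_jm (vector), variances[m] = the
    diagonal of Sigma_jm (vector of per-dimension variances)."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        # log-density of a diagonal-covariance Gaussian, then exponentiate
        log_n = sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
                    for xi, m, v in zip(x, mu, var))
        total += c * math.exp(log_n)
    return total
```

For a single standard 1-D Gaussian component, b_j(0) comes out as 1/sqrt(2*pi), which is a quick sanity check on the density.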
19
Waveform, phonemes, frames, and states
[Figure: waveform of phone /o/ split into frames o_1 … o_10, aligned to a three-state HMM:
Q1 => o_1, o_2 (parameters μ_1m, Σ_1m, c_1m)
Q2 => o_3, o_4, o_5, o_6, o_7 (parameters μ_2m, Σ_2m, c_2m)
Q3 => o_8, o_9, o_10 (parameters μ_3m, Σ_3m, c_3m)]
20
Block diagram for training
[Diagram: prototype HMM => initialization (initial μ_jm, Σ_jm, c_jm for the GMMs) => Baum-Welch over the training sentences (alignments of observations to states) => HMM parameter update (new estimates for μ_jm, Σ_jm, c_jm) => convergence check: if not converged, repeat Baum-Welch; if converged, output trained models]
21
Decoding
● Lexicon: words that can be recognized.
● Decoder: dynamic programming, with the constraints imposed by the lexicon, the acoustic models, and the language model.
[Diagram: speech signal => parametrization => decoder (using lexicon, acoustic models, language model) => words]
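The dynamic programming the decoder performs is, at its core, the Viterbi algorithm. A minimal sketch over the same kind of toy HMM used above (hypothetical transition/emission numbers, and ignoring the lexicon and language-model constraints the real decoder applies):

```python
def viterbi(obs, trans, emit, init):
    """Most likely state sequence for obs (dynamic programming).
    trans[i][j] = a_ij, emit[j][o] = b_j(o), init[j] = P(start in state j)."""
    n = len(trans)
    delta = [init[j] * emit[j].get(obs[0], 0.0) for j in range(n)]
    back = []
    for o in obs[1:]:
        ptrs, new = [], []
        for j in range(n):
            # best predecessor state for reaching j at this frame
            best_i = max(range(n), key=lambda i: delta[i] * trans[i][j])
            ptrs.append(best_i)
            new.append(delta[best_i] * trans[best_i][j] * emit[j].get(o, 0.0))
        back.append(ptrs)
        delta = new
    # backtrack from the best final state
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return list(reversed(path))

# Toy left-to-right HMM (illustrative numbers only)
trans = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
emit = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}, {"x": 0.5, "y": 0.5}]
init = [1.0, 0.0, 0.0]
```

The max over predecessors replaces the sum used in the forward algorithm: instead of the total probability of the observations, we recover the single best path through the states.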
22
Our decoder
● Based on weighted finite-state transducers (WFST).
● The lexicon, the language model, and the acoustic model are composed into a single structure.
– Same information, but more efficient.
[Diagram: lexicon + acoustic models + language model => HCLG]
23
Composition of WFST: example
[Diagram: lexicon transducer with states 0–6 mapping phone sequences to words, e.g. arcs b:Bob, ah:ε, b:ε spell "Bob" and l:likes, ay:ε, k:ε, s:ε spell "likes", composed with a language model]
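The lexicon constraint the transducer encodes (only phone sequences that spell known words are accepted, and each accepted sequence outputs its word) can be imitated with a plain trie. This is a toy stand-in for the lexicon FST, not part of any real WFST toolkit; `build_trie` and `phones_to_words` are illustrative helpers:

```python
def build_trie(lexicon):
    """Trie over phone sequences; a '#' key marks an end-of-word node."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node["#"] = word
    return root

def phones_to_words(phones, trie):
    """Greedily segment a phone sequence into words (longest match)."""
    words, i = [], 0
    while i < len(phones):
        node, j, last = trie, i, None
        while j < len(phones) and phones[j] in node:
            node = node[phones[j]]
            j += 1
            if "#" in node:
                last = (j, node["#"])
        if last is None:
            raise ValueError(f"no word in the lexicon matches at position {i}")
        i, word = last
        words.append(word)
    return words
```

With the slide's two-word lexicon, the phone string "b ah b l ay k s" segments into "Bob likes". A real decoder does the same kind of constrained search, but jointly with the acoustic scores and the language model, over the composed HCLG graph.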
24
Data collection
25
Data collection
● Speech samples taken from the field.
● Manual transcriptions:
– Speaker features: gender, native, ...
– Anomalies in the pronunciation.
– Noises in the recording.
26
Manual transcriptions
● 600k recordings.
● Uncompressed format: 8 kHz and 16 kHz.
● 286,020 different speakers.

         Percentage (%)
Native   87.7
Male     83.3
Female   8.5
Child    8.2
27
Manual transcriptions
● Percentage of records without anomalies: 7.4%

Anomalies                 Percentage (%)
side_speech               14.4
speech-in-noise           71.5
Indistinguishable         3.7
mouth_noise               3.6
breath_noise              6.3
Irregular pronunciations  5.3
Hesitations               0.5
Fragments                 5.5
Transient noise           14.0
Foreign words             0.1
28
Manual transcriptions: examples
● марциальные воды (male, native)
● *трёx#пруд#ньій* (male, native, speech-in-noise)
● [side_speech] чкалова (male, native, speech-in-noise, bad-audio)
29
VisualQA
30
Experiments
31
Grapheme-2-phoneme
● Sequitur:
– Based on joint-sequence models.
– Accuracy => 2.09% phoneme error rate.
● Phonetisaurus:
– WFST-based.
– Accuracy => 1.04% phoneme error rate.
● Special treatment for Latin words:
– G2P trained on a transliterated version of the Russian pronunciation (for example: whatsapp => уотсап).
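Phoneme error rate, the metric used to compare Sequitur and Phonetisaurus above, is the edit (Levenshtein) distance between the predicted and reference phoneme sequences, divided by the reference length. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

The same computation over word sequences instead of phoneme sequences gives the word error rate (WER) reported in the results section.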
32
Noise models
33
Experiments: acoustic model vs. language model
34
Experiments: number of Gaussians
35
Results
36
Users: Navigator
37
Results: relative word error rate
● Results relative to our WER in each experiment (in red, experiments in which our system is outperformed):

             Maps    Navigation   General search
Yandex-GMM   1       1            1
3rd Party    44.6%   31.8%        37.3%
Competitor   1.9%    -9.7%        -23.4%

             General search
Yandex-DNN   1
Competitor   6.6%
38
Thanks for your attention!
39
Fran Campillo
Senior Software Engineer, PhD
Yandex Speech Group
[email protected]