"Automatic speech recognition for mobile applications in Yandex" — Fran Campillo, Яндекс
1
2
Automatic speech recognition for mobile applications in Yandex
Fran Campillo
3
Outline
● Motivation.
● Road map.
● Automatic speech recognition.
● Data collection.
● Experiments.
● Results.
4
Motivation
5
Motivation
6
Road map
7
Road map
● Sep-2011: study of open source tools and data collection.
– HTK, Sphinx, RASR, Kaldi, ...
– Service provided by a 3rd party.
● Jan-2012: development of in-house technology.
● Jan-2013: launch of our own services.
8
Automatic speech recognition
9
ASR: complexity

                     Simpler     Harder
Style                Planned     Spontaneous
Audio quality        CD          Telephone
Vocabulary size      Hundreds    Hundreds of thousands
Number of speakers   One         Many
Recognition rate     Better      Worse
Complexity           Smaller     Bigger
10
Word pronunciations
● ASR: sounds => words.
● How is a word pronounced?
– Line => /'laɪn/.
– Linear => /'lɪnɪɘʳ/.
● Need a mapping from writing to phonemes: G2P.
11
Word pronunciations: dictionary
а              a
аб             a tc p
абад           a dc b a tc t
абаза          a dc b a z a
абакан         a dc b a tc k ax n
абакана        a dc b a tc k a n a
абакане        a dc b a tc k a nj e
абаканская     a dc b a tc k a n s tc k ax j a
абаканский     a dc b a tc k a n s tc kj I j
абакумова      a dc b a tc k u m ax v a
абанский       a dc b a n s tc kj I j
абганеровская  a dc b dc g ax nj I r ax f s tc k ax j a
абдулино       a dc b dc dK& u lj i n a
абельмановская a dc bj e lj m ax n ax f s tc k ax j a
абзаково       a dc b z a tc k o v a
абзелиловский  a dc b zj i lj i l ax f s tc kj I j
12
Speech parametrization
[Figure: phone /a/ vs. phone /i/]
13
ASR: the problem
● We have a sequence of observations: O = {o_1, o_2, …, o_T}
– o_i is a feature vector representing a speech frame.
● Goal: find the likeliest sequence of words w_i for O:
argmax_i P(w_i | O)
14
ASR: the problem (II)
● We cannot compute P(w_i | O) directly.
● Bayes: P(w_i | O) = P(O | w_i) P(w_i) / P(O)
argmax_i P(w_i | O) = argmax_i { P(O | w_i) P(w_i) }
where P(O | w_i) is the acoustic model and P(w_i) is the language model.
15
Language model
● Probability of sequences of words:
– "We will rock you" => P_1.
– "Will will rock you" => P_2.
● Trained on large corpora.
● The closer to the application domain, the better.
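The "We will rock you" vs. "Will will rock you" comparison can be reproduced with a toy n-gram model. A minimal sketch (not Yandex's actual language model; the tiny corpus and the add-alpha smoothing below are illustrative assumptions):

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """P(sentence) under an add-alpha smoothed bigram model."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return p
```

After training on a corpus that contains "we will rock you", the well-formed sentence scores higher than "will will rock you", which is exactly the P_1 > P_2 ranking the slide illustrates.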
16
Acoustic model: Hidden Markov Models
● HMM of first order: sequence of states that depends only on the previous state, and is associated with events we can observe.
● Typical layout for ASR:
[Diagram: three-state left-to-right HMM Q1 -> Q2 -> Q3, with self-loop transitions a11, a22, a33, forward transitions a12, a23, and output distributions b1(o), b2(o), b3(o)]
● a_ij: transition probabilities.
● b_j(o): probability of observation o in state j.
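The probability that such an HMM produced an observation sequence is computed with the forward algorithm. A minimal sketch over a toy three-state left-to-right topology; the transition values and the discrete emission tables are made-up numbers, not the continuous b_j(o) densities a real recognizer uses:

```python
def forward(obs, trans, emit, init):
    """Forward algorithm: returns P(obs | HMM).
    trans[i][j] = a_ij, emit[j][o] = b_j(o), init[j] = P(start in state j)."""
    n = len(trans)
    alpha = [init[j] * emit[j].get(obs[0], 0.0) for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j].get(o, 0.0)
                 for j in range(n)]
    return sum(alpha)

# Toy left-to-right HMM: Q1 -> Q2 -> Q3 with self-loops (illustrative numbers)
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
emit = [{"x": 0.9, "y": 0.1},
        {"x": 0.2, "y": 0.8},
        {"x": 0.5, "y": 0.5}]
init = [1.0, 0.0, 0.0]  # a left-to-right model always starts in the first state
```

Each step sums over all ways of reaching a state and weighs them by the transition probability a_ij and the observation probability b_j(o), exactly the two quantities the slide defines.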
17
Acoustic model: HMM and speech
● Each state models a part of the phoneme:
– 1st: beginning of the phoneme.
– 2nd: stationary part.
– 3rd: end of the phoneme.
● a_ij: duration of each part.
● b_j(o): probability of producing a vector of features o in state j.
18
Modeling probability of observation
● Gaussian mixtures:
b_j(x) = Σ_m c_jm N(x; μ_jm, Σ_jm)
– c_jm: weight of the m-th Gaussian of state j.
– μ_jm: mean (vector) of the m-th Gaussian of state j.
– Σ_jm: covariance matrix of the m-th Gaussian of state j.
● Neural networks.
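The mixture density above can be evaluated directly. A minimal sketch assuming diagonal covariance matrices (a common simplification in GMM-based ASR, not necessarily the exact form used here):

```python
import math

def gmm_likelihood(x, weights, means, variances):
    """b_j(x) = sum_m c_jm * N(x; mu_jm, Sigma_jm), diagonal covariances.

    weights[m] = c_jm, means[m] = mu_jm (vector), variances[m] = the
    diagonal of Sigma_jm (vector of per-dimension variances)."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        # log-density of a diagonal-covariance Gaussian, then exponentiate
        log_n = sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
                    for xi, m, v in zip(x, mu, var))
        total += c * math.exp(log_n)
    return total
```

For a single standard 1-D Gaussian component, b_j(0) comes out as 1/sqrt(2*pi), which is a quick sanity check on the density.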
19
Waveform, phonemes, frames, and states
[Figure: waveform of phone /o/ split into frames o_1 … o_10, aligned to a three-state HMM:
Q1 => o_1, o_2 (parameters μ_1m, Σ_1m, c_1m)
Q2 => o_3, o_4, o_5, o_6, o_7 (parameters μ_2m, Σ_2m, c_2m)
Q3 => o_8, o_9, o_10 (parameters μ_3m, Σ_3m, c_3m)]
20
Block diagram for training
[Diagram: prototype HMM => initialization (initial μ_jm, Σ_jm, c_jm for the GMMs) => Baum-Welch over the training sentences (alignments of observations to states) => HMM parameter update (new estimates for μ_jm, Σ_jm, c_jm) => convergence check: if not converged, repeat Baum-Welch; if converged, output trained models]
21
Decoding
● Lexicon: words that can be recognized.
● Decoder: dynamic programming, with the constraints imposed by the lexicon, the acoustic models, and the language model.
[Diagram: speech signal => parametrization => decoder (using lexicon, acoustic models, language model) => words]
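The dynamic programming the decoder performs is, at its core, the Viterbi algorithm. A minimal sketch over the same kind of toy HMM used above (hypothetical transition/emission numbers, and ignoring the lexicon and language-model constraints the real decoder applies):

```python
def viterbi(obs, trans, emit, init):
    """Most likely state sequence for obs (dynamic programming).
    trans[i][j] = a_ij, emit[j][o] = b_j(o), init[j] = P(start in state j)."""
    n = len(trans)
    delta = [init[j] * emit[j].get(obs[0], 0.0) for j in range(n)]
    back = []
    for o in obs[1:]:
        ptrs, new = [], []
        for j in range(n):
            # best predecessor state for reaching j at this frame
            best_i = max(range(n), key=lambda i: delta[i] * trans[i][j])
            ptrs.append(best_i)
            new.append(delta[best_i] * trans[best_i][j] * emit[j].get(o, 0.0))
        back.append(ptrs)
        delta = new
    # backtrack from the best final state
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return list(reversed(path))

# Toy left-to-right HMM (illustrative numbers only)
trans = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
emit = [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}, {"x": 0.5, "y": 0.5}]
init = [1.0, 0.0, 0.0]
```

The max over predecessors replaces the sum used in the forward algorithm: instead of the total probability of the observations, we recover the single best path through the states.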
22
Our decoder
● Based on weighted finite-state transducers (WFST).
● The lexicon, the language model, and the acoustic model are composed into a single structure.
– Same information, but more efficient.
[Diagram: lexicon + acoustic models + language model => HCLG]
23
Composition of WFST: example
[Diagram: lexicon transducer with states 0–6 mapping phone sequences to words, e.g. arcs b:Bob, ah:ε, b:ε spell "Bob" and l:likes, ay:ε, k:ε, s:ε spell "likes", composed with a language model]
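The lexicon constraint the transducer encodes (only phone sequences that spell known words are accepted, and each accepted sequence outputs its word) can be imitated with a plain trie. This is a toy stand-in for the lexicon FST, not part of any real WFST toolkit; `build_trie` and `phones_to_words` are illustrative helpers:

```python
def build_trie(lexicon):
    """Trie over phone sequences; a '#' key marks an end-of-word node."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node["#"] = word
    return root

def phones_to_words(phones, trie):
    """Greedily segment a phone sequence into words (longest match)."""
    words, i = [], 0
    while i < len(phones):
        node, j, last = trie, i, None
        while j < len(phones) and phones[j] in node:
            node = node[phones[j]]
            j += 1
            if "#" in node:
                last = (j, node["#"])
        if last is None:
            raise ValueError(f"no word in the lexicon matches at position {i}")
        i, word = last
        words.append(word)
    return words
```

With the slide's two-word lexicon, the phone string "b ah b l ay k s" segments into "Bob likes". A real decoder does the same kind of constrained search, but jointly with the acoustic scores and the language model, over the composed HCLG graph.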
24
Data collection
25
Data collection
● Speech samples taken from the field.
● Manual transcriptions:
– Speaker features: gender, native, ...
– Anomalies in the pronunciation.
– Noises in the recording.
26
Manual transcriptions
● 600k recordings.
● Uncompressed format: 8 kHz and 16 kHz.
● 286,020 different speakers.

         Percentage (%)
Native   87.7
Male     83.3
Female   8.5
Child    8.2
27
Manual transcriptions
● Percentage of records without anomalies: 7.4%

Anomalies                 Percentage (%)
side_speech               14.4
speech-in-noise           71.5
Indistinguishable         3.7
mouth_noise               3.6
breath_noise              6.3
Irregular pronunciations  5.3
Hesitations               0.5
Fragments                 5.5
Transient noise           14.0
Foreign words             0.1
28
Manual transcriptions: examples
● марциальные воды (male, native)
● *трёx#пруд#ньій* (male, native, speech-in-noise)
● [side_speech] чкалова (male, native, speech-in-noise, bad-audio)
29
VisualQA
30
Experiments
31
Grapheme-2-phoneme
● Sequitur:
– Based on joint-sequence models.
– Accuracy => 2.09% phoneme error rate.
● Phonetisaurus:
– WFST-based.
– Accuracy => 1.04% phoneme error rate.
● Special treatment for Latin words:
– G2P trained on a transliterated version of the Russian pronunciation (for example: whatsapp => уотсап).
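Phoneme error rate, the metric used to compare Sequitur and Phonetisaurus above, is the edit (Levenshtein) distance between the predicted and reference phoneme sequences, divided by the reference length. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

The same computation over word sequences instead of phoneme sequences gives the word error rate (WER) reported in the results section.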
32
Noise models
33
Experiments: acoustic model vs. language model
34
Experiments: number of Gaussians
35
Results
36
Users: Navigator
37
Results: relative word error rate
● Results relative to our WER in each experiment (in red, experiments in which our system is outperformed):

             Maps    Navigation   General search
Yandex-GMM   1       1            1
3rd Party    44.6%   31.8%        37.3%
Competitor   1.9%    -9.7%        -23.4%

             General search
Yandex-DNN   1
Competitor   6.6%
38
Thanks for your attention!
39
Fran Campillo
Senior Software Engineer, PhD
Yandex Speech Group
[email protected]