Introduction to Spoken Language Systems
We are a team of scientists and developers working on audio, speech and language solutions that will revolutionize how customers interact with products and services.
Speech User Interface Flow
User -> (speech) -> ASR -> (words) -> NLU -> (intents) -> Skills -> (actions) -> TTS -> (speech output) -> User
Component | Input | Output | Example
Automatic Speech Recognition (ASR) | Speech | Text (1-best or top alternatives) | “Play Two Steps Behind by Def Leppard”
Natural Language Understanding (NLU) | Text | Intent type and “slots” | Intent: PlayMusicIntent; Slots: Artist = Def Leppard, Song = Two Steps Behind
Skills – internal and external services | Intent & slots | Text and/or actions | Play <URL>; Say “Playing Two Steps Behind by Def Leppard”
Text-to-Speech (TTS) | Text | Speech | “Playing Two Steps Behind by Def Leppard”
How did we get here
• 1930s: Bell Labs vocoder work, VODER
• 1952: Bell Labs single-digit ASR
• 1950s: OVE and PAT (formant synthesis)
• 1960s: single-vowel/phoneme recognition
• 1960s: ASY (articulatory synthesis)
• 1969: Bell Labs de-funds ASR
Human-like TTS
• 1982: SAM, first software synthesis program
• 199x: First diphone synthesis
• 1990s: Unit Selection
• 2005: IVONA TTS
• 2000s: New HMM systems
• 2010+: DNN-based TTS
TTS evolution
[Chart: TTS approaches plotted by Naturalness and Controllability]
Approaches shown: Formant TTS, Articulatory TTS, Diphone TTS, Unit Selection, SPSS TTS (HMM-based), Hybrid TTS (US guided), SPSS TTS (Deep Modeling-based), Unit Selection (unlimited units in the cloud?), WaveNet, Hybrid TTS (blending)
TTS development
• Goal: Convert text into intelligible, accurate and natural speech
• Challenges
– Homographs: words written identically that have different pronunciations
  • “I live in Seattle” vs. “Reporting live from Seattle”
– Text normalization: disambiguation of abbreviations, acronyms, units
  • ‘m’ expanded as ‘minutes’ or ‘miles’ or ‘meters’ or even ‘medium’
– Prosody requires understanding of semantics
– Foreign words, proper names, slang, etc.
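TTS operation
Text -> Text normalization -> Grapheme-to-phoneme conversion -> Waveform generation -> Speech
Example:
• Text: She has 20$ in her pocket.
• After text normalization: she has twenty dollars in her pocket
• After grapheme-to-phoneme conversion: ˈʃi ˈhæz ˈtwɛn.ti ˈdɑ.ɫəɹz ˈɪn ˈhɝɹ ˈpɑ.kət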
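The text normalization step in the example above can be sketched in a few lines of Python. This is only a minimal illustration, assuming integer amounts from 0 to 99 and a dollar sign written before or after the number; the names `number_to_words` and `normalize` are hypothetical, and a production normalizer would need far more rules (abbreviations, units, dates, and so on).

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99 (assumed limit of this sketch)."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Expand currency amounts and bare numbers, then lowercase and strip punctuation."""
    def expand_currency(match: re.Match) -> str:
        amount = int(match.group(1) or match.group(2))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # "20$" or "$20" becomes "twenty dollars"
    text = re.sub(r"(\d+)\$|\$(\d+)", expand_currency, text)
    # any remaining bare number is spelled out
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    # drop punctuation except apostrophes and hyphens
    text = re.sub(r"[^\w\s'-]", "", text)
    return " ".join(text.lower().split())


print(normalize("She has 20$ in her pocket."))
```

The output of this step would then be passed to grapheme-to-phoneme conversion, which is not covered by the sketch.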
TTS Backend
[Diagram components]
• Input phoneme sequence: p|l|iy1|z| ae1|d| …
• Parameters prediction (HMM, DNN)
• Speech Generation
• Speech Inventory
• Unit Selection (Viterbi Search)
• Speech Concatenation
• Hybrid TTS
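Unit Selection – Viterbi search
[Diagram: lattice of candidate units]
• Target sequence: #, #-t, t-uw, uw-#, #
• Candidate units: #-t1, #-t2, #-t3; #-uw1, #-uw2, #-uw3, #-uw4; uw-#1, uw-#2, uw-#3
• Costs: target cost, concatenation cost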
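The search over the lattice can be sketched as a small dynamic program that minimizes the sum of target and concatenation costs. This is a toy illustration, not the system's actual cost functions: the `Unit` class, the pitch features and all numeric values are invented for the example, and real systems combine many acoustic and linguistic features in both costs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Unit:
    name: str            # candidate label, e.g. "#-t2"
    start_pitch: float   # hypothetical feature at the unit's left edge
    end_pitch: float     # hypothetical feature at the unit's right edge


def target_cost(unit: Unit, target_pitch: float) -> float:
    """How far the unit's mean pitch is from the requested pitch."""
    return abs((unit.start_pitch + unit.end_pitch) / 2 - target_pitch)


def concat_cost(left: Unit, right: Unit) -> float:
    """Pitch mismatch at the join between two consecutive units."""
    return abs(left.end_pitch - right.start_pitch)


def select_units(candidates: List[List[Unit]],
                 target_pitches: List[float]) -> Tuple[List[Unit], float]:
    """Viterbi search: cheapest path through the candidate lattice."""
    # best[i][j] = (cost of the best path ending in candidates[i][j], backpointer)
    best = [[(target_cost(u, target_pitches[0]), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        column = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            column.append((cost + target_cost(unit, target_pitches[i]), back))
        best.append(column)

    # Backtrack from the cheapest final state.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Invented lattice for the diphone sequence #-t, t-uw, uw-#.
lattice = [
    [Unit("#-t1", 100, 110), Unit("#-t2", 120, 125), Unit("#-t3", 100, 140)],
    [Unit("t-uw1", 112, 118), Unit("t-uw2", 140, 150)],
    [Unit("uw-#1", 119, 100), Unit("uw-#2", 150, 90)],
]
path, total = select_units(lattice, [105, 115, 110])
print([u.name for u in path], total)
```

Because each column only needs the best cost of the previous column, the search is linear in the number of positions and quadratic in the number of candidates per position.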
• an error occurred while searching for your route
• because snaps weren't all so obedient anymore,
• now we say apple again. and we say apple,
• general electric soars today. information on general electric
• quick breads, zucchini, holiday, crockpot, cake,
• so are you still keeping tabs on your old team,
• that weighs more than four tons, disrupts the…
An apple a day, keeps …
Building a TTS system
• Text normalization, handling non-words: rules
• TTS language model: lexicons
• Text analysis, POS-tagging, prosody: NLP
• Dealing with ambiguous inputs
• SSML processing, PLS lexicons
• Voices!
Adaptive ASR
• 1986/92: Sphinx / Sphinx II – HMM + n-grams
• 1990s: commercial ASR systems (e.g. Dragon)
• 2000s: HMM + neural net
• 2010s: HMM + DNN / LSTM net
ASR development
• Goal: Convert spoken audio into text
• Challenges
– Noisy environment, e.g. in far-field recognition we have room reverberation, ambient noise, background speech
– Large vocabulary, high perplexity domains, e.g. music
– Difficult to predict spoken forms for catalog entries and their associated pronunciations, e.g. artist names such as U2, P!nk
– Acoustically confusable strings (“open the pod bay doors” / “open the pot bait oars”)
ASR operation
Speech -> Spectrum analysis -> Phoneme sequence -> De-normalization -> Text
Example:
• Phoneme sequence: ˈʃi ˈhæz ˈtwɛn.ti ˈdɑ.ɫəɹz ˈɪn ˈhɝɹ ˈpɑ.kət
• Recognized words: she has twenty dollars in her pocket
• After de-normalization: She has 20$ in her pocket.
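Example: “Where’s my Kindle?”
[Diagram: from audio features to words]
• Feature vectors (one row per frame):
  25 17 6 24 … 41
  31 14 11 15 … 38
  32 11 13 14 … 26
  21 15 14 8 … 19
  etc.
• Phonemes: WEH RZ M AY
• Pronunciation lexicon:
  where: WEHR
  where’s: WEHRZ
  werewolf: WEHRWUHLF
  Aardvark: AXRDVAXRK
  Kindle: KIHNDUL
  my: MAY
• Word hypotheses: where, where’s, is, my, Mike, in, dull, kin, Kindle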
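The lexicon lookup in this example can be illustrated with a toy search that splits a phoneme string into every word sequence the lexicon allows. This is a simplification under stated assumptions: it uses the pronunciation strings exactly as written on the slide, matches them exactly, and the function name `segmentations` is hypothetical. A real recognizer scores competing hypotheses with acoustic and language models instead of requiring exact matches.

```python
from typing import Dict, List

# Pronunciations copied from the slide, kept as unsegmented strings.
LEXICON: Dict[str, str] = {
    "where": "WEHR",
    "where's": "WEHRZ",
    "werewolf": "WEHRWUHLF",
    "Aardvark": "AXRDVAXRK",
    "Kindle": "KIHNDUL",
    "my": "MAY",
}


def segmentations(phonemes: str) -> List[List[str]]:
    """Return every way to split the phoneme string into lexicon words."""
    if not phonemes:
        return [[]]  # one valid segmentation of nothing: the empty word list
    results = []
    for word, pronunciation in LEXICON.items():
        if phonemes.startswith(pronunciation):
            for rest in segmentations(phonemes[len(pronunciation):]):
                results.append([word] + rest)
    return results


print(segmentations("WEHRZMAYKIHNDUL"))
```

Note that "where" also matches the start of the input, but the remainder then begins with "Z", which no lexicon entry covers, so that branch is discarded.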
Statistical Conversational ASR
• Use acoustic and linguistic ML models
• Input is audio, potentially from many microphones
• Initial source-specific models / processing
• Intermediate output is a sequence of potential phonemes / diphones / triphones
• Final output is a set of potential text transcriptions
• Requires lots of memory
Wake Word Engine
• Low-power, continuously listening device
• A trigger word (‘Alexa’)
• Lack of context – prone to noise
• Need to run on local device
• Needs highly optimized code & real-time processing
Building an ASR engine
• Statistical ASR models – build & combine
• Personalization
• Constrained vs. free-form input (dictation)
• Text de-normalization, handling entity names
• Domain-specific recognition
• Identify errors, use as feedback
Understanding the language
• 1950: Turing test defined
• 1964-72: STUDENT, ELIZA, PARRY
• 1976: Colossal Cave, Zork – interactive fiction
• 1990s: Statistical ML models
• 2006: Watson, first bot to win Jeopardy
• 2011: Siri
• 2014: Alexa, Cortana
NLU Development
• Goal: understand the spoken intent and associated entities
• Challenges
– Semantic representation for language
– Cross-domain intent recognition
  • e.g. “Play remind me” vs. “Remind me to go to the play”
– Robustness to ASR errors and ambiguity
  • “Play Rolling Stone” (Bob Dylan) vs. “Play Rolling Stones”
– User correction in context, “No, the rolling stones”
– Need to get top choice correct since there is no display
Approach for NLU
[Diagram components]
• Input: Text
• Named Entity Recognition (NER), using NER models
• Intent Classification (IC), using IC models
• Entity Resolution
• Ranking models
• Catalogs
• Finite Pattern Matching
• Output: Interpretations
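The pattern-matching route to an interpretation can be sketched with a few regular expressions that yield an intent and its slots. This is a minimal illustration only: `PlayMusicIntent`, `Artist` and `Song` come from the earlier component table, while `SetReminderIntent`, the `Task` slot, the patterns themselves and the `interpret` function are assumptions made for the example. Statistical NER and IC models would replace the hand-written patterns in practice.

```python
import re
from typing import Dict, Optional

# Patterns are tried in order; the first match wins.
PATTERNS = [
    ("PlayMusicIntent",
     re.compile(r"^play (?P<Song>.+) by (?P<Artist>.+)$", re.IGNORECASE)),
    ("PlayMusicIntent",
     re.compile(r"^play (?P<Song>.+)$", re.IGNORECASE)),
    ("SetReminderIntent",  # hypothetical intent name
     re.compile(r"^remind me to (?P<Task>.+)$", re.IGNORECASE)),
]


def interpret(text: str) -> Optional[Dict[str, object]]:
    """Return the first matching intent with its slots, or None if nothing matches."""
    text = text.strip().rstrip(".!?")
    for intent, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return {"intent": intent, "slots": match.groupdict()}
    return None


print(interpret("Play Two Steps Behind by Def Leppard"))
print(interpret("Play remind me"))
print(interpret("Remind me to go to the play"))
```

The last two calls mirror the cross-domain example from the slides: the word "play" appears in both utterances, but only the position it occupies in the pattern decides the intent.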
Personal Assistant components
• Language model
• Skills / Intents catalogue
• Entity catalogs & ontology
• Knowledge database(s)
• External data sources integration
• Personalization
…and some others
• NLG – Natural Language Generation
• NLP – Natural Language Processing
• Data Mining – building the knowledge base
• Compression techniques
• Audio processing and media streaming
• Distributed systems
• …
AI. The final frontier?
• Deep learning everywhere
– Wake word
– Speech Recognition
– Language Understanding
– Text-to-Speech
• Requires a lot of data to train deep neural networks (DNNs) and other ML models
THANK YOU!
How to start?
Courses:
https://www.coursera.org/learn/machine-learning
https://www.coursera.org/learn/nlp
Tools:
http://cmusphinx.sourceforge.net/
http://festvox.org/
http://kaldi-asr.org/
http://mallet.cs.umass.edu/
https://deeplearning4j.org/
https://www.tensorflow.org/