speech recognition and removal of disfluencies
Automatic Detection of Sentence Boundaries and Disfluencies in Speech Recognition Techniques
• Ankit Sharma - 1MJ10EC013
Speech Processing
Speech is one of the most intriguing signals that humans work with every day.
• Purpose of speech processing:
– To understand speech as a means of communication
– To represent speech for transmission and reproduction
– To analyze speech for automatic recognition and extraction of information
– To discover some physiological characteristics of the talker
Automatic speech recognition
• What is the task?
• What are the main difficulties?
• How is it approached?
• How good is it?
• How much better could it be?
[Figure: from a text (concept) to speech via air flow. A sound source (voiced: pulse, unvoiced: noise) is shaped by frequency transfer characteristics and fundamental frequency; speech information modulates the carrier wave through fundamental frequency, voiced/unvoiced excitation, and frequency transfer characteristics.]
Speech production process in humans
How might computers do it?
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
[Figure: acoustic waveform and acoustic signal]
Speech recognition
Microsoft Speech Recognition – Windows 7
Digitization
• Analog to digital conversion: sampling and quantizing
• Use filters to measure energy levels at various points on the frequency spectrum
• Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
• E.g. high-frequency sounds are less informative, so they can be sampled using a broader bandwidth (log scale)
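The log-scale point above can be sketched numerically: log-spaced analysis bands give high frequencies broader bandwidths than low frequencies. The band edges here are illustrative values, not taken from the slides.

```python
import numpy as np

# Eight log-spaced analysis bands from 100 Hz to 8 kHz (illustrative values).
edges = np.geomspace(100, 8000, num=9)
widths = np.diff(edges)

# Bandwidth grows with frequency: low bands are narrow (fine resolution),
# high bands are broad, matching the less informative high-frequency region.
print(np.all(widths[1:] > widths[:-1]))  # True
```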
Separating speech from background noise
• Noise-cancelling microphones
– Two mics, one facing the speaker, the other facing away
– Ambient noise is roughly the same for both mics
• Knowing which bits of the signal relate to speech
– Spectrograph analysis
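A toy numeric sketch of the two-microphone idea, under the idealized assumption that the ambient noise is identical at both mics (real mics differ in gain and delay, so practical systems use adaptive filtering rather than plain subtraction):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=1000)                      # ambient noise, shared by both mics
speech = np.sin(np.linspace(0, 20 * np.pi, 1000))  # toy "speech" signal

front = speech + noise    # mic facing the speaker: speech plus ambient noise
rear = noise              # mic facing away: ambient noise only
recovered = front - rear  # subtracting cancels the shared noise

print(np.allclose(recovered, speech))  # True
```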
Variability in individuals’ speech
• Variation among speakers due to
– Vocal range (f0 and pitch range – see later)
– Voice quality (growl, whisper, physiological elements such as nasality, adenoidality, etc.)
– ACCENT!!! (especially vowel systems, but also consonants, allophones, etc.)
• Variation within speakers due to
– Health, emotional state
– Ambient conditions
– Speech style: formal read vs. spontaneous
Detection of Sentence Boundaries and Disfluencies
Divide speech into frames
• Speech is a non-stationary signal
• … but it can be assumed to be quasi-stationary
• Divide speech into short-time frames (e.g., 5 ms shift, 25 ms length)
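The framing step can be sketched as follows, assuming a 16 kHz mono signal (the function name is illustrative):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=5):
    """Split a 1-D signal into overlapping short-time frames."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop = sample_rate * shift_ms // 1000        # 80 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

# One second of audio yields 1 + (16000 - 400) // 80 = 196 frames of 400 samples.
print(frame_signal(np.zeros(16000)).shape)  # (196, 400)
```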
Approaches to ASR
• Template based
• Neural network based
• Statistics based
Statistics-based approach
• Collect a large corpus of transcribed speech recordings
• Train the computer to learn the correspondences (“machine learning”)
• At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one
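“Pick the statistically most likely one” usually means combining an acoustic-model score with a language-model score. A toy sketch with invented log-probabilities for two competing transcriptions:

```python
# Hypothetical log-probabilities for two competing transcriptions:
# acoustic score log P(A|W) and language-model score log P(W).
acoustic_logp = {
    "what are the fares": -11.2,
    "watt are the fairs": -10.9,  # slightly better acoustic match...
}
lm_logp = {
    "what are the fares": -7.1,
    "watt are the fairs": -15.8,  # ...but far less likely as English
}

def best_hypothesis(hyps):
    # argmax over W of log P(A|W) + log P(W)
    return max(hyps, key=lambda w: acoustic_logp[w] + lm_logp[w])

print(best_hypothesis(acoustic_logp))  # "what are the fares"
```

Even though the wrong hypothesis matches the audio slightly better, the language model tips the combined score toward the sensible reading.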
What is a corpus?
A corpus can be defined as a collection of texts, assumed to be representative of a given language, put together so that it can be used for linguistic analysis. Usually the assumption is that the language stored in a corpus is naturally occurring: gathered according to explicit design criteria, with a specific purpose in mind, and with a claim to represent natural chunks of language selected according to a specific typology.
“Nowadays the term ‘corpus’ nearly always implies the additional feature of ‘machine-readable’.”
Statistics-based approach: acoustic and lexical models
• Analyse training data in terms of relevant features
• Learn different possibilities from a large amount of data
– different phone sequences for a given word
– different combinations of elements of the speech signal for a given phone/phoneme
• Combine these into a Hidden Markov Model expressing the probabilities
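A minimal Viterbi sketch over a toy 3-state left-to-right HMM, with all probabilities invented for illustration; real recognizers decode over continuous acoustic features rather than a two-symbol alphabet:

```python
import numpy as np

# Toy 3-state left-to-right HMM with a discrete observation alphabet {0, 1}.
# All probabilities are invented; 1e-12 avoids log(0) for forbidden moves.
trans = np.log(np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.7, 0.3],
                         [0.0, 0.0, 1.0]]) + 1e-12)
emit = np.log(np.array([[0.9, 0.1],
                        [0.2, 0.8],
                        [0.5, 0.5]]))
init = np.log(np.array([1.0, 0.0, 0.0]) + 1e-12)

def viterbi(obs):
    """Most likely state sequence for a discrete observation sequence."""
    dp = init + emit[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = dp[:, None] + trans        # scores[i, j]: come from i into j
        back.append(scores.argmax(axis=0))  # best predecessor for each state
        dp = scores.max(axis=0) + emit[:, o]
    path = [int(dp.argmax())]
    for b in reversed(back):                # trace the best path backwards
        path.append(int(b[path[-1]]))
    return path[::-1]

print(viterbi([0, 0, 1, 1, 0]))
```

Because the transition matrix is left-to-right, the decoded state path can only move forward through the model, just as a phone model moves through a word.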
[Figure: HMM-based speech synthesis system (HTS). Training part: SPEECH DATABASE → spectral parameter extraction and excitation parameter extraction → training HMMs (with labels) → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters drive excitation generation, spectral parameters drive the synthesis filter → SYNTHESIZED SPEECH.]
HMMs for some words
• Identify individual phonemes
• Identify words
• Identify sentence structure and/or meaning
Performance errors
Performance “errors” include:
• Non-speech sounds
• Hesitations
• False starts, repetitions
Filtering implies handling at the syntactic level or above.
Some disfluencies are deliberate and have pragmatic effect – this is not something we can handle in the near future.
Disfluencies
Disfluencies: standard terminology (Levelt)
• Reparandum: the thing repaired
• Interruption point (IP): where the speaker breaks off
• Editing phase (edit terms): uh, I mean, you know
• Repair: the fluent continuation
Prosodic characteristics of disfluencies
• Fragments are good cues to disfluencies
• Prosody:
– Pause duration is shorter in disfluent silence than in fluent silence
– F0 increases from the end of the reparandum to the beginning of the repair, but only by a minor change
– Repair interval offsets have a minor prosodic phrase boundary, even in the middle of an NP: Show me all n- | round-trip flights | from Pittsburgh | to Atlanta
Syntactic characteristics of disfluencies
• The repair often has the same structure as the reparandum
• Both are noun phrases (NPs) in this example
• So if we could automatically find the IP, we could find and correct the reparandum!
Disfluencies in language modeling
Should we “clean up” disfluencies before training the LM (i.e. skip over disfluencies)?
• Filled pauses: Does United offer any [uh] one-way fares?
• Repetitions: What what are the fares?
• Deletions: Fly to Boston from Boston
• Fragments (we’ll come back to these): I want fl- flights to Boston.
Detection of disfluencies
Decision tree at the wi-wj boundary, using features:
• Pause duration
• Word fragments
• Filled pauses
• Energy peak within wi
• Amplitude difference between wi and wj
• F0 of wi
• F0 differences
• Whether wi is accented
Results: 78% recall / 89.2% precision
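A hand-written toy “decision tree” over a few of the boundary features listed above; the thresholds and the function name are invented for illustration, whereas the actual system learns the tree from labeled data:

```python
def is_disfluent_boundary(pause_ms, is_fragment, is_filled_pause, f0_delta_hz):
    """Toy decision rules for a word boundary wi-wj (invented thresholds)."""
    if is_fragment or is_filled_pause:        # strong lexical cues
        return True
    if pause_ms < 50 and f0_delta_hz > 20:    # short disfluent pause plus an F0 jump
        return True                           # looks like an interruption point
    return False

print(is_disfluent_boundary(30, False, False, 35))   # True
print(is_disfluent_boundary(200, False, False, 5))   # False
```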
Recent work: EARS Metadata Evaluation (MDE)
• Sentence-like Unit (SU) detection:
– find end points of SUs
– detect subtype (question, statement, backchannel)
• Edit word detection: find all words in the reparandum (words that will be removed)
• Filler word detection
– Filled pauses (uh, um)
– Discourse markers (you know, like, so)
– Editing terms (I mean)
• Interruption point detection
(Liu et al. 2003)
Kinds of disfluencies
• Repetitions: I * I like it
• Revisions: We * I like it
• Restarts (false starts): It’s also * I like it
MDE transcription
Conventions: ./ for statement SU boundaries, <> for fillers, [] for edit words, * for the IP (interruption point) inside edits
Example: And <uh> <you know> wash your clothes wherever you are ./ and [ you ] * you really get used to the outdoors ./
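Given those conventions, stripping the annotations to recover a cleaned transcript can be sketched with regular expressions (the function name is illustrative):

```python
import re

def clean_mde(transcript):
    """Remove fillers <...>, edit regions [...] with their IP '*',
    and SU boundary marks './' from an MDE-annotated transcript."""
    text = re.sub(r"<[^>]*>", " ", transcript)      # drop fillers
    text = re.sub(r"\[[^\]]*\]\s*\*?", " ", text)   # drop edit words and their IP
    text = text.replace("*", " ").replace("./", " ")
    return " ".join(text.split())                   # collapse whitespace

s = ("And <uh> <you know> wash your clothes wherever you are ./ "
     "and [ you ] * you really get used to the outdoors ./")
print(clean_mde(s))
# -> "And wash your clothes wherever you are and you really get used to the outdoors"
```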
Recent works to improve quality
• Vocoding
– MELP-style / CELP-style excitation
– LF model
– Sinusoidal models
• Acoustic model
– Segment models, trajectory models
– Model combination (product of experts)
– Minimum generation error training
– Bayesian modeling
• Oversmoothing
– Pre- & post-filtering
– Improvements of GV
– Hybrid approaches
• & more…
Other challenging topics
• Non-professional speakers
– AVM + adaptation (CSTR)
• Too little speech data
– VTLN-based rapid speaker adaptation (Titech, IDIAP)
• Noisy recordings
– Spectral subtraction & AVM + adaptation (CSTR)
• No labels
– Un-/semi-supervised voice building (CSTR, NICT, CMU, Toshiba)
• Insufficient knowledge of the language or accent
– Letter (grapheme)-based synthesis (CSTR)
– No prosodic contexts (CSTR, Titech)
• Wrong language
– Cross-lingual speaker adaptation (MSRA, EMIME)
– Speaker & language adaptive training (Toshiba)
THANK YOU!