LING 439/539: Statistical Methods in Speech and Language Processing
1
LING 439/539: Statistical Methods in Speech and Language Processing
Ying Lin
Department of Linguistics
University of Arizona
2
Welcome!
Get the syllabus
Fill out and return the information sheet
Email: [email protected]
Office: Douglass 224
OH: MW 2:00-3:00, by appointment (also teaching another undergrad class)
Course webpage: see syllabus
Listserv coming soon.
3
438/538 and 439/539
LING 438/538 (Computational Linguistics):
  Symbolic representations (mostly syntax), e.g. FSA, CFG.
  Focus on logic
  Simple probabilistic models, e.g. N-grams.
4
438/538 and 439/539
This class complements 438/538:
  Numerical representations (speech signals): need digital signal processing
  Focus on statistics/learning
  More sophisticated probabilistic models, e.g. HMM, PCFG
5
Main reference texts (!)
Huang, Acero and Hon (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice-Hall.
Manning and Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press.
Rabiner and Juang (1993). Fundamentals of Speech Recognition. Prentice-Hall.
Duda, Hart and Stork (2001). Pattern Classification (2nd ed). John Wiley & Sons.
Rabiner and Schafer (1978). Digital Processing of Speech Signals. Prentice-Hall.
Hastie, Tibshirani and Friedman (2001). The Elements of Statistical Learning. Springer.
6
Guideline for course reading
There is no single book that covers all of our materials.
Most books are written for either an EE or a CS audience only.
A few chapters are selected from each book (see the reading list).
Lecture notes will summarize the reading.
Expect a rough ride for the first time: feedback is greatly appreciated!
7
Three skills for this class
1. Linguistics: understanding the source of particular patterns.
2. Math/Statistics: underlying principles of the model.
3. Programming: implementation
This class emphasizes 2. Reasons:
  Models are based on simple structures
  Programming skills require much practice
8
What is the “statistical approach”?
Narrow: uses statistical principles, i.e. is based on the probability calculus or other theories of inductive inference
  Compare logic: deductive inference
Broad: any work that uses a quantitative measure of success
Relevant to both language engineering and linguistic science
9
What is the “statistical approach”?
Narrow: uses statistical principles, i.e. is based on the probability calculus or other theories of inductive inference
  Compare logic: deductive inference
Broad: any work that uses a quantitative measure of success
Relevant to both language engineering and linguistic science
This course
10
Language engineering: speech recognition
Tasks: increasing level of difficulty
[Figure: tasks of increasing difficulty; label: Word Error Rate]
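Word error rate, the measure named on this slide, is conventionally the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that computation using word-level edit distance; the function name `word_error_rate` and the example sentences are chosen here for illustration and are not from the lecture.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative call: the hypothesis has more words than the reference,
# so the rate can exceed 1.0.
wer = word_error_rate("i recognize speech", "i wreck a nice beach")
print(f"WER = {wer:.3f}")
```

Because insertions are counted against the reference length, the rate is not bounded by 1, which is worth keeping in mind when comparing tasks.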
11
A brief history of speech recognition
1950’s: U.S. government started funding research on automatic recognition of speech
1960-70’s: Isolated words, digit strings
  Debate: rules vs. statistics
  Dynamic time warping
1980-now: continuous speech, speech understanding, spoken dialog
  Hidden Markov model dominates
12
Why didn’t the rules work?
Completely bottom-up approach:
  Rules are hand-coded by experts
Problem: variability in speech
  Sophisticated, symbolic rules are not flexible enough to handle continuous speech
[Figure: “How are you?”, phonetic rules, phonological rules]
13
The rise of statistical methods in speech
Initial solution: hire many linguists to continually improve the rule system
  This turned out to be costly and slow, failing the high expectations
Advantages of statistical models:
  Allow training on different data: flexible, scalable
  Computing power is much cheaper than experts
  Drive the move to less and less constrained tasks
Bitterness: “Every time I fire a linguist, the performance of the speech recognizer goes up” -- F. Jelinek (IBM)
14
The rise of statistics in NLP
Very similar scenarios also happened in NLP:
  E.g. tagging, parsing, machine translation
  “Old” NLP: deductive systems, hand-coded
  “New” NLP: broad-coverage, corpus-based, emphasizes training and evaluation
Speech is now merging with NLP
  Many tools originated in speech, then got copied to NLP
  New tasks keep emerging: web as an (unstructured) data source
15
Basic architecture of today’s ASR system
[Figure: block diagram]
  Audio speech -> feature extraction -> X
  Acoustic modeling: likelihoods p(X|M1), p(X|M2)
  Language model: p(M1), p(M2)
  Scoring: rank the candidates -> ANSWER
  Model parameters trained offline: M1 = “I recognize speech”, M2 = “I wreck a nice beach”, …
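The scoring and ranking step in this architecture can be illustrated with a small sketch: each candidate M gets an acoustic likelihood p(X|M) and a language model probability p(M), and the candidates are ranked by the product, computed here as a sum of logs. All numbers below are made up for illustration; in a real system they would come from the trained acoustic and language models.

```python
import math

# Hypothetical log-likelihoods log p(X|M) from an acoustic model.
acoustic_loglik = {
    "I recognize speech": -120.0,
    "I wreck a nice beach": -118.5,
}

# Hypothetical prior probabilities p(M) from a language model.
lm_prob = {
    "I recognize speech": 1e-6,
    "I wreck a nice beach": 1e-9,
}


def rank_hypotheses(acoustic_loglik, lm_prob):
    """Return (hypothesis, log score) pairs sorted from best to worst.

    The score is log p(X|M) + log p(M), i.e. the log of p(X|M) p(M).
    """
    scored = [
        (m, loglik + math.log(lm_prob[m]))
        for m, loglik in acoustic_loglik.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_hypotheses(acoustic_loglik, lm_prob)
best_hypothesis = ranking[0][0]
for hypothesis, score in ranking:
    print(f"{score:10.3f}  {hypothesis}")
```

With these invented numbers the acoustically slightly better candidate loses once the language model is taken into account, which is the point of combining the two scores.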
16
Component 1: signal processing / feature extraction
First third of the course (also useful for understanding synthesis)
17
Examples of some common features
18
Component 2: Acoustic models
Mixture of Gaussians: p(o_t | q_i) = sum_k c_ik N(o_t; mu_ik, Sigma_ik)
Dimension reduction: principal component analysis, linear discriminant analysis, parameter tying
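A mixture-of-Gaussians acoustic model scores an observation o_t in state q_i as a weighted sum of Gaussian densities. Here is a minimal sketch of that computation, assuming diagonal covariance matrices (a simplification chosen for this example, not something the slide states); the parameter values and function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (per-dimension variances)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_likelihood(x, weights, means, variances):
    """p(x) = sum_k weights[k] * N(x; means[k], diag(variances[k]))."""
    log_terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the sum stable when individual densities are tiny.
    top = max(log_terms)
    return math.exp(top) * sum(math.exp(t - top) for t in log_terms)


# Hypothetical two-component mixture over 2-dimensional feature vectors.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]

observation = [0.2, -0.1]
print(gmm_likelihood(observation, weights, means, variances))
```

Working in the log domain inside the function is a common design choice because high-dimensional feature vectors produce very small densities.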
19
Component 3: Pronunciation modeling
Model for different pronunciations of “you” in continuous speech
Other types of units: triphones, syllables
[Figure: pronunciation network from start to end over the units j, ou, a]
Each unit is an HMM
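To make the idea of a pronunciation network built from HMM units concrete, here is a minimal sketch of the forward algorithm on a toy model of “you”. The topology (j followed by either ou or a), the discrete observation symbols and every probability are assumptions made for this example; the lecture figure does not give them.

```python
# Hypothetical transition probabilities; "end" marks leaving the word.
trans = {
    "j": {"j": 0.5, "ou": 0.3, "a": 0.2},
    "ou": {"ou": 0.6, "end": 0.4},
    "a": {"a": 0.6, "end": 0.4},
}
start = {"j": 1.0}

# Hypothetical emission probabilities over three quantized acoustic symbols.
emit = {
    "j": {"J": 0.8, "U": 0.1, "A": 0.1},
    "ou": {"J": 0.1, "U": 0.8, "A": 0.1},
    "a": {"J": 0.1, "U": 0.1, "A": 0.8},
}


def forward_likelihood(observations, start, trans, emit):
    """Probability of the observation sequence, summed over all state paths."""
    if not observations:
        raise ValueError("need at least one observation")
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: p * emit[s][observations[0]] for s, p in start.items()}
    for obs in observations[1:]:
        nxt = {}
        for prev_state, prev_prob in alpha.items():
            for state, t in trans[prev_state].items():
                if state == "end":
                    continue
                nxt[state] = nxt.get(state, 0.0) + prev_prob * t * emit[state][obs]
        alpha = nxt
    # Require the path to leave the word after the last observation.
    return sum(p * trans[s].get("end", 0.0) for s, p in alpha.items())


print(forward_likelihood(["J", "U"], start, trans, emit))
print(forward_likelihood(["J", "J", "A", "A"], start, trans, emit))
```

Replacing the sum over previous states with a maximum would give the Viterbi score of the single best path instead of the total likelihood.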
20
Component 4: Language model
Provides the probability of word sequence models p(M), to combine with the acoustic model p(X|M)
Common: N-gram with smoothing and backoff; a very hard and specialized business
Just starting to integrate parsing
Fundamental equation:
  M* = argmax_M p(M|X) = argmax_M p(X|M) p(M)
  (by Bayes' rule; p(X) does not depend on M and can be dropped)
Search: Viterbi, beam, A*, N-best
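As a small illustration of an N-gram language model with smoothing, the sketch below estimates a bigram model with add-one (Laplace) smoothing and uses it to score word sequences p(M). Add-one smoothing is chosen only because it is the simplest scheme to show; practical systems use more refined smoothing and backoff. The training sentences and function names are invented for this example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams; return counts and the prediction vocabulary."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>", "<unk>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(words[1:])
        history_counts.update(words[:-1])
        bigram_counts.update(zip(words, words[1:]))
    return history_counts, bigram_counts, vocab


def bigram_prob(history, word, history_counts, bigram_counts, vocab):
    """Add-one smoothed estimate of p(word | history)."""
    return (bigram_counts[(history, word)] + 1) / (history_counts[history] + len(vocab))


def sentence_logprob(sentence, history_counts, bigram_counts, vocab):
    """Log probability of a sentence; unseen words are mapped to <unk>."""
    words = [w if w in vocab else "<unk>" for w in sentence.lower().split()]
    words = ["<s>"] + words + ["</s>"]
    return sum(
        math.log(bigram_prob(h, w, history_counts, bigram_counts, vocab))
        for h, w in zip(words, words[1:])
    )


# Tiny invented training corpus.
corpus = ["I recognize speech", "I recognize patterns", "speech is nice"]
history_counts, bigram_counts, vocab = train_bigram(corpus)

for candidate in ["I recognize speech", "I wreck a nice beach"]:
    print(candidate, sentence_logprob(candidate, history_counts, bigram_counts, vocab))
```

The log probability returned here plays the role of log p(M) and would be added to the acoustic log-likelihood log p(X|M) when ranking candidates.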
21
ASR: example of a generative model
Components 2+3+4 provide an instance of generative models
  Language M generates word sequences
  Word sequence generates pronunciation
  Pronunciation generates acoustic features
Unsupervised learning/training
  Maximum likelihood estimation
  Expectation-Maximization algorithm (different incarnations)
Main focus of this class
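One of the simplest incarnations of the Expectation-Maximization algorithm is maximum likelihood fitting of a Gaussian mixture without labels. The sketch below does this for a two-component mixture in one dimension; the data, starting values, iteration count and variance floor are all assumptions made for the example.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    # E-step: posterior probability of each component for each data point.
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])

    # M-step: re-estimate parameters from the soft assignments.
    new_w, new_m, new_v = [], [], []
    for k in range(len(weights)):
        nk = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
        new_w.append(nk / len(data))
        new_m.append(mean)
        new_v.append(max(var, 1e-6))  # small floor to avoid a collapsed component
    return new_w, new_m, new_v


# Invented unlabeled data drawn around two centers.
data = [-2.1, -1.9, -2.3, -1.7, 2.0, 2.2, 1.8, 2.4]
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

history = [log_likelihood(data, weights, means, variances)]
for _ in range(20):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))

print("means:", means)
print("log-likelihood: first", history[0], "last", history[-1])
```

Tracking the log-likelihood after every iteration is a useful check, since each EM step should not decrease it.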
22
Other models to look at:
Descriptive/maximum entropy models
  Started in vision, then copied to speech, then NLP
Discriminative models: directly using data to construct classifiers, with weak assumptions about the probability distribution
  Supervised learning, focus on the perspective of classification
  [Figure: input string -> (count) -> feature vector -> (classifier) -> output labels]
  “Machine learning approach to NLP”
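The pipeline on this slide, from input string through counted features to a classifier that outputs labels, can be sketched with a bag-of-words count vector and a perceptron. The perceptron is used here only as one simple discriminative classifier; the toy task (question versus statement), the training examples and the function names are all invented for illustration.

```python
from collections import Counter


def featurize(text):
    """Map an input string to a sparse vector of word counts."""
    return Counter(text.lower().split())


def predict(weights, feats):
    """Return +1 or -1 from the sign of the weighted feature sum."""
    score = sum(weights.get(f, 0.0) * count for f, count in feats.items())
    return 1 if score > 0 else -1


def train_perceptron(examples, epochs=10):
    """Update weights only on misclassified examples; stop once an epoch is error-free."""
    weights = {}
    for _ in range(epochs):
        mistakes = 0
        for text, label in examples:
            feats = featurize(text)
            if predict(weights, feats) != label:
                mistakes += 1
                for f, count in feats.items():
                    weights[f] = weights.get(f, 0.0) + label * count
        if mistakes == 0:
            break
    return weights


# Invented labeled data: +1 for questions, -1 for statements.
examples = [
    ("how are you", 1),
    ("where are you", 1),
    ("what is that", 1),
    ("you are here", -1),
    ("that is nice", -1),
    ("speech is hard", -1),
]

weights = train_perceptron(examples)
print(predict(weights, featurize("how is that")))
```

Note that nothing here models how the strings are generated; the classifier is fit directly to separate the labeled examples, which is the contrast with the generative approach.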
23
Problem solved?
No, improvements are mostly due to larger training sets and speed-ups
Driven by Moore’s law?
24
Challenges
Environment distortion (microphone, noise, cocktail party) breaks feature extraction
  Acoustic condition mismatch
Between- and within-speaker variability breaks the pronunciation modeling and acoustic modeling
Conversational speech breaks the language model
Understanding these problems is crucial for improving the performance of ASR
25
Dreaming
“2001: A Space Odyssey” (1968)
Dave: “Open the pod bay doors, HAL”
HAL9000: “I’m sorry Dave. I’m afraid I can’t do that.”
26
The reality, before the problem is solved
Speech is used as a user interface only when people can’t use their hands
  Driving a car (use speech to drive?)
  Device too small (cellphone)
  Customer service (who will tolerate touch tone?)
  Dictation (how many people actually use it?)
27
For next time:
We will start with signal processing
  Uses engineering math, including power series (and their convergence), trigonometric functions, integration, and the representation of complex numbers.
  If you have forgotten or do not know this material, please look for references and study it before class.