SpeechTEK August 22, 2007
Better Recognition by manipulation of ASR results
Generic concepts for post computation recognizer result components.
Emmett Coin, Industrial Poet
ejTalk, Inc. www.ejTalk.com
Who?
Emmett Coin, Industrial Poet
  Rugged solutions via compact and elegant techniques
  Focused on creating more powerful and richer dialog methods
ejTalk: Frontiers of Human-Computer conversation
  What does it take to “talk with the machine”?
  Can we make it meta?
What this talk is about
How applications typically use the recognition result
Why accuracy is not that important, BUT error rate is.
How some generic techniques can sometimes help reduce the effective recognition error rate.
How do most apps deal with recognition?
Specify a grammar (cfg or slm)
Specify a level of “confidence”
Wait for the recognizer to decide what happens (no result, bad, good)
Use the 1st nbest result when it is “good”
Leave all the errors and uncertainties to the dialog management level
Accuracy: confusing concept
95% accuracy is good, 97% is a little better … or is it?
  Think of roofing a house.
Do people accurately perceive the ratio of “correct” vs. “incorrect” recognition?
  Users hardly notice when you “get it right”. They expect it.
  When you get it wrong…
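One way to read the roofing analogy is that the user experiences the leaks, not the dry shingles. The sketch below shows the arithmetic under that reading; the helper name `error_reduction` is made up for illustration.

```python
def error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the original errors removed by an accuracy improvement."""
    old_errors = 1.0 - old_accuracy
    new_errors = 1.0 - new_accuracy
    return (old_errors - new_errors) / old_errors


# 95% -> 97% is a 2-point accuracy gain, but the error rate drops
# from 5 per 100 utterances to 3 per 100, i.e. roughly 40% fewer errors.
print(round(error_reduction(0.95, 0.97), 2))
```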
Confidence: What is it?
A sort of “closeness” of fit
  Acoustic scores
    How well it matches the expected sounds
  Language model scores
    How much work it took to find the phrase
  A splash of recognizer vendor voodoo
    How voice-like, admix of noise, etc.
All mixed together and reformed as a number between 0.0 and 1.0 (usually)
Confidence: How good is it?
Does it correlate with how a human would rank things?
Does it behave consistently?
  Long vs. short utterances?
  Different word groups?
What happens when you rely on it?
Can we add more to the model?
We already use
  Sounds – the Acoustic Model (AM)
  Words – the Language Model (LM)
We can add
  Meaning – the Semantic Model (SM)
  Rethinking
Strategies that humans use
Rejection
  Don’t hear repeated wrong utterances
  Also called “skip lists”
Acceptance
  Intentionally allowing only the likely utterances
  Aka “pass lists”
Anticipation
  Asking a question where the answer is known
  Sometimes called “hints”
Rejection (skip)
People and computers should not make the same mistake twice.
  Keep a list of confirmed mis-recs
  Remove those from the next recognition’s nbest list
But, beware the dark side ...
  …the Chinese finger puzzle.
  Remember: knowing what to reject is based on recognition too!
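A minimal sketch of the skip-list idea, assuming the nbest list arrives as (text, score) pairs sorted best-first. The names `Hypothesis` and `apply_skip_list` are illustrative and do not come from any recognizer API.

```python
from typing import List, Set, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, confidence score)


def apply_skip_list(nbest: List[Hypothesis], skip: Set[str]) -> List[Hypothesis]:
    """Drop hypotheses the user has already told us were wrong."""
    return [(text, score) for text, score in nbest if text.lower() not in skip]


# Hypothetical dialog: the user already said "no" to "Austin".
skip = {"austin"}
nbest = [("Austin", 0.62), ("Boston", 0.58), ("Houston", 0.31)]
filtered = apply_skip_list(nbest, skip)
best = filtered[0] if filtered else None  # the next candidate is now "Boston"
```

The "dark side" applies here: an entry only belongs in `skip` once the rejection itself was recognized reliably, since a misheard "no" would remove the correct answer.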
Acceptance (pass)
It is possible to specify the relative weights in the language model (grammar).
  But there is a danger. It is a little like cutting the legs on a chair to make it level.
  Hasty modifications will have unintended interactions.
Another way is to create a sieve.
  This has the advantage of not changing the balance of the model.
  The other parts that do not pass the sieve become a de facto garbage collector.
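The sieve can be sketched as a filter applied after recognition, so the grammar and its weights stay untouched. This assumes a best-first nbest of (text, score) pairs; `sieve` and `pass_list` are hypothetical names.

```python
from typing import List, Optional, Set, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, confidence score)


def sieve(nbest: List[Hypothesis], pass_list: Set[str]) -> Optional[Hypothesis]:
    """Return the best-ranked hypothesis that is on the pass list, else None."""
    for text, score in nbest:  # nbest is assumed to be sorted best-first
        if text.lower() in pass_list:
            return (text, score)
    return None  # nothing passed: treat the utterance as garbage


pass_list = {"checking", "savings"}
nbest = [("shaving", 0.55), ("savings", 0.52), ("checking", 0.10)]
result = sieve(nbest, pass_list)  # "shaving" is absorbed, "savings" passes
```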
Anticipation
Explicit
  e.g. confirming identity, amounts, etc.
Probabilistic
  Dialogs are journeys
  Some parts of the route are routine, predictable
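One possible way to apply such hints is to re-rank the nbest with a prior for the anticipated answers. The multiplicative weighting below is an assumption made for illustration, not a method stated in the talk, and the rescored values are no longer bounded by 1.0.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, confidence score)


def apply_hints(nbest: List[Hypothesis], hints: Dict[str, float]) -> List[Hypothesis]:
    """Re-rank nbest by multiplying each score by a prior for anticipated answers."""
    default_prior = 1.0  # unanticipated answers keep their original score
    rescored = [
        (text, score * hints.get(text.lower(), default_prior))
        for text, score in nbest
    ]
    return sorted(rescored, key=lambda h: h[1], reverse=True)


# Explicit anticipation: we are confirming an identity we already know.
hints = {"coin": 1.5}  # made-up prior weight
nbest = [("cohen", 0.50), ("coin", 0.45), ("cone", 0.30)]
reranked = apply_hints(nbest, hints)  # "coin" now outranks "cohen"
```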
What should we disregard?
When is a recognition event truly the human talking to the computer?
  The human is speaking
    But not to the computer
    But saying the wrong thing
  Some human is saying something
  Other noise
    Car horn, mic bump, radio music, etc.
As dialogs get longer we need to politely ignore what we were not intended to respond to
In and Out of Grammar (oog)
The recognizer returned some text
  Was it really what was said?
  Can we improve over the “confidence”?
Look at the “scores” of the nbest
  Use them as a “feature space”
  Use example waves to discover clusters in feature space that correlate with “in” and “out” of Vocabulary
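A rough sketch of the feature-space idea, assuming three hand-picked features (top score, margin over the runner-up, mean score) and a nearest-centroid rule in place of real cluster discovery. The feature choice, the function names, and the example scores are all invented for illustration.

```python
from typing import List, Sequence, Tuple


def nbest_features(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Turn a best-first list of nbest scores into a small feature vector."""
    top = scores[0]
    gap = top - scores[1] if len(scores) > 1 else top  # margin over runner-up
    mean = sum(scores) / len(scores)
    return (top, gap, mean)


def centroid(vectors: List[Tuple[float, ...]]) -> Tuple[float, ...]:
    """Component-wise mean of a list of feature vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def is_in_grammar(scores, in_centroid, oog_centroid) -> bool:
    """Classify by whichever labeled cluster centre is closer."""
    f = nbest_features(scores)
    return sq_dist(f, in_centroid) <= sq_dist(f, oog_centroid)


# Made-up nbest scores standing in for labeled example waves.
in_examples = [[0.90, 0.40, 0.20], [0.85, 0.30, 0.25]]
oog_examples = [[0.50, 0.48, 0.45], [0.45, 0.44, 0.40]]
in_c = centroid([nbest_features(s) for s in in_examples])
oog_c = centroid([nbest_features(s) for s in oog_examples])
```

In this toy data the in-grammar examples have one clear winner, while the out-of-grammar examples have several hypotheses with similar, mediocre scores.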
Where do we put it?
Where does all this heuristic post analysis go?
  Out in the dialog?
How can we minimize the cognitive load on the application developer?
We need to wrap up all this extra functionality inside a new container to hide the extra complexity
Re-listening
If an utterance is going to be rejected then try again. (Re-listen to the same wave)
If you can infer a smaller scope then listen with a grammar that “leans” that way.
Merge the nbests via some heuristic
Re-think the combined utterance to see if it can now be considered “good and in grammar”
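The re-listening flow might look like the sketch below. It assumes a hypothetical `recognize(wave, grammar)` callable, a fixed confidence threshold, and "keep the best score per text" as the merge heuristic; none of these details come from the talk.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, confidence score)
Recognizer = Callable[[bytes, str], List[Hypothesis]]  # hypothetical signature


def relisten(wave: bytes, recognize: Recognizer, wide: str, narrow: str,
             threshold: float = 0.6) -> List[Hypothesis]:
    """Recognize once; if the result would be rejected, re-run the same
    audio against a narrower grammar and merge the two nbest lists."""
    first = recognize(wave, wide)
    if first and first[0][1] >= threshold:
        return first  # good enough, no second pass needed
    second = recognize(wave, narrow)
    merged: Dict[str, float] = {}
    for text, score in first + second:
        merged[text] = max(score, merged.get(text, 0.0))  # best score per text
    return sorted(merged.items(), key=lambda h: h[1], reverse=True)


# Stand-in recognizer with canned results, for illustration only.
def fake_recognize(wave: bytes, grammar: str) -> List[Hypothesis]:
    canned = {
        "cities": [("Austin", 0.45), ("Boston", 0.40)],
        "texas_cities": [("Austin", 0.70), ("Houston", 0.20)],
    }
    return canned[grammar]


result = relisten(b"", fake_recognize, wide="cities", narrow="texas_cities")
```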
Serial Listening
The last utterance is not “good enough”
Prompt for a repeat and listen again (live audio from the user)
If it is “good” by itself then use it
Otherwise, heuristically merge the nbests based on similarities
Re-think the combined utterance to see if it can now be considered “good and in grammar”
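One way to merge "based on similarities" is to let each hypothesis from the repeat borrow support from similar text in the earlier nbest. The sketch uses `difflib.SequenceMatcher` from the Python standard library for string similarity; the additive scoring and the 0.8 cutoff are arbitrary assumptions, and the merged values are ranking scores rather than confidences.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, confidence score)


def merge_serial(first: List[Hypothesis], second: List[Hypothesis],
                 min_similarity: float = 0.8) -> List[Hypothesis]:
    """Boost each hypothesis from the repeat by its best-matching
    hypothesis in the earlier attempt, then re-rank."""
    merged = []
    for text2, score2 in second:
        support = 0.0
        for text1, score1 in first:
            sim = SequenceMatcher(None, text1.lower(), text2.lower()).ratio()
            if sim >= min_similarity:
                support = max(support, sim * score1)
        merged.append((text2, score2 + support))
    return sorted(merged, key=lambda h: h[1], reverse=True)


# Neither attempt is convincing alone, but "fifty" shows up in both.
first = [("fifteen", 0.45), ("fifty", 0.44)]
second = [("sixty", 0.46), ("fifty", 0.45)]
merged = merge_serial(first, second)
```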
Parallel Listening
Listen on two recognizers
  One with the narrow “expectation” grammar
  The other with the wide “possible” grammar
If the utterance is in both results, process the “expectation” results
If not, process the “possible” results
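The selection rule can be sketched as follows. It interprets "in both results" as "the top expectation hypothesis also appears somewhere in the wide nbest", which is an assumption; `choose_parallel` is a hypothetical name.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (recognized text, confidence score)


def choose_parallel(expectation: List[Hypothesis],
                    possible: List[Hypothesis]) -> List[Hypothesis]:
    """Prefer the narrow-grammar result when the wide grammar agrees with it."""
    if expectation and possible:
        wide_texts = {text.lower() for text, _ in possible}
        if expectation[0][0].lower() in wide_texts:
            return expectation
    return possible


narrow = [("yes", 0.80), ("no", 0.10)]    # from the "expectation" grammar
wide = [("yes", 0.55), ("guess", 0.50)]   # from the "possible" grammar
chosen = choose_parallel(narrow, wide)    # both agree on "yes"
```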
Conclusions
Error rate is the metric to watch
There is more information in the recognition result than the 1st good nbest
Putting conventional recognition inside a heuristic “box” makes sense
The information needed by the “box” is a logical extension of the listening context