Transcript of Spoken Dialogue System Architecture Joshua Gordon CS4706 1.
- Slide 1
- Spoken Dialogue System Architecture. Joshua Gordon, CS4706
- Slide 2
- Outline: goals of an SDS architecture; research challenges;
practical considerations; an end-to-end tour of a real-world SDS.
- Slide 3
- SDS Architectures: software abstractions that tie together and
orchestrate the many NLP components required for human-computer
dialogue. They conduct task-oriented, limited-domain conversations
and manage the many levels of information processing (e.g.,
utterance interpretation, turn taking) necessary for dialogue, in
real time and under uncertainty.
- Slide 4
- Examples: information seeking, transactional (the most common
kind). CMU bus route information (Let's Go Public); Columbia
Virtual Librarian; Google directory service.
- Slide 5
- Examples: virtual humans. Multimodal input / output; prosody and
facial expression; auditory and visual cues assist turn taking.
Many limitations: scripting, constrained domain.
http://ict.usc.edu/projects/virtual_humans
- Slide 6
- Examples: interactive kiosks. Multi-participant conversations!
Surprises passersby and challenges them to trivia games [Bohus and
Horvitz, 2009].
- Slide 7
- Examples: robotic interfaces. www.cellbots.com; speech interface
to a UAV [Eliasson, 2007].
- Slide 8
- Conversational skills. SDS architectures tie together: speech
recognition, turn taking, dialogue management, utterance
interpretation, grounding, and natural language generation. They
increasingly include multimodal input / output and gesture
recognition.
- Slide 9
- Research challenges in every area. Speech recognition: accuracy
in interactive settings, detecting emotion. Turn taking: fluidly
handling overlap, backchannels. Dialogue management: increasingly
complex domains, better generalization, multi-party conversations.
Utterance interpretation: reducing constraints on what the user can
say and how they can say it; attending to prosody, emphasis, speech
rate.
- Slide 10
- A tour of a real-world SDS: CMU Olympus. An open source
collection of dialogue system components. A research platform used
to investigate dialogue management, turn taking, and spoken
language interpretation. Actively developed. Many implementations:
Let's Go Public, Team Talk, CheckItOut. www.speech.cs.cmu.edu
- Slide 11
- Conventional SDS Pipeline: (1) speech signals to words; (2) words
to domain concepts; (3) concepts to system intentions; (4)
intentions to utterances (represented as text); (5) text to speech.
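The five stages above can be sketched as a chain of functions. This is only an illustration of the data flow: every function name and rule below is a hypothetical stand-in, not an Olympus component, and the "audio" is simulated with a plain string.

```python
# Hypothetical stand-ins for the five pipeline stages; none of this is Olympus code.

def recognize(audio: str) -> str:
    """Speech signals to words (the 'audio' here is already text)."""
    return audio.lower()

def understand(words: str) -> dict:
    """Words to domain concepts."""
    concepts = {}
    if "bus" in words.split():
        concepts["topic"] = "bus_schedule"
    return concepts

def decide(concepts: dict) -> str:
    """Concepts to a system intention."""
    return "ask_route" if concepts.get("topic") == "bus_schedule" else "ask_repeat"

def generate(intention: str) -> str:
    """Intention to an utterance represented as text."""
    templates = {
        "ask_route": "Which route are you interested in?",
        "ask_repeat": "Sorry, I didn't catch that.",
    }
    return templates[intention]

def synthesize(text: str) -> str:
    """Text to speech (a tagged string stands in for the waveform)."""
    return f"<audio>{text}</audio>"

PIPELINE = [recognize, understand, decide, generate, synthesize]

def run_turn(audio: str) -> str:
    """Push one user turn through every stage in order."""
    value = audio
    for stage in PIPELINE:
        value = stage(value)
    return value

print(run_turn("When is the next BUS"))
```

Because each stage only sees the output of the one before it, an error made early (for example in recognition) is carried through every later stage.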
- Slide 12
- Olympus under the hood: provider pattern
- Slide 13
- Speech recognition
- Slide 14
- The Sphinx Open Source Recognition Toolkit: PocketSphinx. A
continuous-speech, speaker-independent recognition system. Includes
tools for language model compilation, pronunciation, and acoustic
model adaptation. Provides word-level confidence annotation and
n-best lists. Efficient: runs on embedded devices (including an
iPhone SDK). Olympus supports parallel decoding engines / models,
and typically runs parallel acoustic models for male and female
speech. http://cmusphinx.sourceforge.net/
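When several decoders run in parallel, their n-best lists have to be reconciled somehow. The sketch below shows one plausible way, pooling the lists and ranking by confidence. It is an assumption for illustration, not a description of how Olympus or Sphinx does it; the `Hypothesis` type, the engine names, and the scores are all made up.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    engine: str        # which decoder produced it (hypothetical label)
    words: str         # recognized word string
    confidence: float  # utterance-level confidence in [0, 1]

def merge_nbest(*nbest_lists, n=3):
    """Pool the n-best lists of several engines and keep the n most confident."""
    merged = [hyp for nbest in nbest_lists for hyp in nbest]
    merged.sort(key=lambda hyp: hyp.confidence, reverse=True)
    return merged[:n]

# Made-up outputs from two acoustic models decoding the same utterance.
male_model = [
    Hypothesis("male", "the language of sycamores", 0.41),
    Hypothesis("male", "the language of seek a more", 0.22),
]
female_model = [
    Hypothesis("female", "the language of sycamores", 0.63),
]

best = merge_nbest(male_model, female_model)[0]
print(best.engine, best.words)
```

Comparing raw confidences across engines assumes the scores are calibrated against each other, which may not hold in practice.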
- Slide 15
- Speech recognition challenge in interactive settings
- Slide 16
- Spontaneous dialogue is difficult for speech recognizers.
Recognition is poor in interactive settings compared to one-off
applications like voice search and dictation. Performance
phenomena: backchannels, pause-fillers, false starts; OOV
(out-of-vocabulary) words. Interaction with an SDS is cognitively
demanding for users: What can I say and when? Will the system
understand me? Uncertainty increases disfluency, resulting in
further recognition errors.
- Slide 17
- WER (Word Error Rate). Non-interactive settings: Google Voice
Search: 17% deployed (0.57% OOV over 10k queries randomly sampled
from Sept-Dec, 2008). Interactive settings: Let's Go Public: 17% in
controlled conditions vs. 68% in the field. CheckItOut: used to
investigate task-oriented performance under worst-case ASR, 30% to
70% depending on experiment. Virtual Humans: 37% in laboratory
conditions.
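For reference, WER is usually computed as the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and a reference transcript, divided by the number of reference words. The function below is a minimal sketch of that standard definition; the function name is ours, and the example pair is the book-title exchange quoted elsewhere in this deck.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)

# One substitution plus two insertions against a four-word reference.
print(word_error_rate("The Language of Sycamores",
                      "THE LANGUAGE OF IS.A. COMING WARS"))  # 0.75
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.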
- Slide 18
- Examples of (worst-case) recognizer noise.
  S: What book would you like?
  U: The Language of Sycamores
  ASR: THE LANGUAGE OF IS.A. COMING WARS
  S: Hi Scott, welcome back!
  U: Not Scott, Sarah! Sarah Lopez.
  ASR: SCOTT SARAH SCOUT LAW
- Slide 19
- Error Propagation. Recognizer noise injects uncertainty into the
pipeline. Information loss occurs when moving from an acoustic
signal to a lexical representation: most SDSs ignore prosody,
amplitude, emphasis. Information provided to downstream components
includes an n-best list or word lattice, and low-level features:
speech rate, speech energy.
- Slide 20
- Spoken Language Understanding
- Slide 21
- SLU maps from words to concepts: dialog acts (the overall intent
of an utterance) and domain-specific concepts (like a book, or bus
route). Single utterances vs. across turns. Challenging in noisy
settings. Ex. "Does the library have Hitchhiker's Guide to the
Galaxy by Douglas Adams on audio cassette?"
  Dialog Act: Book Request
  Title: The Hitchhiker's Guide to the Galaxy
  Author: Douglas Adams
  Media: Audio Cassette
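The result of SLU for the example above can be held in a small frame structure: one dialog act plus a set of filled slots. The class and field names here are assumptions chosen for illustration, not the representation any particular system uses.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    dialog_act: str                            # overall intent of the utterance
    slots: dict = field(default_factory=dict)  # domain-specific concepts

    def missing(self, required):
        """Slots the dialogue manager would still need to ask about."""
        return [name for name in required if name not in self.slots]

# The library example, fully understood.
full = Frame("BookRequest", {
    "title": "The Hitchhiker's Guide to the Galaxy",
    "author": "Douglas Adams",
    "media": "Audio Cassette",
})

# The same request when noise wiped out everything but the author.
partial = Frame("BookRequest", {"author": "Douglas Adams"})

print(full.missing(["title", "author", "media"]))     # []
print(partial.missing(["title", "author", "media"]))  # ['title', 'media']
```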
- Slide 22
- Semantic grammars. Domain-independent concepts: [Yes], [No],
[Help], [Repeat], [Number]. Domain-specific concepts: [Book],
[Author].
  [Quit]
      (*THANKS *good bye)
      (*THANKS goodbye)
      (*THANKS +bye)
  ;
  THANKS
      (thanks *VERY_MUCH)
      (thank you *VERY_MUCH)
  VERY_MUCH
      (very much)
      (a lot)
  ;
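To make the [Quit] grammar concrete, here is a toy matcher for it. It assumes the usual reading of the notation: a leading `*` marks an optional item, a leading `+` marks one or more repetitions, and an upper-case name refers to another rule. This is a from-scratch sketch, not the Phoenix parser, and `GRAMMAR`, `match_symbol`, `match_pattern`, and `parses` are names invented here.

```python
GRAMMAR = {
    "Quit": ["*THANKS *good bye", "*THANKS goodbye", "*THANKS +bye"],
    "THANKS": ["thanks *VERY_MUCH", "thank you *VERY_MUCH"],
    "VERY_MUCH": ["very much", "a lot"],
}

def match_symbol(tokens, name, pos):
    """Return every position where `name` can end if it starts at `pos`."""
    if name in GRAMMAR:  # a rule: try each of its alternative patterns
        ends = set()
        for pattern in GRAMMAR[name]:
            ends |= match_pattern(tokens, pattern.split(), pos)
        return ends
    if pos < len(tokens) and tokens[pos] == name:  # a literal word
        return {pos + 1}
    return set()

def match_pattern(tokens, pattern, start):
    """Match a sequence of items, tracking all reachable end positions."""
    positions = {start}
    for item in pattern:
        optional = item.startswith("*")
        repeat = item.startswith("+")
        name = item.lstrip("*+")
        reached = set(positions) if optional else set()
        frontier = positions
        while frontier:
            step = set()
            for pos in frontier:
                step |= match_symbol(tokens, name, pos)
            step -= reached              # keep only newly reached positions
            reached |= step
            frontier = step if repeat else set()
        positions = reached
    return positions

def parses(rule, utterance):
    """True if the whole utterance is covered by `rule`."""
    tokens = utterance.lower().split()
    return len(tokens) in match_symbol(tokens, rule, 0)

print(parses("Quit", "thanks a lot good bye"))  # True
print(parses("Quit", "bye bye"))                # True
print(parses("Quit", "hello there"))            # False
```

The matcher requires the rule to cover the whole utterance; a robust parser would instead look for rule matches anywhere in a noisy word string and skip what it cannot parse.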
- Slide 23
- Grammars generalize poorly. Useful for extracting fine-grained
concepts, but: hand engineered; time consuming to develop and tune;
require expert linguistic knowledge to construct; difficult to
maintain over complex domains; lack robustness to OOV words and
novel phrasing; sensitive to recognizer noise.
- Slide 24
- SLU in Olympus: the Phoenix Parser. Phoenix is a semantic parser,
intended to be robust to recognition noise. Phoenix parses the
incoming stream of recognition hypotheses and maps words in ASR
hypotheses to semantic frames. Each frame has an associated CFG
grammar, specifying word patterns that match the slot. Multiple
parses may be produced for a single utterance. The frame is
forwarded to the next component in the pipeline.
- Slide 25
- Statistical methods. Supervised learning is commonly used for
single-utterance interpretation: given word sequence W, find the
semantic representation of meaning M that has maximum a posteriori
probability P(M|W). Useful for dialog act identification and
determining broad intent. Like all supervised techniques, it
requires a training corpus, and is often domain and recognizer
dependent.
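One simple way to realize "pick the M that maximizes P(M|W)" is a naive Bayes classifier, which scores each candidate meaning by P(M) times the product of P(w|M) over the words. The slides do not say which model is used, so naive Bayes is our choice for illustration, and the six-utterance corpus and its labels are invented.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus: (utterance, dialog act).
TRAINING = [
    ("yes please", "confirm"),
    ("yes that is right", "confirm"),
    ("no thanks", "reject"),
    ("no that is wrong", "reject"),
    ("i want a book", "book_request"),
    ("do you have this book", "book_request"),
]

def train(examples):
    """Count words per label; these counts define P(M) and P(w|M)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return argmax over M of log P(M) + sum of log P(w|M)."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        # Add-one smoothing, with one extra bucket for unseen words.
        denominator = sum(word_counts[label].values()) + len(vocabulary) + 1
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING)
print(classify("yes that is correct", model))  # confirm
```

The smoothing lets the classifier cope with a word it never saw ("correct"), but the model is still only as good as the match between its training corpus and the recognizer output it sees at run time.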
- Slide 26
- Belief updating
- Slide 27
- Cross-utterance SLU. U: "Get my coffee cup and put it on my desk.
The one at the back." Difficult in noisy settings. Mostly new
territory for SDS [Zuckerman, 2009].
- Slide 28
- Dialogue Management
- Slide 29
- The Dialogue Manager. Represents the system's agenda. Many
techniques: hierarchical plans, state / transaction tables, Markov
processes. System initiative vs. mixed initiative: system
initiative has less uncertainty about the dialog state, but is
clunky. Required to manage uncertainty and error handling: belief
updating, domain-independent error handling strategies.
- Slide 30
- Task Specification, Agenda, and Execution [Bohus, 2007]
- Slide 31
- Domain independent error handling [Bohus, 2007]
- Slide 32
- Error recovery strategies.
  Error handling strategies (misunderstanding), with examples:
    Explicit confirmation: "Did you say you wanted a room starting at 10 a.m.?"
    Implicit confirmation: "Starting at 10 a.m.... until what time?"
  Error handling strategies (non-understanding), with examples:
    Notify that a non-understanding occurred: "Sorry, I didn't catch that."
    Ask user to repeat: "Can you please repeat that?"
    Ask user to rephrase: "Can you please rephrase that?"
    Repeat prompt: "Would you like a small room or a large one?"
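A common way to choose among these strategies is to threshold the confidence of the understood value: trust it and confirm implicitly when confidence is high, confirm explicitly when it is middling, and fall back to a non-understanding prompt otherwise. The thresholds, the escalation order, and the function below are assumptions for illustration, not the policy from [Bohus, 2007].

```python
NON_UNDERSTANDING_PROMPTS = [
    "Sorry, I didn't catch that.",
    "Can you please repeat that?",
    "Can you please rephrase that?",
]

def respond(start_time, confidence, failures=0, high=0.8, low=0.4):
    """Pick the next system prompt for a (hypothetical) start-time slot.

    start_time: the value SLU extracted, or None on a non-understanding.
    confidence: how much the system trusts that value, in [0, 1].
    failures:   consecutive non-understandings so far.
    """
    if start_time is None:
        # Non-understanding: escalate through the prompts on repeated failure.
        index = min(failures, len(NON_UNDERSTANDING_PROMPTS) - 1)
        return NON_UNDERSTANDING_PROMPTS[index]
    if confidence >= high:
        # Implicit confirmation: echo the value while moving the task forward.
        return f"Starting at {start_time}... until what time?"
    if confidence >= low:
        # Explicit confirmation: spend a turn checking the value.
        return f"Did you say you wanted a room starting at {start_time}?"
    # Too unreliable to confirm; treat it like a non-understanding.
    return "Can you please repeat that?"

print(respond("10 a.m.", 0.9))
print(respond("10 a.m.", 0.5))
print(respond(None, 0.0, failures=2))
```

Implicit confirmation costs no extra turn but lets an error slip through if the user does not object, which is why it is reserved here for high-confidence values.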
- Slide 33
- Statistical Approaches to Dialogue Management. Learning a
management policy from a corpus. Dialogue can be modeled as a
Partially Observable Markov Decision Process (POMDP). Reinforcement
learning is applied (either to existing corpora or through user
simulation studies) to learn an optimal strategy. Evaluation
functions typically reference the PARADISE framework.
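In a POMDP the system never sees the user's goal directly; it keeps a probability distribution (a belief) over hidden states and revises it after each noisy observation, using b'(s') proportional to P(o|s') times the sum over s of P(s'|s) b(s). The sketch below implements that update for a fixed system action. The two-state room-size domain and the 0.8 / 0.2 recognizer accuracy are invented numbers.

```python
def update_belief(belief, transition, observation_prob, observation):
    """One POMDP belief update (system action held fixed for brevity).

    belief[s]                 current probability of hidden state s
    transition[s][s2]         P(next state s2 | state s)
    observation_prob[s2][o]   P(observing o | state s2)
    """
    states = list(belief)
    unnormalized = {}
    for s2 in states:
        predicted = sum(transition[s][s2] * belief[s] for s in states)
        unnormalized[s2] = observation_prob[s2][observation] * predicted
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

# Hypothetical domain: the user wants a small or a large room.
belief = {"small": 0.5, "large": 0.5}
# Assume the user's goal does not change during the dialogue.
transition = {"small": {"small": 1.0, "large": 0.0},
              "large": {"small": 0.0, "large": 1.0}}
# Assume the recognizer reports the true goal 80% of the time.
observation_prob = {"small": {"small": 0.8, "large": 0.2},
                    "large": {"small": 0.2, "large": 0.8}}

belief = update_belief(belief, transition, observation_prob, "small")
print(round(belief["small"], 3))  # 0.8
belief = update_belief(belief, transition, observation_prob, "small")
print(round(belief["small"], 3))  # 0.941
```

Hearing "small" twice raises the belief from 0.5 to 0.8 to about 0.94; a learned policy would then choose actions (ask, confirm, proceed) as a function of this belief rather than of a single recognized string.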
- Slide 34
- Interaction management
- Slide 35
- The Interaction Manager. Mediates between the discrete, symbolic
reasoning of the dialog manager and the continuous, real-time
nature of user interaction. Manages timing, turn-taking, and
barge-in: yields the turn to the user on interruption; prevents the
system from speaking over the user. Notifies the dialog manager of
interruptions and incomplete utterances.
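Barge-in handling can be sketched as a small event-driven class: while a prompt is playing, a "user started speaking" event stops playback and reports what was and was not said. This is a simplified, single-threaded illustration with invented names; a real interaction manager works on audio streams and timers rather than word lists.

```python
class InteractionManager:
    """Toy barge-in handler; 'playback' is simulated one word per tick."""

    def __init__(self, notify_dialog_manager):
        self.notify = notify_dialog_manager  # callback into the dialog manager
        self.pending = []                    # words still to be spoken
        self.spoken = []                     # words already spoken

    def start_prompt(self, text):
        self.pending = text.split()
        self.spoken = []

    def tick(self):
        """Speak the next word; return it, or None if the system is silent."""
        if not self.pending:
            return None
        word = self.pending.pop(0)
        self.spoken.append(word)
        return word

    def user_started_speaking(self):
        """Yield the turn if the user interrupts a prompt in progress."""
        if self.pending:
            self.notify({
                "event": "interruption",
                "spoken": " ".join(self.spoken),
                "unspoken": " ".join(self.pending),
            })
            self.pending = []  # stop talking over the user

events = []
manager = InteractionManager(events.append)
manager.start_prompt("Would you like a small room or a large one?")
manager.tick()
manager.tick()
manager.user_started_speaking()
print(events)
```

Reporting the unspoken remainder matters because the dialog manager cannot assume the user heard options that were never played.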
- Slide 36
- Natural Language Generation and Speech Synthesis
- Slide 37
- NLG and Speech Synthesis. Template based, e.g., for explicit
error handling strategies: "Did you say ?" More interesting cases
in disambiguation dialogs. A TTS synthesizes the NLG output. The
audio server allows interruption mid-utterance. Production systems
incorporate prosody and intonation contours to indicate degree of
certainty. Open source TTS frameworks:
Festival - http://www.cstr.ed.ac.uk/projects/festival/
Flite - http://www.speech.cs.cmu.edu/flite/
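Template-based generation amounts to filling named slots in canned strings. The sketch below uses Python's standard `string.Template`; the act names and the `$value` / `$next_question` slot names are hypothetical, since the transcript does not preserve the placeholder used on the slide.

```python
from string import Template

# Hypothetical templates keyed by the system's intended act.
TEMPLATES = {
    "explicit_confirm": Template("Did you say $value?"),
    "implicit_confirm": Template("$value... $next_question"),
    "non_understanding": Template("Sorry, I didn't catch that."),
}

def generate(act, **slots):
    """Fill the template for `act`; raises KeyError if a slot is missing."""
    return TEMPLATES[act].substitute(**slots)

print(generate("explicit_confirm", value="10 a.m."))
print(generate("implicit_confirm", value="Starting at 10 a.m.",
               next_question="until what time?"))
```

Using `substitute` rather than `safe_substitute` makes a missing slot fail loudly instead of sending a half-filled prompt to the synthesizer.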
- Slide 38
- Asynchronous architectures. [Blaylock, 2002]: an asynchronous
modification of TRIPS; most work is directed toward best-case
speech recognition. [Lemon, 2003]: a backup recognition pass
enables better discussion of OOV utterances.
- Slide 39
- Problem-solving architectures. FORRSooth models task-oriented
dialogue as cooperative decision making. Six FORR-based services
operate in parallel: Interpretation, Grounding, Generation,
Discourse, Satisfaction, Interaction. Each service has access to
the same knowledge in the form of descriptives.
- Slide 40
- Thanks! Questions?