Dialog Management for Rapid-Prototyping of Speech-Based Training Agents
Victor Hung, Avelino Gonzalez, Ronald DeMara
University of Central Florida
Agenda
• Introduction
• Approach
• Evaluation
• Results
• Conclusions
Introduction
• General Problem
  – Raise speech-based discourse in Embodied Conversational Agents (ECAs) to a new level of naturalness in open-domain dialog
• Specific Problems
  – Overcome Automatic Speech Recognition (ASR) limitations
  – Domain-independent knowledge management
• Training Agent Design
  – Conversational input with robustness to ASR errors and an adaptable knowledge base
Approach
• Build a dialog manager that:
  – Handles ASR limitations
  – Manages domain-independent knowledge
  – Provides open dialog
• CONtext-driven Corpus-based Utterance Robustness (CONCUR)
  – Input Processor
  – Knowledge Manager
  – Discourse Model
Dialog Manager
[Block diagram: user input enters the Input Processor, which feeds the Knowledge Manager and Discourse Model inside the CONCUR dialog manager; the dialog manager produces the agent response.]
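The flow in the block diagram can be sketched as a minimal turn-taking loop. This is an illustrative stand-in, not the actual CONCUR implementation: class names, the keyword-matching lookup, and the fallback response are all assumptions.

```python
# Minimal sketch of the CONCUR-style pipeline: Input Processor ->
# Knowledge Manager -> Discourse Model. All behaviors are simplified
# placeholders, not the real CONCUR modules.

class InputProcessor:
    """Breaks a user utterance into candidate keyphrases."""
    def process(self, utterance):
        # Naive heuristic: keep longer words as keyphrase candidates.
        return [w for w in utterance.lower().split() if len(w) > 3]

class KnowledgeManager:
    """Looks up encyclopedia-style corpus entries by keyphrase."""
    def __init__(self, corpus):
        self.corpus = corpus  # dict: keyphrase -> corpus entry text
    def lookup(self, keyphrases):
        for k in keyphrases:
            if k in self.corpus:
                return self.corpus[k]
        return None  # out-of-corpus request

class DiscourseModel:
    """Chooses a response; falls back when no knowledge matches."""
    def respond(self, entry):
        return entry if entry else "Could you rephrase that?"

class DialogManager:
    """Ties the three modules together for one user turn."""
    def __init__(self, corpus):
        self.ip = InputProcessor()
        self.km = KnowledgeManager(corpus)
        self.dm = DiscourseModel()
    def turn(self, utterance):
        keys = self.ip.process(utterance)
        return self.dm.respond(self.km.lookup(keys))
```

A lookup miss models an out-of-corpus misunderstanding, one of the quality metrics evaluated later.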
CONCUR
• Input Processor
  – Pre-processes the knowledge corpus via keyphrasing
  – Breaks down the user utterance
[Block diagram: corpus data and the user utterance feed the Input Processor's keyphrase extractor, which draws on WordNet and an NLP toolkit.]
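The keyphrasing idea can be illustrated with a naive frequency-based extractor. CONCUR's actual Input Processor uses WordNet and an NLP toolkit; the stopword list and frequency scoring below are simplified assumptions.

```python
# Naive corpus keyphrasing: rank non-stopword terms by frequency.
# A stand-in for CONCUR's WordNet/NLP-toolkit-based extractor.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "what", "about"}

def extract_keyphrases(text, top_n=5):
    """Return the top_n most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]
```

The same routine can serve both roles the slide names: pre-processing corpus entries offline and breaking down a user utterance at runtime.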
• Knowledge Manager
  – Three databases
  – Encyclopedia-entry-style corpus
  – Context-driven
CONCUR
• CxBR Discourse Model
  – Goal Bookkeeper
    • Goal Stack (Branting et al., 2004)
    • Inference Engine
  – Context Topology
    • Agent Goals
    • User Goals
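The Goal Bookkeeper's goal stack can be sketched as below. Method names are hypothetical, and the inference engine and context topology are omitted; this only shows the bookkeeping idea of pushing user goals as they are detected and popping them as they are fulfilled.

```python
# Sketch of goal bookkeeping via a goal stack (after Branting et al., 2004).
# Hypothetical interface; not the actual CONCUR Goal Bookkeeper.

class GoalStack:
    def __init__(self):
        self._goals = []

    def push(self, goal):
        """Record a newly detected user goal (ignoring duplicates)."""
        if goal not in self._goals:
            self._goals.append(goal)

    def current(self):
        """The goal the agent should address this turn, if any."""
        return self._goals[-1] if self._goals else None

    def fulfill(self):
        """Mark the current goal as completed and remove it."""
        return self._goals.pop() if self._goals else None
```

Goal completion accuracy, reported in the results, is essentially the fraction of pushed goals that are eventually fulfilled.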
Detailed CONCUR Block Diagram
Evaluation
• Evaluation of conversation agents is plagued by subjectivity, so both objective and subjective metrics were gathered
• Efficiency metrics (quantitative)
  – Total elapsed time
  – Number of user turns
  – Number of system turns
  – Total elapsed time per turn
  – Word Error Rate (WER)
• Quality metrics
  – Out-of-corpus misunderstandings
  – General misunderstandings
  – Errors
  – Total number of user goals
  – Total number of user goals fulfilled
  – Goal completion accuracy
  – Conversational accuracy
• Survey data (qualitative)
  – Naturalness
  – Usefulness
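WER, the key efficiency metric in the results that follow, is standardly computed from the word-level edit distance between the reference transcript and the ASR hypothesis:

```python
# Standard Word Error Rate: (substitutions + deletions + insertions)
# divided by the number of words in the reference transcript.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis inserts many spurious words, which is why values near 60% (as reported below) still leave usable signal in the utterance.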
Evaluation Instrument
Nine statements, judged on a 1-to-7 scale based on level of agreement
Naturalness
  – If I told someone the character in this tool was real, they would believe me.
  – The character on the screen seemed smart.
  – I felt like I was having a conversation with a real person.
  – This did not feel like a real interaction with another person.
Usefulness
  – I would be more productive if I had this system in my place of work.
  – The tool provided me with the information I was looking for.
  – I found this to be a useful way to get information.
  – This tool made it harder to get information than talking to a person or using a website.
  – This does not seem like a reliable way to retrieve information from a database.
Data Acquisition
General data set acquisition procedure:
• User asked to interact with agent
  – Natural, information-seeking conversation
  – Voice recording
• User asked to complete survey
Data analysis process:
• Voice transcriptions, ASR transcripts, internal data, and surveys analyzed
Data Set | Dialog Manager | Agent Style | Domain         | Surveys/Transcripts Collected
1        | AlexDSS        | LifeLike Avatar | NSF I/UCRC  | 30/30
2        | CONCUR         | LifeLike Avatar | NSF I/UCRC  | 30/20
3        | CONCUR         | Chatbot         | NSF I/UCRC  | 0/20
4        | CONCUR         | Chatbot         | Current Events | 20/20
Data Acquisition
[Diagram: ECA setup — the user's voice is captured by a microphone, converted to an ASR string by the speech recognizer, and passed to the CONCUR dialog manager; the response string drives the LifeLike Avatar's voice and image through speaker and monitor.]
[Diagram: chatbot setup — user text input from a keyboard goes to the CONCUR dialog manager, and agent text output is displayed through a Jabber-based agent.]
Survey Baseline
Agent                        | Naturalness User Rating | Usefulness User Rating
Data Set 1: AlexDSS Avatar   | 4.02                    | 4.47
Data Set 2: CONCUR Avatar    | 4.14                    | 4.51
Amani (Gandhe et al., 2009)  | 3.09                    | 3.24
Hassan (Gandhe et al., 2009) | 3.55                    | 4.00
Question 1: What are the expectations of naturalness and usefulness for the conversation agents in this study?
1. Both LifeLike Avatars received user assessments that exceeded those of other ECA efforts
Question 2: How differently did users rate the AlexDSS Avatar and the CONCUR Avatar?
2. Both avatar-based systems in the speech-based data sets received similar scores in Naturalness and Usefulness
Survey Baseline
Question 3: How differently did users rate the ECA systems and the chatbot system?
3. The ECA-based systems were judged similarly, and both were rated better than the chatbot
ASR Resilience
Efficiency/Quality Metric           | Data Set 1: AlexDSS Avatar | Data Set 2: CONCUR Avatar
WER                                 | 60.85%                     | 58.48%
Out-of-Corpus Misunderstanding Rate | 0.29%                      | 6.37%
Goal Completion Accuracy            | 63.29%                     | 60.48%
Question 1: Can a speech-based CONCUR Avatar’s goal completion accuracy measure up to the AlexDSS Avatar under a high WER?
1. A speech-based CONCUR Avatar's goal completion accuracy measures up to the AlexDSS Avatar's under a similarly high WER
ASR Resilience
Efficiency/Quality Metric           | Data Set 2: CONCUR Avatar | Data Set 3: CONCUR Chatbot
WER                                 | 58.48%                    | 0.00%
Out-of-Corpus Misunderstanding Rate | 6.37%                     | 6.77%
Goal Completion Accuracy            | 60.48%                    | 68.48%
Question 2: How does improving WER affect CONCUR’s goal completion accuracy?
2. Improved WER does not increase CONCUR’s goal completion accuracy because no new user goals were identified or corrected with the better recognition
ASR Resilience
Agent                                    | Average WER | Goal Completion Accuracy
Data Set 2: CONCUR Avatar                | 58.48%      | 60.48%
Digital Kyoto (Misu and Kawahara, 2007)  | 29.40%      | 61.40%
Question 3: Can CONCUR's goal completion accuracy measure up to other conversation agents' despite a high WER?
3. CONCUR's goal completion accuracy is similar to that of the Digital Kyoto system, despite twice the WER
ASR Resilience
Efficiency/Quality Metric    | Data Set 1: AlexDSS Avatar | Data Set 2: CONCUR Avatar
WER                          | 60.85%                     | 58.48%
General Misunderstanding Rate | 9.51%                     | 14.12%
Error Rate                   | 8.71%                      | 21.81%
Conversational Accuracy      | 81.78%                     | 64.22%
Question 4: Can a speech-based CONCUR Avatar’s conversational accuracy measure up to the AlexDSS avatar under a high WER?
4. The speech-based CONCUR Avatar's conversational accuracy does not measure up to the AlexDSS Avatar's under a similarly high WER. This can be attributed to general misunderstandings and errors caused by misheard user requests, as well as specific question-answering requests that are uncommon in menu-driven discourse models
ASR Resilience
Efficiency/Quality Metric    | Data Set 2: CONCUR Avatar | Data Set 3: CONCUR Chatbot
WER                          | 58.48%                    | 0.00%
General Misunderstanding Rate | 14.12%                   | 7.48%
Error Rate                   | 21.81%                    | 16.68%
Goal Completion Accuracy     | 60.48%                    | 68.48%
Conversational Accuracy      | 64.22%                    | 75.31%
Question 5: How does improving WER affect CONCUR’s conversational accuracy?
5. Improved WER increases CONCUR’s conversational accuracy by decreasing general misunderstandings
ASR Resilience
Agent                            | Average WER | Conversational Accuracy
Data Set 2: CONCUR Avatar        | 58.48%      | 64.22%
TARA (Schumaker et al., 2007)    | 0.00%       | 54.00%
Question 6: Can CONCUR's conversational accuracy measure up to other conversation agents' despite a high WER?
6. CONCUR's conversational accuracy surpasses that of the TARA system, which is text-based
Domain-Independence
Quality Metric                      | Data Set 2: NSF I/UCRC Avatar | Data Set 3: NSF I/UCRC Chatbot | Data Set 4: Current Events Chatbot
Out-of-Corpus Misunderstanding Rate | 6.15%                         | 6.77%                          | 17.45%
Goal Completion Accuracy            | 60.48%                        | 68.48%                         | 48.08%
Question 1: Can CONCUR maintain goal completion accuracy after changing to a less specific domain corpus?
1. CONCUR’s goal completion accuracy does not remain consistent after a change to a generalized domain corpus. Changing domain expertise may increase out-of-corpus requests, which decreases goal completion
Domain-Independence
Quality Metric                | Data Set 2: NSF I/UCRC Avatar | Data Set 3: NSF I/UCRC Chatbot | Data Set 4: Current Events Chatbot
General Misunderstanding Rate | 14.49%                        | 7.48%                          | 0.00%
Error Rate                    | 21.81%                        | 16.68%                         | 16.46%
Conversational Accuracy       | 64.22%                        | 75.34%                         | 83.54%
Question 2: Can CONCUR maintain conversational accuracy after changing to a less specific domain corpus?
2. After changing to a general domain corpus, CONCUR is capable of maintaining its conversational accuracy
Domain-Independence
Dialog System                                | Method                | Turnover Time
CONCUR                                       | Corpus-based          | 3 Days
Marve (Babu et al., 2006)                    | Wizard-of-Oz          | 18 Days
Amani (Gandhe et al., 2009)                  | Question-Answer Pairs | Weeks
AlexDSS                                      | Expert System         | Weeks
Sergeant Blackwell (Robinson et al., 2008)   | Wizard-of-Oz          | 7 Months
Sergeant Star (Artstein et al., 2009)        | Question-Answer Pairs | 1 Year
HMIHY (Béchet et al., 2004)                  | Hand-modeled          | 2 Years
Hassan (Gandhe et al., 2009)                 | Question-Answer Pairs | Years
Question 3: Can CONCUR provide a quick method of providing agent knowledge?
3. CONCUR's Knowledge Manager enables a shorter knowledge development turnover time than other conversation agent knowledge management systems
Conclusions
• Building Training Agents
  – Agent Design
    • ECA format preferred over chatbot format
  – ASR
    • ASR improvements lead to better conversation-level processing
    • A high ASR error rate is not necessarily an obstacle for ECA design
  – Knowledge Management
    • Tailoring domain expertise for an intended audience is more effective than a generalized corpus
    • Separating domain knowledge from agent discourse helps maintain conversational accuracy and speeds up agent development times