Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial...

18
Dialogsystem, AGI Joakim Gustafson, CTT, KTH 1 J. Gustafson, CTT, KTH http://www.speech.kth.se Joakim Gustafson Centrum för Talteknologi, KTH Tal, musik och hörsel, KTH Dialogsystem Dialogsystem Dialogsystem Dialogsystem J. Gustafson, CTT, KTH 2 Agenda Agenda Agenda Agenda Introduction to speaker Talking Machines Research Spoken Dialogue Systems Commercial Speech Applications Spoken Dialogue System Components Spoken Dialogue System Research J. Gustafson, CTT, KTH 3 Introduction Introduction Introduction Introduction J. Gustafson, CTT, KTH 4 Joakim Gustafson Joakim Gustafson Joakim Gustafson Joakim Gustafson 1987-1992 Electrical Engineering program at KTH 1992-1993 Linguistics at Stockholm University 1993-2000 PhD studies at CTT/KTH 2000-2007 Senior researcher at Telia R&D department 2007– Assistant Professor CTT/KTH J. Gustafson, CTT, KTH 5 My research field: My research field: My research field: My research field: spoken dialogue systems spoken dialogue systems spoken dialogue systems spoken dialogue systems J. Gustafson, CTT, KTH 6 Multi Multi Multi Multi -disciplinary field disciplinary field disciplinary field disciplinary field

Transcript of Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial...

Page 1: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

1

J. Gustafson, CTT, KTH http://www.speech.kth.se

Joakim Gustafson

Centrum för Talteknologi, KTHTal, musik och hörsel, KTH

DialogsystemDialogsystemDialogsystemDialogsystem

J. Gustafson, CTT, KTH 2

AgendaAgendaAgendaAgenda

• Introduction to speaker• Talking Machines• Research Spoken Dialogue Systems• Commercial Speech Applications• Spoken Dialogue System Components• Spoken Dialogue System Research

J. Gustafson, CTT, KTH 3

IntroductionIntroductionIntroductionIntroduction

J. Gustafson, CTT, KTH 4

Joakim GustafsonJoakim GustafsonJoakim GustafsonJoakim Gustafson

1987-1992 Electrical Engineering program at KTH

1992-1993 Linguistics at Stockholm University

1993-2000 PhD studies at CTT/KTH

2000-2007 Senior researcher at Telia R&D department

2007– Assistant Professor CTT/KTH

J. Gustafson, CTT, KTH 5

My research field:My research field:My research field:My research field:spoken dialogue systemsspoken dialogue systemsspoken dialogue systemsspoken dialogue systems

J. Gustafson, CTT, KTH 6

Multi Multi Multi Multi ----disciplinary fielddisciplinary fielddisciplinary fielddisciplinary field

Page 2: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

2

J. Gustafson, CTT, KTH

Research areas• Speech production

• Speech perception

• Communication aids

• Multimodal speech synthesis

• Speech recognition

• Speaker verification

• Spoken conversational speech

• Interactive dialogue systems

CTT CTT CTT CTT ---- Centrum Centrum Centrum Centrum förförförför talteknologitalteknologitalteknologitalteknologi

J. Gustafson, CTT, KTH

The KTH speech groupThe KTH speech groupThe KTH speech groupThe KTH speech group

---- Early daysEarly daysEarly daysEarly days

Gunnar Fant and OVE I 1953

J. Gustafson, CTT, KTH

Namn Nr Årskurs Period

Poäng

Talteknologi (Speech Technology)

2F1111 D3-4, E4, F4 3 4

Talteknologi, utökad kurs (Speech Technology, extended course)

2F1112 Media 3-4 3 5

Spektrala transformer för Media 2F1120 Media 3-4 2 4

Musikakustik (Music Acoustics)

2F1212 D4, E4, F4 4 4

Musical communication & music technology(Musikalisk kommunikation och musikteknologi)

2F1213 Media 3-4,E4, D4

4 5

Elektroakustik (Electroacoustics)

2F1400 E4-5, M4-5 1 4

Audioteknik (Audio technology)

2F1410 Media 3-4,E4, D4

2 5

Orkesterspelets teori(Theory of orchestral performance)

2F1601 alla 6

Orkesterspelets praktik(Orchestral performance)

2F1602 alla 6

Elektroprojekt 2U1700 E1 3-4 5

Medieteknik grundkurs (avsnitt Ljud) 2D1574 Media 2 1-2 12

J. Gustafson, CTT, KTH 10

Talking MachinesTalking MachinesTalking MachinesTalking Machines

J. Gustafson, CTT, KTH 11

Talking machines Talking machines Talking machines Talking machines as we know themas we know themas we know themas we know them

• Star Trek

• HAL 9000 (2001)

• C3PO (Star Wars)

J. Gustafson, CTT, KTH 12

The beauty of speechThe beauty of speechThe beauty of speechThe beauty of speech1. Works in hands free situations2. Works in eyes free situations3. Works when other interfaces are inconvenient:

4. Works where disabilities make other interfaces useless

5. Works with common hardware, e.g. a telephone

6. Efficient information transfer (as far as humans are concerned).

Page 3: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

3

J. Gustafson, CTT, KTH 13

Speech application DomainsSpeech application DomainsSpeech application DomainsSpeech application Domains

• Information retrieval systems– e.g. train time table information or directory

inquiries

• Ordering– ticket booking, buying music

• Command control systems– home control (“turn the radio off”) or voice

command shortcuts (“save”)

• Dictation

J. Gustafson, CTT, KTH 14

The beauty of speech (cont)The beauty of speech (cont)The beauty of speech (cont)The beauty of speech (cont)

7. Reasoning.8. Problem solving.9. Naturalness10. Easy-of-use11. Flexibility12. Error handling and hedging13.Mutual adaptation enables error handling and more efficient information transfer

14.Social, bond-building.

J. Gustafson, CTT, KTH 15

Application Domains (cont)Application Domains (cont)Application Domains (cont)Application Domains (cont)

• Games and entertainment– Games can take advantage of the more social aspects of

spoken dialogue

• Co-ordinated collaboration– The task of controlling or over-viewing complex situations

requires flexibility and efficiency.

• Expert systems– Diagnose and help systems that need to reason about facts

and goals and may benefit from natural dialogue

• Learning and training – Naturalness, flexibility, and robustness are attractive features

in training environments

J. Gustafson, CTT, KTH 16

The spoken dialogueThe spoken dialogueThe spoken dialogueThe spoken dialoguesystem vision system vision system vision system vision

An interface that allows speakers to interact with a computer using spontaneous, unconstrained speech

J. Gustafson, CTT, KTH 17

Multimodal interfacesMultimodal interfacesMultimodal interfacesMultimodal interfaces

• User and system interact by speech and other modalities.

J. Gustafson, CTT, KTH 18

Why multimodality?Why multimodality?Why multimodality?Why multimodality?

• Different modalities suited for different things

• Increased flexibility• New kinds of terminals• Facilitates error handling• Graphical output can confirm spoken input

Page 4: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

4

J. Gustafson, CTT, KTH 19

Research SpokenResearch SpokenResearch SpokenResearch SpokenDialogue SystemsDialogue SystemsDialogue SystemsDialogue Systems

J. Gustafson, CTT, KTH 20

Classic research dialogue systemsClassic research dialogue systemsClassic research dialogue systemsClassic research dialogue systems

• Historic research systems– Voyager (1989)– ATIS (1992)– SUNDIAL (1993)– TRAINS (1996)

• Application– Philips Train Information (1995)

• Large Efforts– Communicator – Verbmobil– SmartKom

J. Gustafson, CTT, KTH 21

The TRAINS Project: Natural Spoken The TRAINS Project: Natural Spoken The TRAINS Project: Natural Spoken The TRAINS Project: Natural Spoken Dialogue and Interactive PlanningDialogue and Interactive PlanningDialogue and Interactive PlanningDialogue and Interactive Planning

• Project Leader: James Allen

• TRAINS-91 was limited in scope but were important demonstration of the "doability" of spoken dialogue systems.

J. Gustafson, CTT, KTH 22

The Nordic SceneThe Nordic SceneThe Nordic SceneThe Nordic Scene

• Stockholm, Sweden– KTH (Waxholm, August, AdApt, Higgins)– Telia (Resebokning, NICE)

• Linköping, Sweden– IDA (LINLIN, Ornitology, Movie Finder)

• Göteborg, Sweden– Lingvistiken (TRINDI, GODIS, TALK)– Chalmers (State-chart xml)

• Aalborg, Odense Denmark• Helsinki, Tampere Finland• Trondheim, Norway

J. Gustafson, CTT, KTH 23

The Waxholm Project (92The Waxholm Project (92The Waxholm Project (92The Waxholm Project (92----95)95)95)95)

• Tourist information– Stockholm archipelago

– time-tables, hotels, hostels, camping and dining possibilities.

• Mixed initiative dialogue– speech recognition

– multimodal synthesis

• Graphic information– pictures, maps, charts and

time-tables.

J. Gustafson, CTT, KTH 24

Waxholm SystemWaxholm SystemWaxholm SystemWaxholm System

Page 5: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

5

J. Gustafson, CTT, KTH 25

Waxholm DatabaseWaxholm DatabaseWaxholm DatabaseWaxholm Database

• About 70 subjects (9200 words)• Phonetically transcribed• Examples from the Waxholm system• Five different speakers• EJ KR GÖ LN MK

(.wav) (.wav) (.wav) (.wav) (.wav)

J. Gustafson, CTT, KTH 26

Three years later….Three years later….Three years later….Three years later….

J. Gustafson, CTT, KTH

The Waxholm systemThe Waxholm systemThe Waxholm systemThe Waxholm system

Where can I find hotels?Information about hotels is shown in this tableWhich hotels are in Waxholm?Information about the hotels in Waxholm is

shown in this table

Where is Waxholm?Waxholm is shown on this mapI think I want to go to WaxholmI am looking for boats to Waxholm

From where do you want to go

The cityWhich day of the week do you want to go? I want to go tomorrow There are lots of boats from Stockholm to Waxholm

on a Friday, At what time do you want to go?

When do the evening boats depart?This is a table of the boats...Is it possible to eat in Waxholm?Information about the restaurants in Waxholm

is shown in this table

Thank youThank you tooJ. Gustafson, CTT, KTH 28

The August kiosk at The August kiosk at The August kiosk at The August kiosk at KulturhusetKulturhusetKulturhusetKulturhuset 98 98 98 98

• Swedish spoken dialogue system for public use • Animated agent named after August Strindberg• Speech technology components developed at KTH• Designed to be easily expanded and reconfigured • Used to collect spontaneous speech data

J. Gustafson, CTT, KTH

The August system

What is your name?I call myself Strindberg, but I

don’t really have a surnameWhen were you born?Strindberg was born in 1849What do you do for a living?I can answer questions about

Strindberg, KTH and Stockholm

How many people live in

Stockholm?

Over a million people live

in the Stockholm areaDo you like it here?People who live in glass houses

should not throw stones

Yes, that was a smart thing

to say!

I come from the department

of Speech, Music and Hearing

The Royal Institute of

Technology!

The information is shown

on the mapThank you!You are welcome!Good bye!Perhaps we will meet

soon again!Yes, it might be that we will!Strindberg was married

three times!

Strindberg was married

three times! J. Gustafson, CTT, KTH 30

The HIGGINS system (03The HIGGINS system (03The HIGGINS system (03The HIGGINS system (03----))))

• The primary domain of HIGGINS is city navigation for pedestrians.

• Secondarily, HIGGINS is intended to provide simple information about the immediate surroundings.

Page 6: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

6

J. Gustafson, CTT, KTH 31

Higgins målsättning Higgins målsättning Higgins målsättning Higgins målsättning ----demodemodemodemo

J. Gustafson, CTT, KTH 32

The Commercial Speech The Commercial Speech The Commercial Speech The Commercial Speech ApplicationsApplicationsApplicationsApplications

J. Gustafson, CTT, KTH 33

Access Convergence & SpeechAccess Convergence & SpeechAccess Convergence & SpeechAccess Convergence & Speech

J. Gustafson, CTT, KTH 34

The “Killer” ApplicationsThe “Killer” ApplicationsThe “Killer” ApplicationsThe “Killer” Applicationsof Speech technologyof Speech technologyof Speech technologyof Speech technology

• Dictation– Medical

– Legal

– Windows Vista (95-99% without training?)

• Customer care center automation– Call Routing

– Technical support

• Multimodal mobile applications– Information search

– Content download

– Entertainment

J. Gustafson, CTT, KTH 35

Third generation dialog systemsThird generation dialog systemsThird generation dialog systemsThird generation dialog systems

LOW MEDIUM HIGH

COMPLEXITY

FLIGHT

STATUS

STOCK

TRADING

PACKAGE

TRACKING

FLIGHT/TRAIN

RESERVATION

BANKING CUSTOMER

CARE

TECHNICAL

SUPPORT (Tell-Eureka)

1st Generation

INFORMATIONAL

2nd Generation

TRANSACTIONAL

3RD Generation

PROBLEM SOLVING

(Pieracchini 2006) J. Gustafson, CTT, KTH 36

Problem solvingProblem solvingProblem solvingProblem solving

• Voxify– Wyndham hotel reservation

– Continental airline tickets

• SpeechCycle– Broadband support Agent

– Video support Agent

– IP Telephony Agent

Page 7: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

7

J. Gustafson, CTT, KTH 37

Speech Technology Speech Technology Speech Technology Speech Technology ComponentsComponentsComponentsComponents

J. Gustafson, CTT, KTH 38

Spoken dialogue processingSpoken dialogue processingSpoken dialogue processingSpoken dialogue processing

J. Gustafson, CTT, KTH 39

The Components of The Components of The Components of The Components of Spoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue SystemsSpoken Dialogue Systems

• Audio Server• Automatic Speech Recognition (ASR)• Natural Language Understanding (NLU)• Dialogue Management (DM)• Natural Language Generation (NLG)• Text-To-Speech synthesis (TTS)• Multimodal modules

J. Gustafson, CTT, KTH 40

Audio serverAudio serverAudio serverAudio server

Purpose: data capture

Input: speech

Output: digitized version of speech

Considerations: Availability, Bandwidth, Drop-out and Barge-in

J. Gustafson, CTT, KTH 41

Speech RecognizerSpeech RecognizerSpeech RecognizerSpeech Recognizer

Purpose: convert speech to text

Input: digital speech

Output: String(s) or lattice that represent the most

probable utterance(s)

Considerations: Vocabulary size, Grammar and Speech type

J. Gustafson, CTT, KTH

What makes speechWhat makes speechWhat makes speechWhat makes speechrecognition difficult? recognition difficult? recognition difficult? recognition difficult?

Speaker Channel Listener

• Between speakers– Age– Gender

– Anatomy– Dialect

• Within speaker– Stress– Mood

– Health– Formel / Spontanteous

• Environment– Additive noise

– Room

• Microphone, Telephone– Bandwidth

– Noise

• Listener– Age– First Language

– Level of hearing– Known / Unknown

– Human / Maschine

Page 8: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

8

J. Gustafson, CTT, KTH 43

(.wav)

(.wav)

(.wav)

(.wav)

”Flyget, tåget och bilbranschen tävlar om lönsamhet och folkets gunst”.

Född iUSAex-Jugoslavien

(.wav)

J. Gustafson, CTT, KTH 44

Speech Recognition Speech Recognition Speech Recognition Speech Recognition

• Identify the speech sounds– Measure the spectrum of the speech signal 100

times per second

• Match the recognized sounds against a pronunciation dictionary– Only the allowed words are active

• Construct a sentence of the most probable words – Taking the probability of the context into account

J. Gustafson, CTT, KTH 45

Statistiskt baserad taligenkänningStatistiskt baserad taligenkänningStatistiskt baserad taligenkänningStatistiskt baserad taligenkänning

Sökning

Språkmodellmöjliga ordföljder

N-best

Meningsförståelse

N-bestN-bestN-bästa

Fonemmodellerakustisk beskr.

Lexikonvokabulär + fonemtranskription

Vald meningKunskap

Analys

Jämförelse

ParametriseringFFT16 kHz 100 Hz

Spektralanalys

Klassificering

Kontinuerlig Diskret

J. Gustafson, CTT, KTH 46

Olika tOlika tOlika tOlika talparametraralparametraralparametraralparametrar• Filterbanksamplituder (FFT)

– talets spektrum var 10 ms

– Mel-skala - baserad på örats frekvensupplösning

• Cepstrum– inversfouriertransform av logaritmiskt spektrum - ortogonala

• Artikulatoriska parametrar– nära kopplad till talproduktionen

– komplicerade att beräkna

• Hörselbaserade parametrar– förenklad modellering av hörseln

– lovande resultat för tal stört av buller och brus

J. Gustafson, CTT, KTH 47

Språkmodeller för igenkänningSpråkmodeller för igenkänningSpråkmodeller för igenkänningSpråkmodeller för igenkänning

• N-gram (ordföljdssannolikheter) ger bra resultat trots sin enkelhet– bigram: P(wi | wi-1)

– trigram: P(wi | wi-2, wi-1)

• Klasspar/Ordpar om träningsmaterialsaknas

J. Gustafson, CTT, KTH

Resultat år 2005Resultat år 2005Resultat år 2005Resultat år 2005

(tidigare resultat inom parentes)(tidigare resultat inom parentes)(tidigare resultat inom parentes)(tidigare resultat inom parentes)

Corpus Speech type Lexicon size Word Error Rate (%)

Human Error Rate (%)

Digit string (phone) spontaneous 10 0.3 0.009

Resource Management

read 1000 3.6 0.1

ATIS spontaneous 2000 2 -

Wall Street Journal

read 64000 6.6 1

Radio News mixed 64000 13.5 (20) -

Switchboard (phone)

conversation 10000 19.3 (38) 4

Call Home (phone) conversation 10000 30 (50) -

Page 9: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

9

J. Gustafson, CTT, KTH 49

Natural Language Natural Language Natural Language Natural Language UnderstandingUnderstandingUnderstandingUnderstanding

Purpose: produce meaning representation from ASR output

Input: String(s) lattice of words

Output: Meaning representation

Considerations: Type of grammar (Finite-state, Full parse or Word-spotting)

J. Gustafson, CTT, KTH 50

Phrases and syntaxPhrases and syntaxPhrases and syntaxPhrases and syntax

I want to take the boat

J. Gustafson, CTT, KTH 51

Robust Utterance AnalysisRobust Utterance AnalysisRobust Utterance AnalysisRobust Utterance Analysis

J. Gustafson, CTT, KTH 52

Pragmatics/world knowledgePragmatics/world knowledgePragmatics/world knowledgePragmatics/world knowledge

• Flights can be delayed• To be delayed is bad• ”Usually” → recurring in time. How often is ”usually”?

• Is this particular flight delayed a lot? • What would that mean for connecting flights?

S: There is a flight at 8.10.

U: Isn’t that usually delayed?

J. Gustafson, CTT, KTH 53

Dialogue ManagementDialogue ManagementDialogue ManagementDialogue Management

Purpose: decide the next system action

Input: a meaning representation from NLU

Output: High-level communicative goal(s)

J. Gustafson, CTT, KTH 54

Automating DialogueAutomating DialogueAutomating DialogueAutomating Dialogue

• To automate a dialogue, we must model:User’s Intentions User’s Intentions User’s Intentions User’s Intentions

• Recognized from input media forms (multi-modality)

• Disambiguated by the Context

System ActionsSystem ActionsSystem ActionsSystem Actions• Based on the task model and on output medium forms

(multi-media)

Dialogue FlowDialogue FlowDialogue FlowDialogue Flow• Content Selection (what to say)

• Information Packaging (how much to say)

• Turn-taking (who can speak/interrupt)

Page 10: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

10

J. Gustafson, CTT, KTH 55

Knowledge SourcesKnowledge SourcesKnowledge SourcesKnowledge Sourcesfor Dialogue Systemsfor Dialogue Systemsfor Dialogue Systemsfor Dialogue Systems

• Context:– Dialogue history– Current task, slot, state

• Ontology:– World Knowledge model– Domain model

• Language:– Models of conversational competence:

• Turn-taking strategies• Discourse obligations

– Linguistic models:• Grammars, Dictionaries, Corpora

• User Model:– Information about user that could be relevant to the dialogue

• Age, gender, spoken language, location, preferences, etc.• Dynamic information: the current Mental State (Belief, Desires, Intentions)

J. Gustafson, CTT, KTH 56

Dialogue Control StrategiesDialogue Control StrategiesDialogue Control StrategiesDialogue Control Strategies

• Finite-state based systems– dialog and states explicitly specified

• Frame based systems– dialog separated from information states

• Agent based systems– model of intentions, goals, beliefs

J. Gustafson, CTT, KTH 57

Interaction controlInteraction controlInteraction controlInteraction control

• Conversation– Exchange of information

– Control of the exchange of information

• Turn-taking– Control of the ’floor’

• Feedback – Perception, attention, understanding, attitude…

• Dialogue systems needs both turn-taking and feedback

J. Gustafson, CTT, KTH 58

Conversational “grunts”Conversational “grunts”Conversational “grunts”Conversational “grunts”

• Grunts occur an average of once every 5 seconds in American English conversation. (Nigel Ward, 2000)

• In Switchboard database

– um was the 6th most frequent item (after I, and, the, you, and a), (Nigel Ward, 2000)

– the four items uh, uh-huh and um and um-hum accounted for 4% of the total (Picone et al. 1998).

• DEMO

J. Gustafson, CTT, KTH 59

An example:An example:An example:An example:Telias customer care line 90200Telias customer care line 90200Telias customer care line 90200Telias customer care line 90200

• 14 million calls/year

• A number of areas/families

– Fixed telephony, Mobile telephony, Mobile data, Broad band, IP telephoni, IP-TV....

• A number of reasons for calling

– Billing, ordering, problem solving, support, information,

J. Gustafson, CTT, KTH 60

Simulation (WizardSimulation (WizardSimulation (WizardSimulation (Wizard----ofofofof----Oz)Oz)Oz)Oz)

Page 11: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

11

J. Gustafson, CTT, KTH 61

The point of WoZ simulationThe point of WoZ simulationThe point of WoZ simulationThe point of WoZ simulation

• Data for speech recognition• Typical utterance patterns• Typical dialogue patterns• Click-and-talk behaviour• Hints at important research problems

J. Gustafson, CTT, KTH 62

The WOZ recording environmentThe WOZ recording environmentThe WOZ recording environmentThe WOZ recording environment• Customer care agent were used as wizards:

– Listened to the customer (acting as ASR)

– Generate system answers via a prompt piano (acting as DM)

– Routing the call to the appropriate agent (acting as call router)

• All calls were logged and recorded• The wizard could route to themselves in problematic cases:– Less customer irritation

– Human error handling dialogues

– Used to collect interwievs

Wozmiljö och promptpiano framtaget av

Wirén, Eklund, Engberg and Westermark

J. Gustafson, CTT, KTH 63

• Ten real customer care agent were used as wizards (in-service WOZ collection)

• Five weeks of recordings resulted in 42,000 transcribed and tagged calls

The WoZ recordingsThe WoZ recordingsThe WoZ recordingsThe WoZ recordings

Wirén, M., Eklund, R., Engberg, F. and Westermark, J “Experiences of an In-Service Wizard-of-Oz Data Collection

for the Deployment of a Call-Routing Application”, Proceedings of the Workshop on Bridging the Gap: Academic and Industrial

Research in Dialog TechnologiesJ. Gustafson, CTT, KTH 64

A study on a system with a more A study on a system with a more A study on a system with a more A study on a system with a more natural interactions controlnatural interactions controlnatural interactions controlnatural interactions control

• The prompts were re-designed to be more spontanueos and conversational in style

• These should make the system more responsive (indicating listening, thinking)

• Feedback and grunts (mmm, okej) were added to be used to encourage the user to keep on describing their reason for calling

• Utterances as responset to channell checks (hello) and social utterances like greeting (Hi) were added

J. Gustafson, CTT, KTH 65

WizardWizardWizardWizard----ofofofof----Oz interface 2Oz interface 2Oz interface 2Oz interface 2

J. Gustafson, CTT, KTH 66

Example dialogExample dialogExample dialogExample dialogSYS: välkommen till telia vad kan jag hjälpa dig med?

KUND: ja hejsan [PAUS]

EEH [PAUS] hör du mig

SYS: ja hallå

KUND: ja [PAUS] jo EEH jag har problem med en räkning

SYS: MM?

KUND: jag har EHH anmält att jag har ingen telefon och

jag har fått en eh [PAUS] en räkning på ett abonnemang

SYS: okej

KUND: och jag har ringt flera flera gånger och försökt förklara det

att jag betalar inte så länge jag inte får telefonen inkopplad

SYS: gäller det din hemtelefon?

KUND: ja!

SYS: MM dröj lite så ska jag koppla dig

KUND: MM.

Welcome to Telia how may I help you?

Yes hello [PAUS]

EEH [PAUS] do you hear me?

Yes hello

yeah [PAUS] well EH I have a problem with my bill

MM?

I have EHH notified you that I don’t have a phone

And I have gotten a EEH [PAUS] billl

Okey..

Ind I have called you several times to tell you that

I refuse to pay until you reconnect my phone

Is this your home phone

Yes!

MM, hold on while I connect you

MM.

Page 12: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

12

J. Gustafson, CTT, KTH 67

Natural Language GenerationNatural Language GenerationNatural Language GenerationNatural Language Generation

Purpose: produce an output string to be shown/spoken to the user

Input: Representation from DM

Output: String for TTS (possibly marked for prosody, etc.)

J. Gustafson, CTT, KTH 68

Utterance GenerationUtterance GenerationUtterance GenerationUtterance Generation

• Predefined utterances• Frames with slots• Generation based on grammar and underlying semantics

J. Gustafson, CTT, KTH 69

TextTextTextText----totototo----Speech SynthesisSpeech SynthesisSpeech SynthesisSpeech Synthesis

Input: text string (with tags)

Purpose: speak string to user

Considerations: Human-like, Flexibility

J. Gustafson, CTT, KTH 70

What is speech synthesis?What is speech synthesis?What is speech synthesis?What is speech synthesis?• Recorded human speech

– Words and phrases

– fix vocabulary

• Systematically recorded fragments of speech– one fix speaker

• Record the phoneme transitions "diphones"

• Parametric synthesis (artificial)– Formant synthesis

– articulatory synthesis

• Multimodal synthesis (talking head)

J. Gustafson, CTT, KTH 71

TextTextTextText----tilltilltilltill----tal (TTS)tal (TTS)tal (TTS)tal (TTS)

text

Lingvistisk analys

Prosodisk analys

Fonetisk beskrivning

Ljudgenerering

Morfologisk analysLexikon och reglerSyntaxanalys

Regler och lexikon

Regler och enhetsval

Sammansättning Regler

Språkidentifiering

“abcd”

J. Gustafson, CTT, KTH Kent (1997). The Speech Sciences

SourceSourceSourceSourceSourceSourceSourceSource--------Filter Theory Filter Theory Filter Theory Filter Theory Filter Theory Filter Theory Filter Theory Filter Theory of Speech Production of Speech Production of Speech Production of Speech Production of Speech Production of Speech Production of Speech Production of Speech Production

Page 13: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

13

J. Gustafson, CTT, KTH Kent (1997). The Speech Sciences

FormantsFormantsFormantsFormantsFormantsFormantsFormantsFormants

J. Gustafson, CTT, KTH

General Rules of Vowel General Rules of Vowel General Rules of Vowel General Rules of Vowel

FormantsFormantsFormantsFormants(F1 = First Formant Frequency, (F1 = First Formant Frequency, (F1 = First Formant Frequency, (F1 = First Formant Frequency,

F2 = Second Formant Frequency)F2 = Second Formant Frequency)F2 = Second Formant Frequency)F2 = Second Formant Frequency)

Rules:

F1 associated with Tongue Height

• /i/ & /u/ (high vowels): Low F1

• /æ/ & /a/ (low vowels): High F1

F2 associated with A-P Position of tongue

• /u/ & /a/ (back vowels): Low F2

• /i/ & /æ/ (front vowels): High F2

Kent (1997). The Speech Sciences

J. Gustafson, CTT, KTH

TextText--toto--Speech Synthesis (TTS) Speech Synthesis (TTS)

EvolutionEvolution

1962 1967 1972 1977 1982 1987 1992 1997

Year

Formant

Synthesis

Bell Labs; Joint

Speech Research

Unit; MIT (DEC-

Talk); Haskins

Lab; KTH

LPC-Based

Diphone/Dyad

Synthesis

Bell Labs; CNET;

Bellcore; Berkeley

Speech

Technology

Unit Selection

Synthesis

ATR in Japan; CSTR

in Scotland; BT in

England; AT&T Labs

(1998); L&H in

Belgium

Poor

Intelligibility;

Poor Naturalness

Good

Intelligibility;

Poor Naturalness

Good

Intelligibility;

Customer Quality

Naturalness

(Limited Context)

larry rabiner J. Gustafson, CTT, KTH 76

Parametrical methods for Parametrical methods for Parametrical methods for Parametrical methods for speech synthesisspeech synthesisspeech synthesisspeech synthesis

Articulators

The form of the vocal tract

Tubes

SourceResonators

Electrical networks

Läppar

J. Gustafson, CTT, KTH 77

ArticulatoryArticulatoryArticulatoryArticulatoryArticulatoryArticulatoryArticulatoryArticulatory Synthesis IssuesSynthesis IssuesSynthesis IssuesSynthesis IssuesSynthesis IssuesSynthesis IssuesSynthesis IssuesSynthesis Issues

• requires highly accurate models of glottis and vocal tract

• requires rules for dynamics of the articulators

J. Gustafson, CTT, KTH 78

Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959Formant synthesis (1959--------1987)1987)1987)1987)1987)1987)1987)1987)

• Haskins, 1959 • KTH – Stockholm, 1962• Bell Labs, 1973 • MIT, 1976• MIT-talk, 1979• Speak ‘N spell, 1980• BELL Labs, 1985• Dec talk, 1987

larry rabiner

Page 14: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

14

J. Gustafson, CTT, KTH 79

PSOLAPSOLAPSOLAPSOLA

• Pitch pulses moved in time to fit F0 contour

• Conceptually simple and computationally efficient

- Need for precise pitch pulse marking- Could not handle spectral interpolation

J. Gustafson, CTT, KTH 80

Unit selectionUnit selectionUnit selectionUnit selection

• Large databases of recorded natural speech• Minimal processing• Annotation of database – what information is needed?

• Synthesis defaults to transcription and search problem

• Few cuts > maximally long units selected(but context and prosody must fit well)

• Target and concatenation costs

J. Gustafson, CTT, KTH 81

Concatenation UnitsConcatenation UnitsConcatenation UnitsConcatenation UnitsConcatenation UnitsConcatenation UnitsConcatenation UnitsConcatenation Units

• Words:– no complete coverage for broad domains � words

have to be supplemented with smaller units– limited ability to modify pitch, amplitude and

duration without losing naturalness and intelligibility

– need huge database to extract multiple versions of each word

• Sub-Words:– hard to isolate in context due to co-articulation– need allophonic variations to characterize units in

all contexts– puts large burden on signal processing to smooth at

unit join points

larry rabiner J. Gustafson, CTT, KTH 82

Modern TTS Systems (Natural Modern TTS Systems (Natural Modern TTS Systems (Natural Modern TTS Systems (Natural Modern TTS Systems (Natural Modern TTS Systems (Natural Modern TTS Systems (Natural Modern TTS Systems (Natural Voices from AT&T)Voices from AT&T)Voices from AT&T)Voices from AT&T)Voices from AT&T)Voices from AT&T)Voices from AT&T)Voices from AT&T)

• Soliloquy from Hamlet—• Gettysburg Address—• Bob Story—• German female—• UK British female—• Spanish female—• Korean female—• French male—

larry rabiner

J. Gustafson, CTT, KTH

Modern COMMERCIAL SystemsModern COMMERCIAL Systems

• Lucent• AcuVoice • Festival• L&H RealSpeak• SpeechWorks female• SpeechWorks male• Cselt (Actor) - Italian

larry rabiner J. Gustafson, CTT, KTH 84

Multimodal SynthesisMultimodal SynthesisMultimodal SynthesisMultimodal Synthesis

Input: tagged text string

Purpose: support synthesis with lip movements

Considerations: Synchronization, Human-like?

Page 15: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

15

J. Gustafson, CTT, KTH 85

Talking headTalking headTalking headTalking head

• Parametric 3D model• Articulatory based parameters

• Rule based control– speech

– gestures

• Tools for character creation• Windows, linux, unix

Listening & Thinking

J. Gustafson, CTT, KTH

Articulatory synthesisArticulatory synthesisArticulatory synthesisArticulatory synthesis

/usu/

J. Gustafson, CTT, KTH

Using humans to control Using humans to control Using humans to control Using humans to control

animated characters ...animated characters ...animated characters ...animated characters ...

J. Gustafson, CTT, KTH

Makes a convincing ”system”Makes a convincing ”system”Makes a convincing ”system”Makes a convincing ”system”

J. Gustafson, CTT, KTH 89

The NICE systemThe NICE systemThe NICE systemThe NICE system

J. Gustafson, CTT, KTH 90

BackgroundBackgroundBackgroundBackground

• NICE was a EU project 2002-2005• Five Partners

– Liquid Media, Sweden

– Scansoft (initially Philips, now Nuance), Germany

– Limsi, France

– TeliaSonera, Sweden

– Nislab, Denmark

• Two systems were developed– a Swedish Fairytale game

– an English HC Andersen story-teller

Page 16: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

16

J. Gustafson, CTT, KTH 91

The NICE system setupThe NICE system setupThe NICE system setupThe NICE system setup

J. Gustafson, CTT, KTH 92

A computer game with A computer game with A computer game with A computer game with multimodal dialoguemultimodal dialoguemultimodal dialoguemultimodal dialogue

• It puts new & different demands on speech and dialogue technology– young users– a game lasts for many hours– many kinds of dialogue: problem solving, negotiation,

information-seeking, socializing, command & control, multi-party, …

– the setup is multimodal and asynchronous

• Many things are easier than for ‘normal’ dialogue systems– low penalty for recognition errors– voluntary suspension of belief– gamers already know all the clichés

• $ Might pay off big some day $

J. Gustafson, CTT, KTH 93

Research challengesResearch challengesResearch challengesResearch challenges

• Conversational Fairytale characters – Different personalities and goals

– Autonomous

– Reactive

– Aware of the user, the environment and each other

• Conversational multimodal output generation

• Game-progression VS user control• Understanding children’s multi-modal input

J. Gustafson, CTT, KTH 94

Character design in NiceCharacter design in NiceCharacter design in NiceCharacter design in Nice

• Cloddy Hans looks stupid, but friendly –Karin looks a bit angry and smart

• Cloddy Hans is a classical friendly helper character

• The tasks the users have to solve are centered around what is seen in the scenes

• All characters have emotional and attention states

• The system features characters with different roles (e.g. helper and gatekeeper)

J. Gustafson, CTT, KTH 95

The fairyThe fairyThe fairyThe fairy----tale gametale gametale gametale game

• Two scenes – the room and the fairy-tale world– Scene one – Multimodal training session put-that-there– Scene two – Multi-party negotiation dialogue

• Three characters– Cloddy Hans – slow but friendly Helper character– Karen – fast and sullen Gatekeeper character– Thumbelina – silent supporting character with attitudinal gestures

J. Gustafson, CTT, KTH Interspeech07

Aug 30 2007

User Eh put that lamp in that hole eh pipe where the jewels are Concrete

Cloddy Could we call it valuable? Function

User Yes do that

Cloddy Do you want me to put the lamp in valuable? Function

User Yes do that

Cloddy I will go over to the shelf

User Yes do that

Cloddy I will go over to the machine

User Yeah and put it in that jewelery thing or the jewelery pipe Function+slot

Cloddy I will put the lamp in valuable Function

User Yes….Okay put the lamp in valuable oops did you do that already? Function (convergence)

User Put the sword in the pipe with the skull Concrete

Page 17: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

17

J. Gustafson, CTT, KTH 97

Inside a fairyInside a fairyInside a fairyInside a fairy----tale braintale braintale braintale brain

• World model: Interrelated (Java) objects representing things, characters, and their relationships.

• Discourse history: Who said what when?

• Agenda: Goals, actions and their causal relationships.

J. Gustafson, CTT, KTH

Scene scriptingScene scriptingScene scriptingScene scripting

Game

RoomScene

Intro Main

WorldScene

Intro

Exploration

Negotiation

Crossing

There is only one“current scene”

J. Gustafson, CTT, KTH 99

Characters are autonomousCharacters are autonomousCharacters are autonomousCharacters are autonomous

Virtual World

DispatcherUser

Karen’s

brain

Cloddy’s

brain

ASR

TTS

NLU

NLG

Dialogue acts

Spoken words

Animation track

J. Gustafson, CTT, KTH 100

MessageMessageMessageMessage----passingpassingpassingpassing

Virtual World

DispatcherUser

Karen’s

brain

Cloddy’s

brain

ASR

TTS

NLU

NLG

Receipt

Broadcast

Utterance

J. Gustafson, CTT, KTH 101

EventEventEventEvent----based dialoguebased dialoguebased dialoguebased dialogue

J. Gustafson, CTT, KTH 102

The Animation systemThe Animation systemThe Animation systemThe Animation system

Page 18: Dialogsystem, AGI Joakim Gustafson, CTT, KTH · • Research Spoken Dialogue Systems • Commercial Speech Applications ... TRANSACTIONAL 3RD Generation PROBLEM SOLVING (Pieracchini

Dialogsystem, AGI

Joakim Gustafson, CTT, KTH

18

J. Gustafson, CTT, KTH 103

System Module

(Message)

AnimationHandler

Request

Animation Renderer Request

GR(StartOfGesture) AttentionTo(GestureInput)

GR(EndOfGesture)

GI(Select(Lamp)) LookAt(Lamp) 2) Play(Animation(

TorsoTurnLeft(amount),

HeadTurnLeft(amount)))

ASR(StartOfSpeech) AttentionTo(SpeechInput)

ASR(EndOfSpeech) TakeTurnGesture 4) Play(Animation(Eyegaze(UpRight),

Eyebrow(frown)

))NLP(PickUp(Lamp))(by simple fusion)

Animation handler actions generated by Animation handler actions generated by Animation handler actions generated by Animation handler actions generated by ”pick it up”+selection of lamp”pick it up”+selection of lamp”pick it up”+selection of lamp”pick it up”+selection of lamp

1) RaiseEyebrows

3) TurnTo(ActiveCamera)

J. Gustafson, CTT, KTH 104

Animation handler actions used to Animation handler actions used to Animation handler actions used to Animation handler actions used to pick up the lamppick up the lamppick up the lamppick up the lamp

6a) TurnTo(Lamp)

6b) Play(Animation(

PickupGesture

))

6c) PickUp(Lamp)

6d) TurnTo(ActiveCamera)

System Module

(Message)Animation Handler

Request

Animation Renderer Request

DM(convey(tell(

PickUp(Lamp))))

NLG(convey(“I will pick up the lamp”)

Say(“I will pick up the lamp”)(the Synthesizer is first requested to

generate the sound file and lip-synch

track)

5) Play(Sound(synthesis_file)

Animation(Lipsynch_track)

)

DM(perform(PickUp(Lamp))) PickUp(Lamp)

J. Gustafson, CTT, KTH 105

AttentionalAttentionalAttentionalAttentional and emotionaland emotionaland emotionaland emotionaloutput behaviorsoutput behaviorsoutput behaviorsoutput behaviors

J. Gustafson, CTT, KTH 106

Body gestures and Body gestures and Body gestures and Body gestures and physical actionsphysical actionsphysical actionsphysical actions

J. Gustafson, CTT, KTH 107

Signs of social involvementSigns of social involvementSigns of social involvementSigns of social involvementand engagementand engagementand engagementand engagement

• Users form alliances with either Cloddy Hans or Karin

• User show repent when being accused of deceipt

• Users lie, make ironic, sarcastic and humorous remarks

• Users react to the character’s mood and use politeness markers

• Users lecture Cloddy Hans while making reference to common dialogue history.

J. Gustafson, CTT, KTH 108

The End