Spoken Dialogue Systems A Tutorial


Transcript of Spoken Dialogue Systems A Tutorial

Page 1: Spoken Dialogue  Systems A Tutorial

Gautam Varma Mantena, Dr. Kishore Prahallad

Speech and Vision Lab, International Institute of Information Technology - Hyderabad

Page 2: Spoken Dialogue  Systems A Tutorial

Speech and Vision Lab, International Institute of Information Technology - Hyderabad

Introduction

An interface between a human and a machine for information access. Some applications for information access:
- Search engines. Example: Google Search, Bing, etc.
- IVR applications. Example: most customer care lines.
- Dialogue systems. Example: Google chat bots ([email protected]).

Dialogue systems:
- Dialogue is a much freer mode of communication.
- Speech being the natural mode of communication, there is a need to build systems that communicate with users via speech.

Page 3: Spoken Dialogue  Systems A Tutorial

Introduction (contd.)

Modes of communication in a dialogue system:
- Text
- Speech
- Multimodal features like gestures, touch screens, etc.

Page 4: Spoken Dialogue  Systems A Tutorial

Dialogue

How is spoken dialogue different from other modes of communication?
- Input: input utterances can contain disfluencies (like 'uh' and 'hmm'), word repetitions, etc.
- Barge-ins: the user interrupts the system before allowing it to finish.
- Turn taking: the system should know when to speak (i.e., when to take control of the conversation).
- Prosodic information.
- Grounding with the user.

Page 5: Spoken Dialogue  Systems A Tutorial

Applications of Spoken Dialogue Systems
- ARISE: Automatic Railway Information System for Europe
- RoomLine: conference room scheduling and reservation
- Let's Go: bus information system
- TeamTalk: command-and-control interface to a team of robots
- Communicator: air travel planning
- Jupiter: automated weather service
- MIS: Mandi Information System
and many more.

Page 6: Spoken Dialogue  Systems A Tutorial

Characteristics of a Spoken Dialogue System
- Ability to understand the user's goal and to reach an appropriate and satisfactory solution.
- Ability to carry out sub-dialogues to achieve sub-goals.
- Ability to pass control from one sub-dialogue to another.
- Ability to vary the dialogue initiative mode from system initiative to user initiative.
- Use of a user model to anticipate the user's utterances and act aptly.
- Directing the user towards task completion.

Page 7: Spoken Dialogue  Systems A Tutorial

Components of a Spoken Dialogue System
- Automatic Speech Recognition (ASR): convert user input (utterance/speech) to text.
- Natural Language Understanding (NLU): structure the text into a format that can be understood.
- Dialogue Manager (DM): decide the actions to be performed based on the structured text.
- Natural Language Generation (NLG): generate the text that is to be conveyed to the user.
- Text-to-Speech Synthesis (TTS): convert the text to speech.
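The five components form a pipeline from user speech back to system speech. A minimal sketch, with every stage stubbed out (the intent labels, prompts, and price string are all invented for illustration; real systems plug in actual ASR/NLU/DM/NLG/TTS engines):

```python
def asr(audio):
    # Automatic Speech Recognition: audio -> text (stubbed as identity here)
    return audio

def nlu(text):
    # Natural Language Understanding: text -> structured representation
    return {"intent": "get_price", "commodity": text}

def dm(frame):
    # Dialogue Manager: choose the next action from the structured input
    return "inform_price" if frame.get("commodity") else "ask_commodity"

def nlg(action):
    # Natural Language Generation: action -> text
    return {"inform_price": "The price is 20 rupees per kg.",
            "ask_commodity": "What is the commodity name?"}[action]

def tts(text):
    # Text-to-Speech: text -> audio (stubbed as identity here)
    return text

def respond(user_audio):
    # One full turn through the pipeline
    return tts(nlg(dm(nlu(asr(user_audio)))))
```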

Page 8: Spoken Dialogue  Systems A Tutorial

Architecture of a Spoken Dialogue System

Page 9: Spoken Dialogue  Systems A Tutorial

Approaches to Build a Dialogue Manager

Classification can be done based on how the conversation is controlled:
- System initiative: the system has complete control of the conversation.
- User initiative: the user has complete dialogue control and the system only responds to user requests.
- Mixed initiative: both the user and the system take turns controlling the conversation.

Classification based on the functioning of the DM:
- Finite state
- Self organizing

Page 10: Spoken Dialogue  Systems A Tutorial

Finite State Dialogue Managers

The dialogue is structured as a sequence of predetermined utterances.

Page 11: Spoken Dialogue  Systems A Tutorial

Finite State Dialogue Managers (contd.)

The user is expected to answer only the system's queries; providing more information would be redundant.

FS-DM systems are system initiative, as they restrict user responses.

FS-DM systems are very robust, since the system always knows what state it is in and what the possible user responses are.

Useful when dealing with well-structured data, e.g. small enquiry systems, questionnaires, etc.
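A finite state DM can be sketched as a fixed transition table over predetermined states (the state names and prompts below are hypothetical, not from the tutorial):

```python
# Each state maps to (prompt, next_state); the path is fully predetermined.
STATES = {
    "ask_commodity": ("What is the commodity name?", "ask_district"),
    "ask_district":  ("What is the district name?",  "ask_market"),
    "ask_market":    ("What is the market name?",    "done"),
}

def run_dialogue(user_answers):
    state, filled = "ask_commodity", {}
    for answer in user_answers:
        _prompt, next_state = STATES[state]  # prompt would be spoken via TTS
        filled[state] = answer               # store the user's reply for this slot
        state = next_state
    return state, filled
```

Because every transition is known in advance, the system always knows which responses to expect, which is exactly what makes FS-DMs robust but restrictive.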

Page 12: Spoken Dialogue  Systems A Tutorial

Self Organizing Models

- The dialogue path is not predetermined; it evolves with the user's responses to the system.
- Mixed initiative.
- Provides a much freer form of communication, as the user is not restricted.

Types of self organizing models:
- Frame based
- Agent based
- Information state based

Page 13: Spoken Dialogue  Systems A Tutorial

Frame Based Dialogue Manager

Frame (or form) based models function like a form-filling application: the system keeps track of the information already present and the information still to be acquired.

The system chooses the required question to ask at every instant; the dialogue path is not predetermined.

The state of the DM is defined as the content of the frame.

A frame is a data structure which holds the concepts and the necessary actions to be performed.

Page 14: Spoken Dialogue  Systems A Tutorial

Frame Based Dialogue Manager (contd.)

Preconditions are the conditions under which the actions are to be performed.

Given all the questions and the preconditions, the system determines what actions to perform.

Example:

Concept      | Question                          | Precondition
Person name  | Prompt: What is the person name?  | Person name empty
Lab name     | Prompt: What is the lab name?     | Lab name empty
Detail type  | Prompt: What details do you want? | Detail type empty
Query type   | Execute: update query type        | Person name or Lab name not empty
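The control loop implied by this table can be sketched as follows. The frame is a dict of concept to value, and each turn the system returns the first action whose precondition ("slot empty") holds; for brevity this sketch fires the final action only once all three slots are filled, a simplification of the table's "person or lab name not empty" precondition:

```python
# Hypothetical slot names mirroring the example table above.
QUESTIONS = [
    ("person", "What is the person name?"),
    ("lab",    "What is the lab name?"),
    ("detail", "What details do you want?"),
]

def next_action(frame):
    for concept, prompt in QUESTIONS:
        if not frame.get(concept):          # precondition: concept is empty
            return ("Prompt", prompt)
    return ("Execute", "update query type") # all slots filled: run the query
```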

Page 15: Spoken Dialogue  Systems A Tutorial

Agent based Systems

An agent is one which perceives its environment and performs the necessary actions.

Information obtained from the environment by an agent is called percepts.

An agent's behavior depends on the sequence of percepts it receives from the environment.

The goal of an intelligent agent is to effectively map percept sequences to corresponding action sequences via a program called the agent program.

An example model is the plan based model.

Page 16: Spoken Dialogue  Systems A Tutorial

Agent based Systems (contd.)

In a plan based model, given an initial state and a goal state, the system constructs a plan, i.e. a sequence of operations to be performed.

The series of actions to be performed is defined by an action schema. An example action schema consists of the following parameters and constraints:

- Preconditions: necessary conditions for successful execution of the procedure.

Page 17: Spoken Dialogue  Systems A Tutorial

Agent based Systems (contd.)

- Effect: conditions that become true after execution of the procedure.
- Body: a set of goal states that are to be achieved in executing the procedure.
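One hypothetical way to encode such an action schema as a data structure, with its three parts (preconditions, effect, body) made explicit:

```python
from dataclasses import dataclass

@dataclass
class ActionSchema:
    name: str
    preconditions: frozenset   # conditions required before execution
    effect: frozenset          # conditions that become true afterwards
    body: tuple = ()           # sub-goals to achieve while executing

def applicable(schema, state):
    # the schema can execute only if all its preconditions hold in the state
    return schema.preconditions <= state

def execute(schema, state):
    # executing the schema makes its effect conditions true
    return state | schema.effect
```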

Page 18: Spoken Dialogue  Systems A Tutorial

Information State based Dialogue Systems

Dialogue management operations are all in terms of the information state.

The information state of a dialogue represents:
- The current information present.
- Information obtained from previous dialogue moves.
- Future actions to be performed.

The information state approach to dialogue modeling consists of:
- A description and representation of the information components that constitute an information state. In general they can be sub-divided into static and dynamic components.

Page 19: Spoken Dialogue  Systems A Tutorial

Information State based Dialogue Systems (contd.)

- Static components are those information components that do not change during a conversation with a user, e.g. domain knowledge, update rules, etc.
- A set of dialogue moves that update the information state.
- A set of update rules that control the update process of the information state. The rules are governed by the current information state and the performed dialogue move.
- An update strategy for deciding which update rule to execute from the set of applicable ones.
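The moves/rules/strategy cycle can be sketched as follows. The rule names, guards, and state keys are all invented for illustration; the strategy here is simply "first applicable rule":

```python
RULES = [
    # (name, guard(state, move), apply(state, move))
    ("integrate_answer",
     lambda s, m: m[0] == "answer" and s.get("pending_question"),
     lambda s, m: {**s, "beliefs": s["beliefs"] | {m[1]},
                   "pending_question": None}),
    ("register_question",
     lambda s, m: m[0] == "ask",
     lambda s, m: {**s, "pending_question": m[1]}),
]

def update(state, move, strategy=lambda rules: rules[0]):
    # collect the rules whose guards hold for this state and dialogue move,
    # let the strategy pick one, and apply it to produce the new state
    candidates = [r for r in RULES if r[1](state, move)]
    if not candidates:
        return state
    _, _, apply_rule = strategy(candidates)
    return apply_rule(state, move)
```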

Page 20: Spoken Dialogue  Systems A Tutorial

Stochastic Methods for Learning Dialogue Strategies

Dialogue systems like information state based systems are complex to build; the difficulty lies in incorporating the update moves and strategies.

One approach to overcome this complexity is to model the dialogue manager as a stochastic model and train it on a preliminary dialogue corpus.

In real-time systems it is difficult to foresee all possible dialogue scenarios that could occur between a user and a system; approximating the dialogue system with a stochastic model helps overcome such problems.

Page 21: Spoken Dialogue  Systems A Tutorial

Stochastic Methods for Learning Dialogue Strategies (contd.)

Some of the approaches used for modeling dialogue managers:
- Markov Decision Process (MDP)
- Partially Observable Markov Decision Process (POMDP)

Page 22: Spoken Dialogue  Systems A Tutorial

Markov Decision Process

A Markov decision process (MDP) is characterized by:
- A set of states S the agent can be in.
- A set of actions A the agent can take.
- A reward R(s, a) that the agent receives for taking an action.

The goal is to find a policy which specifies which action to take to receive the best reward.

The expected cumulative reward Q for a sequence of states and actions is

Q[s0, a0, s1, a1, ...] = R(s0, a0) + α R(s1, a1) + α² R(s2, a2) + ...

Page 23: Spoken Dialogue  Systems A Tutorial

Markov Decision Process (contd.)

α is called the discount factor and lies between 0 and 1. It gives more importance to the current reward than to future rewards: the further in the future a reward is, the more discounted its value.

Bellman equation:

Q(s, a) = R(s, a) + α Σ_{s1} P(s1 | s, a) max_{a1} Q(s1, a1)

We require hand-labeled data to model R(s, a) and P(s1 | s, a).
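The Bellman backup can be checked numerically by value iteration on a toy two-state MDP (the states, rewards, and deterministic transitions below are made up purely for illustration):

```python
STATES, ACTIONS, ALPHA = ["s0", "s1"], ["a0", "a1"], 0.9

def R(s, a):
    # immediate reward R(s, a): only (s1, a1) is rewarded
    return 1.0 if (s, a) == ("s1", "a1") else 0.0

def P(s1, s, a):
    # transition probability P(s1 | s, a): a0 always leads to s1, a1 to s0
    return 1.0 if s1 == ("s1" if a == "a0" else "s0") else 0.0

def value_iteration(iters=200):
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(iters):
        # one Bellman backup per entry, using the previous Q on the right
        Q = {(s, a): R(s, a) + ALPHA * sum(
                 P(s1, s, a) * max(Q[(s1, a1)] for a1 in ACTIONS)
                 for s1 in STATES)
             for s in STATES for a in ACTIONS}
    return Q
```

The resulting policy takes a0 in s0 (to reach the rewarded state) and a1 in s1 (to collect the reward), as the discounted sums dictate.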

Page 24: Spoken Dialogue  Systems A Tutorial

Partially Observable Markov Decision Process (POMDP)

Problems with MDP:
- The system might never know what state the user is in.
- The system state might be misdirected by recognition errors.

A POMDP is used to overcome the above problems:
- The environment is not completely observable to the agent.
- POMDP provides a framework to handle recognition errors by modeling them explicitly.
- It models the user output as an observed signal generated from another, hidden variable.
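Concretely, the system maintains a belief, a distribution over the hidden user states, and updates it from each noisy observation. A minimal sketch for a static hidden state (all state names and probabilities below are illustrative assumptions):

```python
def belief_update(belief, obs, obs_prob):
    # Bayes rule: b'(s) is proportional to P(obs | s) * b(s),
    # then normalized so the belief sums to 1
    unnorm = {s: obs_prob[s][obs] * p for s, p in belief.items()}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}
```

After hearing a noisy "rice", the system does not commit to a single state; it merely shifts probability mass towards the states that make that observation likely.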

Page 25: Spoken Dialogue  Systems A Tutorial

Evaluation of Spoken Dialogue Systems

Dialogue metrics can be classified as follows.

Objective measures: calculated automatically by the machine, without any human judgment. Some objective measures:
- Percentage of correct answers given by the system to a user.
- Number of turns taken to complete the task.
- Mean user response time.
- Mean system response time.
- Percentage of errors recognized by the dialogue system.
- Time taken to complete the task.
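Since objective measures are machine-computable, they can be derived directly from a dialogue log. A sketch computing a few of them (the log field names are hypothetical):

```python
def objective_measures(turns):
    # turns: list of dicts, one per turn, tagged with speaker and timing
    system_turns = [t for t in turns if t["speaker"] == "system"]
    user_turns = [t for t in turns if t["speaker"] == "user"]
    correct = sum(t.get("correct", False) for t in system_turns)
    return {
        "num_turns": len(turns),
        "pct_correct_answers": 100.0 * correct / len(system_turns),
        "mean_user_response_time": (
            sum(t["response_time"] for t in user_turns) / len(user_turns)),
    }
```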

Page 26: Spoken Dialogue  Systems A Tutorial

Evaluation of Spoken Dialogue Systems (contd.)

Subjective measures: require human analysis and a set of ground rules for user evaluation. Some subjective measures:
- System cooperation.
- User satisfaction.
- Percentage of correct and partially correct answers.

Page 27: Spoken Dialogue  Systems A Tutorial

Issues in Evaluating Spoken Dialogue Systems

- These metrics only work when the system's response is a single correct answer; they cannot compare systems whose outputs are summarized texts.
- The different metrics cannot be clearly demarcated: some of them are correlated and can be redundant.
- Given all the metrics, how do we combine them for evaluation?

Page 28: Spoken Dialogue  Systems A Tutorial

Mandi Information System

Page 29: Spoken Dialogue  Systems A Tutorial

Mandi Information System

A conversation system for accessing the prices of agricultural commodities.
- Target users: farmers in rural and semi-urban areas
- Commodities: vegetables, fruits, pulses, spices
- Markets: Andhra Pradesh, India
- Language: Telugu

Page 30: Spoken Dialogue  Systems A Tutorial

Issues/Challenges Involved

Noisy environment: the target users are primarily farmers in rural and semi-urban areas. Speech quality is affected by surrounding noise and by the distance of the mobile/telephone handset or microphone.

Dialect/pronunciation variation:
- Four distinct dialects of Telugu in Andhra Pradesh.
- Dialectal variation is a continuum.

Unstructured conversation: the target audience has no familiarity with computers, so conversation is unstructured, with disfluencies like repeats and false starts.

Personalization:
- Useful to guide the user in obtaining the required information as quickly as possible.
- Requires mining the user's previous calls to predict his/her preferred query during the next call.

Page 31: Spoken Dialogue  Systems A Tutorial

Building a baseline system

Data collection:
- Users were asked to read out the names of commodities, place names, etc. from text provided to them.
- Users were shown a series of pictures of agricultural commodities and asked to say the names of the commodities shown, to collect the dialectal variations of the commodity names.
- Users were asked a series of questions related to agriculture and the places around their locality, to record conversational speech.

Page 32: Spoken Dialogue  Systems A Tutorial

Building a baseline system (contd)

Dialogue flow is as follows

Page 33: Spoken Dialogue  Systems A Tutorial

Building a baseline system (contd)

A finite state dialogue manager was used, because:
- The domain is limited and only three inputs are required.
- The data is well structured.
- It is a real-time application and the system is expected to be very robust.

A sample conversation with the system:
System: What is the name of the commodity?
User: Orange
System: Did you say orange?
User: Yes

Page 34: Spoken Dialogue  Systems A Tutorial

Building a baseline system (contd)

Sample action schema for MIS: obtaining the commodity name.

Preconditions:
  NOT(COMMODITY_IS_AVAILABLE)
Body:
  PROMPT('What is the commodity name?')
  RECORD_USER_UTT
  SPEECH_RECOGNITION
  GET_LIST_DISTRICTS_SELLING(COMMODITY)
Effect:
  # Commodity is obtained
  COMMODITY_IS_AVAILABLE

Page 35: Spoken Dialogue  Systems A Tutorial

Building a baseline system (contd)

Obtaining the district name.

Preconditions:
  COMMODITY_IS_AVAILABLE and
  NOT(DISTRICT_IS_AVAILABLE) and
  IS_AVAILABLE(LIST_DISTRICTS_SELLING(COMMODITY))
Body:
  PROMPT(LIST_DISTRICTS_SELLING(COMMODITY))
  PROMPT('What is the district name?')
  RECORD_USER_UTT
  SPEECH_RECOGNITION
  GET_LIST_DISTRICTS_SELLING(COMMODITY)
Effect:
  # District is obtained
  DISTRICT_IS_AVAILABLE
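These schemas can be sequenced by a simple forward-chaining loop that repeatedly fires any schema whose preconditions hold and whose effect is not yet true. A toy sketch (the condition names come from the slides; the planner loop itself is illustrative):

```python
# Each schema is (name, preconditions, effect); the body is omitted here.
SCHEMAS = [
    ("obtain_commodity", set(),                      "COMMODITY_IS_AVAILABLE"),
    ("obtain_district",  {"COMMODITY_IS_AVAILABLE"}, "DISTRICT_IS_AVAILABLE"),
]

def plan(schemas, state):
    fired = []
    while True:
        for name, pre, effect in schemas:
            if pre <= state and effect not in state:
                fired.append(name)          # the schema's body would run here
                state = state | {effect}    # its effect becomes true
                break
        else:
            return fired, state             # no schema applies: plan is done
```

Starting from an empty state, the loop naturally asks for the commodity first, since the district schema's precondition is not yet satisfied.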

Page 36: Spoken Dialogue  Systems A Tutorial

Characteristics of the MIS baseline system

MIS expects three concepts or inputs from the user: the commodity, district, and market names.

When a user provides information to MIS, the goodness of the recognition hypothesis is checked by explicit confirmation. This makes sure the input query is right, as recognition is error prone.

The strategy of explicit confirmation is hardly convenient, as the user has to confirm every piece of information provided to the MIS queries.

Page 37: Spoken Dialogue  Systems A Tutorial

Mandi Information System II

A spoken dialogue system should provide accurate information to a user in a small number of turns (or interactions).

Speech recognition being error prone, it is difficult to avoid confirmations from users; however, the objective is to limit these confirmations.

One approach is to associate a confidence score with the recognition output of the ASR, and to decide the dialogue flow of the system based on this confidence measure.

Page 38: Spoken Dialogue  Systems A Tutorial

Multiple Decoders and Contextual Information

One could build multiple ASR decoders, where each decoder tries to capture complementary information about the speech data. This can be done by training multiple decoders using:
- Different training datasets
- Different features, such as Mel-frequency cepstral coefficients or linear prediction cepstral coefficients

If a majority of these decoders agree on a hypothesis, the recognized output is considered reliable, and the system can choose to avoid an explicit confirmation from the user.

Page 39: Spoken Dialogue  Systems A Tutorial

Multiple Decoders and Contextual Information (contd)

Given a set of decoders {d1, d2} and a given acoustic signal, let their corresponding hypotheses be {h1, h2}, and let C[i] be the contextual information for dialogue state i. The possible cases are:

Case 1: h1 = h2 and h1 belongs to C[i]
Action: The recognition output is most likely correct, and the system jumps to the subsequent dialogue state.

Case 2: h1 = h2 and h1, h2 do not belong to C[i]
Action: The recognition output is most likely correct, but the input is of no help as it is not present in the contextual information. The system prompts the user, saying that no such information is available for the given input, and asks the user to provide some other query.

Page 40: Spoken Dialogue  Systems A Tutorial

Multiple Decoders and Contextual Information (contd)

Case 3: h1 ≠ h2 and h1 or h2 belongs to C[i]
Action: The system considers the hypothesis present in the contextual information and discards the other. To make sure the recognition is correct, the system asks for an explicit confirmation from the user.

Case 4: h1 ≠ h2 and h1, h2 do not belong to C[i]
Action: A mis-recognition might have occurred, and the system prompts the user to provide the information again.
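The four cases translate directly into a decision function over the two hypotheses and the contextual set (the function name and return labels are hypothetical):

```python
def decide(h1, h2, C):
    if h1 == h2 and h1 in C:
        return "accept"               # Case 1: decoders agree, in context
    if h1 == h2:
        return "reject_query"         # Case 2: agree, but not in context
    if h1 in C or h2 in C:
        # Case 3: keep the in-context hypothesis, confirm explicitly
        return ("confirm", h1 if h1 in C else h2)
    return "reprompt"                 # Case 4: disagree, neither in context
```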

Page 41: Spoken Dialogue  Systems A Tutorial

Multiple Decoders and Contextual Information (contd)

Page 42: Spoken Dialogue  Systems A Tutorial

Multiple Decoders and Contextual Information (contd)

Sample conversation logged using MIS II:

System: What is the name of the district?
User: Karnul
  (CI: Karnul, Krishna, Nellur, ...)
  (Recognition AM-1, AM-2): Karnul
System: Red gram is sold in 6 markets in Karnul.
System: What is the market name?
User: Dhoni
  (CI: Adoni, Dhoni, ...)
  (Recognition AM-1): Dhoni
  (Recognition AM-2): Guti
System: Did you say Dhoni?
User: Yes

Page 43: Spoken Dialogue  Systems A Tutorial

MIS performance

Performance was measured based on whether the user was able to retrieve information or not.

Evaluation was conducted with 8 users, where each user was asked to retrieve information for any 5 commodities. The performance of the MIS baseline system and MIS II is as follows:

System       | Accuracy
MIS baseline | 57.5% (23/40)
MIS II       | 77.5% (31/40)

Page 44: Spoken Dialogue  Systems A Tutorial

Personalization

A frequent caller expects his/her preferred query to be automatically answered.

Such personalization requires identifying the user and mining the user's previous calls to predict his/her preferred query during the next call.

Currently the system logs user calls based on the caller id. If the user is a frequent caller, the system prompts a yes/no question about the most recent transaction they made.
Example: Do you want information of Rice in Vishakhapatnam?

The dialogue flow is user specific and based on the user's responses.
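The prediction step can be sketched as a lookup of the caller's logged queries by caller id, returning the most frequent one as the yes/no prompt (the log structure, caller ids, and phrasing below are hypothetical):

```python
from collections import Counter

# Hypothetical call log keyed by caller id: list of (commodity, district).
CALL_LOG = {
    "+91-9000000001": [("Rice", "Vishakhapatnam"),
                       ("Rice", "Vishakhapatnam"),
                       ("Tomato", "Krishna")],
}

def preferred_query(caller_id):
    calls = CALL_LOG.get(caller_id)
    if not calls:
        return None                # new caller: fall back to the normal flow
    # pick the caller's most frequent past query
    (commodity, district), _ = Counter(calls).most_common(1)[0]
    return f"Do you want information of {commodity} in {district}?"
```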

Page 45: Spoken Dialogue  Systems A Tutorial

Number: 66150320 extension 2

MIS Demo

Page 46: Spoken Dialogue  Systems A Tutorial

Thank You