Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers M. Gašić, F....


Date posted: 19-Dec-2015

Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers

M. Gašić, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, S. Young

Cambridge University Engineering Department | {mg436, fj228, sk561, farm2, brmt2, ky219, sjy}@eng.cam.ac.uk

Acknowledgements

This research was partly funded by the UK EPSRC under grant agreement EP/F013930/1 and by the EU FP7 Programme under grant agreement 216594 (CLASSiC project: www.classic-project.org).

POMDP-based Dialogue Management

• Models the uncertainty about the dialogue state by maintaining a distribution over all possible dialogue states in every turn – belief state

• Maximises the overall dialogue success by optimising Q-function Q(a,b(s)) – the highest expected long-term reward for action a taken in belief state b(s)

Problem: To ensure tractability of optimisation, standard methods discretise the belief space into a small number of points, causing the loss of information.

Solution: Model the continuous nature of the belief state and include the prior knowledge about the domain to speed up the learning process.

Gaussian Processes in Reinforcement Learning

• Gaussian Process (GP) – non-parametric Bayesian model for function approximation

• For given prior function correlations and some noisy function observations, it estimates the posterior of any function value

• GP-Sarsa is an online Reinforcement learning algorithm that models Q-function as a Gaussian Process

Kernel Function

If the Q-function value was known in one belief state-action pair what is the Q-function value in another belief state for the same action?

• Prior knowledge about Q-function correlations is incorporated in the kernel function.

• Kernel hyper-parameters can be learned from the data labelled with belief states, actions and rewards. In that way the kernel captures correlations found in the data.

Toy problem: VoiceMail Dialogue

Results on VoiceMail

Comparison:

• GP-Sarsa with various kernel functions

• Grid-based Monte Carlo Control algorithm

• Exact POMDP solution

In order to estimate the speed of convergence, the policy was evaluated after every training batch.

Real-world Problem: CamInfo Dialogue

Hidden Information State Dialogue Manager

• POMDP-based Dialogue Manager that can tractably maintain belief state for real-world problems

• Optimises policy in a reduced summary space

CamInfo Domain

• Tourist Information domain for Cambridge, UK

Results on CamInfo

Comparison:

• GP-Sarsa with polynomial kernel

• Active learning GP-Sarsa – during exploration selects the action that has the highest uncertainty estimated by the Gaussian Process

• Grid-based Monte Carlo Control Algorithm

Conclusion

• GP-Sarsa can obtain the optimal policy faster than standard methods provided an adequate choice of the kernel function

• The measure of uncertainty that GP-Sarsa is estimating can be utilised in an Active Learning framework to additionally speed up the learning process
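The active-learning variant of GP-Sarsa is described as selecting, during exploration, the action with the highest uncertainty estimated by the Gaussian Process. A minimal sketch of that selection rule follows. The epsilon-based switch between exploration and exploitation, the function name `select_action`, and the action names and numbers are assumptions made for illustration; the poster does not specify them.

```python
import random


def select_action(q_estimates, epsilon, rng=random):
    """Pick an action from per-action (mean, variance) estimates of Q.

    q_estimates maps each action to a (mean, variance) pair, such as a
    Gaussian Process posterior would provide. The epsilon switch is an
    assumption of this sketch, not a detail taken from the poster.
    """
    if rng.random() < epsilon:
        # Exploration: choose the action whose Q estimate is most uncertain.
        return max(q_estimates, key=lambda a: q_estimates[a][1])
    # Exploitation: choose the action with the highest expected Q value.
    return max(q_estimates, key=lambda a: q_estimates[a][0])


# Hypothetical estimates for three made-up summary actions.
estimates = {
    "confirm": (5.0, 0.2),
    "request": (3.0, 1.5),
    "inform": (4.0, 0.1),
}
explore_choice = select_action(estimates, epsilon=1.0)  # always explores
exploit_choice = select_action(estimates, epsilon=0.0)  # always exploits
```

With `epsilon=1.0` the rule always explores and returns the highest-variance action; with `epsilon=0.0` it always returns the highest-mean action.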

[Figure: POMDP dialogue loop. System prompts: "Would you like to save or delete the message?" and "Your message is deleted." Labels: b(s), b’(s), belief state, action, (immediate) reward.]

[Figure: axes labelled Q-function value, Action, Belief state.]

In the VoiceMail toy problem, the user asks the system to save or delete the message. The user input is corrupted with noise, so the true dialogue state is unknown.
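Because the true dialogue state is hidden, the system keeps a belief state b(s) and revises it after every noisy user input. The sketch below shows one way such an update could look for a two-state save/delete problem, using Bayes' rule. The 70% recognition accuracy, the uniform initial belief, and the assumption that the user's goal does not change between turns are illustrative choices, not values from the poster.

```python
def update_belief(belief, observation, confusion):
    """Bayes' rule: b'(s) is proportional to P(observation | s) * b(s).

    Assumes the hidden user goal stays fixed between turns, so no
    state-transition model is applied (an assumption of this sketch).
    """
    unnormalised = {
        state: confusion[state][observation] * prob
        for state, prob in belief.items()
    }
    total = sum(unnormalised.values())
    return {state: value / total for state, value in unnormalised.items()}


# Hypothetical noise model: the recogniser hears the true goal 70% of the time.
confusion = {
    "save": {"save": 0.7, "delete": 0.3},
    "delete": {"save": 0.3, "delete": 0.7},
}

belief = {"save": 0.5, "delete": 0.5}  # uniform initial belief (assumed)
belief_after_one = update_belief(belief, "delete", confusion)
belief_after_two = update_belief(belief_after_one, "delete", confusion)
```

Each repeated "delete" observation shifts more probability mass to the delete state, but the belief never becomes certain, which is why the policy is defined over b(s) rather than over a single assumed state.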

A Gaussian Process models every Q-function value Q(a,b(s)) as a Gaussian distributed random variable. The variance of the Gaussian distribution provides a measure of uncertainty about the approximation.
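To make the mean-and-variance view concrete, here is a small sketch of ordinary Gaussian Process regression over belief points for a single action. It is not GP-Sarsa itself: it assumes noisy Q-values are observed directly and a zero prior mean, whereas GP-Sarsa estimates the Q-function online from rewards. The squared-exponential kernel, its length scale, the noise variance, and all numbers are hypothetical.

```python
import math


def gaussian_kernel(x, y, length_scale=0.2):
    """Squared-exponential kernel between two belief vectors (assumed choice)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def gp_posterior(train_x, train_q, query, kernel, noise_var=0.01):
    """Posterior mean and variance of Q at `query`, assuming a zero prior mean."""
    n = len(train_x)
    gram = [
        [kernel(train_x[i], train_x[j]) + (noise_var if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    k_star = [kernel(x, query) for x in train_x]
    alpha = solve(gram, train_q)  # (K + noise * I)^-1 q
    beta = solve(gram, k_star)    # (K + noise * I)^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    variance = kernel(query, query) - sum(k * b for k, b in zip(k_star, beta))
    return mean, variance


# Made-up Q observations for one action at two nearby belief points.
beliefs = [(0.9, 0.1), (0.8, 0.2)]
q_values = [8.0, 6.0]

near_mean, near_var = gp_posterior(beliefs, q_values, (0.85, 0.15), gaussian_kernel)
far_mean, far_var = gp_posterior(beliefs, q_values, (0.1, 0.9), gaussian_kernel)
```

The query close to the observed belief points gets a small posterior variance, while the distant query falls back to roughly the prior variance. That variance is the uncertainty measure the text refers to.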

[Figure: "Which action leads to success?" Standard approach: Q(a,b(s)) value. GP-Sarsa: distribution of Q(a,b(s)) value. Axis: action a.]
