Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers M. Gašić, F....

Date posted: 19-Dec-2015


Gaussian Processes for Fast Policy Optimisation of POMDP-based Dialogue Managers

M. Gašić, F. Jurčíček, S. Keizer, F. Mairesse, B. Thomson, K. Yu, S. Young

Cambridge University Engineering Department | {mg436, fj228, sk561, farm2, brmt2, ky219, sjy}@eng.cam.ac.uk

Acknowledgements

This research was partly funded by the UK EPSRC under grant agreement EP/F013930/1 and by the EU FP7 Programme under grant agreement 216594 (CLASSiC project: www.classic-project.org).

Real-world Problem: CamInfo Dialogue

Kernel Function

POMDP-based Dialogue Management

• Models the uncertainty about the dialogue state by maintaining a distribution over all possible dialogue states in every turn: the belief state

• Maximises the overall dialogue success by optimising the Q-function Q(a,b(s)), the highest expected long-term reward for action a taken in belief state b(s)

Problem: To ensure tractability of optimisation, standard methods discretise the belief space into a small number of points, causing a loss of information.

Solution: Model the continuous nature of the belief state and include prior knowledge about the domain to speed up the learning process.

Gaussian Processes in Reinforcement Learning

Toy problem: VoiceMail Dialogue

Conclusion

• GP-Sarsa can obtain the optimal policy faster than standard methods, provided the kernel function is chosen adequately

• The measure of uncertainty that GP-Sarsa estimates can be used in an Active Learning framework to further speed up the learning process

Results on CamInfo

• Gaussian Process (GP) – non-parametric Bayesian model for function approximation

• Given prior function correlations and some noisy function observations, it estimates the posterior of any function value

• GP-Sarsa is an online reinforcement learning algorithm that models the Q-function as a Gaussian Process
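The posterior estimate described in these bullets can be illustrated with plain GP regression. The following is a minimal sketch and not the GP-Sarsa algorithm itself: it assumes a squared-exponential kernel, a one-dimensional input and a fixed noise variance, all chosen here purely for illustration, and the function names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=0.2, signal_var=1.0):
    """Squared-exponential kernel (an assumed choice of prior correlations)."""
    diff = x1[:, None] - x2[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.01):
    """Posterior mean and variance of the function values at x_test."""
    # Prior covariance of the noisy observations.
    k_tt = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    # Prior covariance between test points and observations.
    k_st = sq_exp_kernel(x_test, x_train)
    k_ss = sq_exp_kernel(x_test, x_test)
    mean = k_st @ np.linalg.solve(k_tt, y_train)
    cov = k_ss - k_st @ np.linalg.solve(k_tt, k_st.T)
    return mean, np.diag(cov)

# Three noisy observations (illustrative values only).
x_train = np.array([0.1, 0.5, 0.9])
y_train = np.array([0.0, 1.0, 0.0])

# One test point on top of an observation, one far away from all of them.
x_test = np.array([0.5, 3.0])
mean, var = gp_posterior(x_train, y_train, x_test)
print("posterior mean:", mean)
print("posterior variance:", var)
```

Near an observation the posterior variance is small, while far from the data it falls back to the prior variance. This is the uncertainty measure that the poster proposes to use for active learning.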

If the Q-function value was known in one belief state-action pair, what is the Q-function value in another belief state for the same action?

• Prior knowledge about Q-function correlations is incorporated in the kernel function.

• Kernel hyper-parameters can be learned from the data labelled with belief states, actions and rewards. In that way the kernel captures correlations found in the data.
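One way to encode such prior correlations is a kernel over belief state-action pairs. The sketch below is an assumption made for illustration, not the kernel used by the authors: it multiplies a polynomial kernel on the belief vectors by a delta kernel on the actions, so that Q-values are correlated only when the action is the same. The degree, the offset and the action names are hypothetical.

```python
import numpy as np

def belief_action_kernel(b1, a1, b2, a2, degree=2, offset=1.0):
    """Hypothetical kernel: polynomial in the belief vectors, delta on actions."""
    if a1 != a2:
        # Assumed: Q-values of different actions are uncorrelated a priori.
        return 0.0
    return float((np.dot(b1, b2) + offset) ** degree)

def gram_matrix(points, **kernel_args):
    """Kernel matrix over a list of (belief vector, action) pairs."""
    n = len(points)
    gram = np.empty((n, n))
    for i, (b_i, a_i) in enumerate(points):
        for j, (b_j, a_j) in enumerate(points):
            gram[i, j] = belief_action_kernel(b_i, a_i, b_j, a_j, **kernel_args)
    return gram

# Belief vectors over two dialogue states; values are illustrative only.
points = [
    (np.array([0.9, 0.1]), "save"),
    (np.array([0.8, 0.2]), "save"),
    (np.array([0.9, 0.1]), "delete"),
]
gram = gram_matrix(points)
print(gram)
```

In this sketch `degree` and `offset` play the role of kernel hyper-parameters; learning them from data is not shown.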

Results on VoiceMail

Comparison:

• GP-Sarsa with various kernel functions

• Grid-based Monte Carlo Control algorithm

• Exact POMDP solution

To estimate the speed of convergence, the policy was evaluated after every training batch.

Hidden Information State Dialogue Manager

• POMDP-based Dialogue Manager that can tractably maintain belief state for real-world problems

• Optimises policy in a reduced summary space

CamInfo Domain

• Tourist Information domain for Cambridge, UK

Comparison:

• GP-Sarsa with polynomial kernel

• Active learning GP-Sarsa – during exploration, selects the action with the highest uncertainty as estimated by the Gaussian Process

• Grid-based Monte Carlo Control Algorithm
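The active learning rule in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-action posterior means and variances are taken as given, exploration is triggered with an epsilon-greedy rule (the poster does not say how exploration steps are chosen), and the function name, action names and numbers are hypothetical.

```python
import random

def choose_action(q_mean, q_var, epsilon=0.1, rng=random):
    """Pick an action from GP estimates of Q(a, b(s)) in the current belief state.

    q_mean and q_var map each action to the posterior mean and variance.
    """
    if rng.random() < epsilon:
        # Exploration step: take the action the GP is least certain about.
        return max(q_var, key=q_var.get)
    # Exploitation step: take the action with the highest estimated Q-value.
    return max(q_mean, key=q_mean.get)

# Illustrative estimates for one belief state.
q_mean = {"save": 0.4, "delete": 0.7, "ask": 0.5}
q_var = {"save": 0.30, "delete": 0.05, "ask": 0.10}

explored = choose_action(q_mean, q_var, epsilon=1.0)   # always explores
exploited = choose_action(q_mean, q_var, epsilon=0.0)  # never explores
print(explored, exploited)
```

Compared with picking a random action during exploration, this rule directs exploration at the actions whose Q-value estimates are least certain.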

Figure (VoiceMail dialogue): the system asks "Would you like to save or delete the message?" and confirms "Your message is deleted." Diagram labels: belief states b(s) and b’(s), action a, (immediate) reward r. Plot labels: Q-function value, Action, Belief state.

The user asks the system to save or delete the message. The user input is corrupted with noise, so the true dialogue state is unknown.
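Because the true state is hidden, the system tracks a belief state instead. A minimal sketch of a Bayesian belief update for this toy problem is given below. It assumes two states ("save" and "delete"), a user intent that does not change during the dialogue, and a recogniser that reports the correct intent with probability 0.7. None of these values come from the poster.

```python
def update_belief(belief, observation, p_correct=0.7):
    """Bayes update of b(s) after one noisy observation of the user's intent."""
    unnormalised = {}
    for state, prob in belief.items():
        # Assumed observation model with two states: the observation
        # matches the true state with probability p_correct.
        likelihood = p_correct if state == observation else 1.0 - p_correct
        unnormalised[state] = likelihood * prob
    total = sum(unnormalised.values())
    return {state: value / total for state, value in unnormalised.items()}

# Start from a uniform belief and observe "delete" twice.
b0 = {"save": 0.5, "delete": 0.5}
b1 = update_belief(b0, "delete")
b2 = update_belief(b1, "delete")
print(b1)
print(b2)
```

Each consistent observation shifts more probability mass towards "delete", but the belief never becomes fully certain. That is why the policy is defined over belief states rather than over a single assumed state.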

A Gaussian Process models every Q-function value Q(a,b(s)) as a Gaussian-distributed random variable. The variance of the Gaussian distribution provides a measure of uncertainty about the approximation.

Figure: Which action leads to success? Panels: Standard approach (Q(a,b(s)) value for each action a) and GP-Sarsa (distribution of the Q(a,b(s)) value for each action a).