Reward Estimation for Dialogue Policy Optimisation
Pei-Hao (Eddy) Su · DeepHack.Turing · 25 July 2017

Transcript

Page 1

Reward Estimation for Dialogue Policy Optimisation
Pei-Hao (Eddy) Su, DeepHack.Turing, 25 July 2017

Page 2

Variants of the Seq2Seq model: [Vinyals and Le 2015] [Serban et al. 2016] [Al-Rfou et al. 2016] [Li et al. 2016]

Dialogue Systems

• Chat-based Agents
  – Hope to talk about everything (open domain)
  – No specific goal; focus on conversation flow
• Task-oriented Systems
  – Achieve a certain task (closed domain)
  – Combination of rule-based and statistical components
  – Ground language using a knowledge base (ontology)
• Pipeline dialogue systems [Henderson et al. 2005, Williams and Young 2007]
• End-to-End dialogue systems [Bordes et al. 2017, Wen et al. 2017]

"Why are the middle ages called the Dark Ages?"
"Because there were so many knights…"

Page 3

Task-oriented Dialogue System

With paid subjects

Task:
- Find a restaurant: Chinese, cheap, west
- Ask for phone, address

System: Hi, how may I help you?
User:   I want a cheap Chinese restaurant.
System: Where in the city would you like?
User:   Somewhere in the west, please.
System: Yim Wah is a nice Chinese place.
User:   Great, can you give me the address?
System: It is at 2-4 Lensfield Road.
User:   Ok, thank you, bye!
System: Thanks, goodbye.

Success evaluation:
- Objective: Fail (no phone given)
- Subjective: Success (got everything he asked for)

→ Ambiguity; not practical

Page 4

Goal

Define a learning objective (reward) to train a dialogue system on-line from real users

• Tasks
  – Evaluate the dialogue (reward modelling)
  – Deal with unreliable user ratings
  – Learn a dialogue policy
• Models
  – Recurrent neural networks, Gaussian processes
• Methods
  – Reinforcement learning, on-line learning, active learning

Page 5

Outline

• Motivation – Learning from real users
• Proposed Framework
• Experiments
• Conclusion

Page 6

Pipeline Spoken Dialogue System

[Figure: the pipeline SDS. Speech Understanding (ASR, Semantic Decoder) feeds the Dialogue Manager (Belief State Tracker, Dialogue Policy), which feeds Speech Generation (NLG, TTS) back to the User.]

Example turn:
- User: "Somewhere in the west, please."
- ASR: "Somewhere in the wet, please."
- Belief state for the Area slot: East 0.01, West 0.94, …, None 0.05
- System act: Inform(name=Yim Wah, area=west)
- NLG: "Yim Wah is a nice place in the west."
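To make the data flow concrete, here is a minimal Python sketch of the single turn shown above; the function names (semantic_decode, dialogue_policy, nlg) and the hard-coded belief are placeholders for illustration, not the actual CUED components.

```python
# Minimal sketch of one pipeline turn (hypothetical helper names, not the real system).

def semantic_decode(asr_hypothesis: str) -> dict:
    """Map a (possibly misrecognised) ASR hypothesis to a belief over the 'area' slot."""
    # A real system would use a statistical semantic decoder plus a belief tracker.
    return {"east": 0.01, "west": 0.94, "none": 0.05}

def dialogue_policy(belief: dict) -> tuple:
    """Pick a system act from the belief state (here: inform about the most likely area)."""
    area = max(belief, key=belief.get)
    return ("inform", {"name": "Yim Wah", "area": area})

def nlg(act: tuple) -> str:
    """Turn the system act back into text."""
    _, slots = act
    return f"{slots['name']} is a nice place in the {slots['area']}."

asr_output = "Somewhere in the wet, please."   # ASR error: "west" heard as "wet"
belief = semantic_decode(asr_output)           # {'east': 0.01, 'west': 0.94, 'none': 0.05}
system_act = dialogue_policy(belief)           # ('inform', {'name': 'Yim Wah', 'area': 'west'})
print(nlg(system_act))                         # "Yim Wah is a nice place in the west."
```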

Page 7

Reinforcement Learning 101

[Figure: the agent-environment loop. The agent observes the environment, takes an action (e.g. the next move), and receives a reward.]

The agent learns to take actions that maximise the total reward.
It beat Go champions in 2016 and 2017 (AlphaGo).

Page 8

Reinforcement Learning 101

The agent learns to take actions that maximise the total reward.

[Figure: the same agent-environment loop applied to dialogue. The agent is the dialogue policy; the environment is the user.]

Example: observation = belief state (area=north: 0.94, area=east: 0.06); action = Inform(area=north), i.e. the summary action Inform_area.
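A toy sketch of how the loop maps onto dialogue, assuming an epsilon-greedy choice over summary actions; the action names and empty Q-table are illustrative only, not the GP policy used later in the talk.

```python
import random

# Toy illustration: the belief state is the observation, the policy picks a summary action.
ACTIONS = ["request_area", "inform_area", "bye"]

def choose_action(belief, q_values, epsilon=0.1):
    """Epsilon-greedy choice over summary actions given the belief state."""
    state = tuple(sorted(belief.items()))
    if random.random() < epsilon:
        return random.choice(ACTIONS)                                   # explore
    return max(ACTIONS, key=lambda a: q_values.get((state, a), 0.0))    # exploit current estimates

belief = {"area=north": 0.94, "area=east": 0.06}    # observation from the belief tracker
action = choose_action(belief, q_values={})          # e.g. 'inform_area' -> Inform(area=north)
print(action)
```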

Page 9

Dialogue Manager in the RL framework

[Figure: the user is the environment; the agent comprises Speech Understanding, the Dialogue Policy, and Speech Generation, exchanging Observation O, Action A, and Reward R with the user.]

Correct rewards are a crucial factor in dialogue policy training.

Page 10

Reward for RL ≅ Evaluation for SDS

• Dialogue is a special RL task: humans are involved in both the interaction and the rating (evaluation) of a dialogue.
• Human-in-the-loop framework: the human is troublesome but useful.
• Rating: correctness, appropriateness, and adequacy
  - Expert rating: high quality, high cost
  - User rating: unreliable quality, medium cost
  - Objective rating: checks desired aspects, low cost

Page 11

The Reinforcement Signal in SDS

Typical reward function:
• per-turn penalty of -1
• large reward at completion if successful

Typically requires prior knowledge of the task:
✔ Simulated user
✔ Paid users (Amazon Mechanical Turk)
✖ Real users
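This hand-coded reward can be written down directly. A minimal sketch: the -1 per-turn penalty and the +20 success bonus match the experimental setup described later in the talk, but both values are design choices rather than universal constants.

```python
def dialogue_reward(num_turns: int, success: bool,
                    turn_penalty: float = -1.0, success_reward: float = 20.0) -> float:
    """Typical task-success reward: -1 per turn plus a large bonus at completion if successful."""
    return turn_penalty * num_turns + (success_reward if success else 0.0)

print(dialogue_reward(num_turns=6, success=True))    # 14.0
print(dialogue_reward(num_turns=6, success=False))   # -6.0
```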

Page 12

The Reinforcement Signal in SDS

How to learn a policy from real users?
• Infer success (reward) directly from the dialogues
• Train a reward estimator from data (Su et al. 2015)

[Figure: an RNN success classifier. The turn-level features f1, …, fT pass through hidden layers; the output predicts dialogue success (1/0).]
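A rough PyTorch sketch of such an estimator, assuming a GRU over the turn features and a sigmoid output; the exact architecture in Su et al. 2015 differs, so treat this purely as a shape illustration.

```python
import torch
import torch.nn as nn

class RNNSuccessEstimator(nn.Module):
    """Sketch of an RNN reward estimator: turn features f1..fT -> P(dialogue success)."""

    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, feats):                    # feats: (batch, T, feat_dim)
        _, h_last = self.rnn(feats)              # h_last: (1, batch, hidden_dim)
        return torch.sigmoid(self.out(h_last[-1])).squeeze(-1)   # P(success) per dialogue

model = RNNSuccessEstimator(feat_dim=30)
dialogue = torch.randn(1, 12, 30)                # one dialogue of 12 turns (random stand-in features)
p_success = model(dialogue)                      # train with nn.BCELoss() against 1/0 success labels
print(p_success.item())
```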

Page 13

The Reinforcement Signal in SDS

RNN Reward Estimator for Policy Learning

The RNN system learnt the policy more practically and efficiently than the objective baseline.

Objective baseline:
• Needs task info.
• Learns only from dialogues where Obj = Subj (500 out of ~900)

RNN system:
• No task info or user feedback needed
• Learns from every dialogue (all 500)

Page 14

The Reinforcement Signal in SDS

How to learn a policy from real users?
• Infer success (reward) directly from the dialogues
• Train a reward estimator from data (Su et al. 2015)

• User rating
  - Noisy
  - Difficult / costly to obtain
• Robust user-rating model (Su et al. 2016)
  - Noisy → Gaussian process with uncertainty
  - Difficult / costly → Active learning

Page 15

Outline

• Motivation – Learning from human users
• Proposed Framework
• Experiments
• Conclusion

Page 16

System Framework

[Figure: the same agent-environment loop, now with a Reward Model added. The agent (Speech Understanding, Dialogue Policy, Speech Generation) exchanges Observation O and Action A with the user; the user's rating feeds the Reward Model, which supplies the reinforcement signal.]

Page 17

System Framework

Reward modelling on the user's binary success rating

[Figure: inside the Reward Model. (A) An embedding function maps the dialogue to a dialogue representation; (B) the reward model predicts Success/Fail from it and emits the reinforcement signal, issuing a query for a user rating when needed.]

Page 18

A. Dialogue Embedding

Maps a dialogue sequence to a fixed-length vector.

Per-turn feature ft = [ distribution over user intention ; 1-hot system action ; rescaled turn number ]
Training data: {f1, …, fT} for each dialogue
(Vandyke & Su et al., ASRU 2015)
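A small sketch of how such a per-turn feature vector could be assembled; the dimensions, the choice of max_turns, and the helper name are assumptions for illustration, not the paper's exact feature design.

```python
import numpy as np

def turn_feature(intent_dist: np.ndarray, sys_action_id: int, turn_idx: int,
                 num_actions: int, max_turns: int = 30) -> np.ndarray:
    """Per-turn feature f_t = [user-intention distribution ; one-hot system action ; rescaled turn]."""
    one_hot_action = np.zeros(num_actions)
    one_hot_action[sys_action_id] = 1.0
    rescaled_turn = np.array([turn_idx / max_turns])   # scale the turn index into [0, 1]
    return np.concatenate([intent_dist, one_hot_action, rescaled_turn])

# Toy example: 4 user intentions, 6 possible system actions, turn 3 of the dialogue.
f_t = turn_feature(np.array([0.7, 0.2, 0.05, 0.05]), sys_action_id=2, turn_idx=3, num_actions=6)
print(f_t.shape)   # (11,); a dialogue is then the sequence {f_1, ..., f_T}
```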

Page 19

A. Dialogue Embedding – Supervised

Re-use the supervised RNN
• Last hidden layer as the dialogue representation ds

[Figure: the supervised RNN over f1 … fT; the final hidden layer is taken as ds.]

Page 20

A. Dialogue Embedding – Unsupervised

Bi-LSTM Encoder-Decoder (Seq2Seq)
• Reconstructs inputs of variable lengths
• The hidden state [forward ; backward] captures forward and backward information
• The bottleneck du is the dialogue representation
• MSE training criterion: E = Σt ‖ft − f't‖²
• ft: input/target, f't: prediction
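A compact PyTorch sketch of this idea, assuming the concatenated forward/backward encoder state initialises a decoder that reconstructs the turn features under an MSE loss; the real model's details (cell sizes, decoder conditioning) may well differ.

```python
import torch
import torch.nn as nn

class DialogueAutoencoder(nn.Module):
    """Sketch: a BiLSTM encoder compresses {f_1..f_T} into a bottleneck d_u, and an LSTM decoder
    reconstructs the turn features; trained with an MSE loss between f_t and f'_t."""

    def __init__(self, feat_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(feat_dim, 2 * hidden_dim, batch_first=True)
        self.project = nn.Linear(2 * hidden_dim, feat_dim)

    def embed(self, feats):                         # feats: (batch, T, feat_dim)
        _, (h, _) = self.encoder(feats)             # h: (2, batch, hidden_dim)
        return torch.cat([h[0], h[1]], dim=-1)      # d_u = [forward ; backward]

    def forward(self, feats):
        d_u = self.embed(feats)
        h0 = d_u.unsqueeze(0)                       # condition the decoder on d_u
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(feats, (h0, c0))
        return self.project(dec_out), d_u           # f'_t predictions and the embedding

model = DialogueAutoencoder(feat_dim=30)
feats = torch.randn(1, 12, 30)                      # one 12-turn dialogue (random stand-in)
recon, d_u = model(feats)
loss = nn.MSELoss()(recon, feats)                   # MSE between f'_t and f_t (mean-reduced here)
print(loss.item(), d_u.shape)                       # d_u is the fixed-length dialogue representation
```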

Page 21

B. Active Reward Learning Model

• Determine the class probability p(y | d, D), given D = {(di, yi)}, i = 1…N, where y ∈ {+1, −1}
• Handle the issue of noisy and costly user ratings
• Gaussian process (GP) with active learning

Page 22

B. Active Reward Learning Model

Gaussian process classifier for success rating
• A GP has been shown useful in policy learning (Gasic '14, Casanueva '15)
  - Learns from few observations
  - Provides a measure of uncertainty
• p(y = 1 | d, D) = φ(f(d | D))
  - f: latent function, R^dim(d) → R
  - φ: probit function, R → [0, 1]
• f(d) ~ GP(m(d), k(d, d'))

[Figure: the latent function f(d) over the dialogue representation d (x: labelled data, d*: test point) is squashed by φ into the success probability p(y=1).]

Page 23

B. Active Reward Learning Model

Gaussian process classifier for success rating
• Prior: f(d) ~ GP(m(d), k(d, d'))
• Predictive distribution: p(y = 1 | d, D) = φ(f(d | D))
• Prediction on d*: p(y* = 1 | d*, D) = φ( m* / √(1 + σ*²) )
  (as m* / √(1 + σ*²) → 0, φ(·) → 0.5, i.e. maximal uncertainty)

[Figure: the latent function f(d) with its posterior uncertainty over the dialogue representation d (x: labelled data, d*: test point), squashed into the success probability p(y=1).]
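A small NumPy sketch of the prediction formula above, using a GP-regression posterior over ±1 labels as a simple stand-in for a full GP classifier; the kernel hyperparameters and toy data are made up, and the noise term on the kernel diagonal previews the next slide's point about unreliable ratings.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def gp_success_probability(D_X, D_y, d_star, noise=0.1):
    """p(y*=1 | d*, D) ~= phi( m* / sqrt(1 + sigma*^2) ), with a GP-regression posterior over
    labels y in {-1, +1} standing in for full GP classification."""
    K = rbf_kernel(D_X, D_X) + noise * np.eye(len(D_X))      # noise term: unreliable user ratings
    k_star = rbf_kernel(D_X, d_star[None, :])                # (N, 1)
    K_inv = np.linalg.inv(K)
    m_star = (k_star.T @ K_inv @ D_y).item()                 # posterior mean of f(d*)
    var_star = (rbf_kernel(d_star[None, :], d_star[None, :])
                - k_star.T @ K_inv @ k_star).item()          # posterior variance of f(d*)
    return norm.cdf(m_star / np.sqrt(1.0 + var_star))        # probit squashing phi(.)

# Toy labelled set D = {(d_i, y_i)} in a 2-D embedding space.
D_X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
D_y = np.array([-1.0, 1.0, 1.0])
print(gp_success_probability(D_X, D_y, np.array([1.0, 1.0])))     # near a success: p well above 0.5
print(gp_success_probability(D_X, D_y, np.array([10.0, -10.0])))  # far from the data: m* -> 0, p -> 0.5
```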

Page 24

B. Active Reward Learning Model

Gaussian process classifier for success rating
🛠 Handle the issue of noisy and costly user ratings
• Add a noise term to the RBF kernel: the RBF part models input correlation, the noise term models user-rating noise
  - More noise → less certain
• Active learning: threshold on the probability
  - λ decides when to query the user for a rating
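A sketch of the query rule implied by the threshold λ: ask the user only when the predicted success probability is close to 0.5. The exact form of the rule (a symmetric band around 0.5) is an assumption for illustration.

```python
def needs_user_rating(p_success: float, lam: float = 0.15) -> bool:
    """Assumed active-learning rule: query the user only when the reward model is uncertain,
    i.e. when p(y=1 | d, D) falls inside [0.5 - lam, 0.5 + lam]."""
    return abs(p_success - 0.5) <= lam

for p in (0.05, 0.48, 0.62, 0.97):
    print(p, "-> query user" if needs_user_rating(p) else "-> trust the model")
```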

Page 25

B. Active Reward Learning Model

Categories of Active Learning

[Figure: categories of active learning, from Settles, Active Learning Literature Survey, 2009; annotated with where the noisy rating enters.]

Page 26

System Framework

Active Reward Model in the loop

[Figure: at the end of each dialogue the turn features {f1, …, fT} are embedded, σ(f1:T) → d*. The GP predicts success from d* and the labelled set D = {(d, y)}. If d* falls in the uncertain (green) region, query the user: e.g. the user rates the dialogue as Failed, and the reward becomes −1 × scalar.]
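Putting the pieces together, a sketch of the reward model in the loop at the end of each dialogue; the helper names, the callable interfaces, and the [0.5 − λ, 0.5 + λ] query region are assumptions, while the 20·success − T return matches the reward used in the experiments.

```python
import numpy as np

def reward_with_active_model(d_star, D_X, D_y, num_turns, success_prob, lam=0.15, ask_user=None):
    """Sketch of the reward model in the loop at the end of a dialogue.

    d_star       : embedding sigma(f_1:T) of the finished dialogue
    D_X, D_y     : the labelled set D = {(d_i, y_i)} collected so far
    success_prob : callable (D_X, D_y, d_star) -> p(y=1 | d*, D), e.g. a wrapper around the GP sketch
    ask_user     : callable returning the user's binary rating; used only when the model is uncertain
    """
    p = success_prob(D_X, D_y, d_star)
    if abs(p - 0.5) <= lam and ask_user is not None:     # uncertain region: query the user
        success = bool(ask_user())
        D_X = np.vstack([D_X, d_star])                   # grow D with the new labelled point
        D_y = np.append(D_y, 1.0 if success else -1.0)
    else:                                                # confident: trust the model's prediction
        success = p >= 0.5
    reward = 20.0 * success - 1.0 * num_turns            # same shape as the hand-coded task reward
    return reward, D_X, D_y

# Toy usage with a constant stand-in for the GP prediction:
r, DX, Dy = reward_with_active_model(np.zeros(2), np.zeros((1, 2)), np.array([1.0]),
                                     num_turns=6, success_prob=lambda X, y, d: 0.9)
print(r)   # 14.0
```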

Page 27

Outline

• Motivation – Learning from human users
• Proposed Framework
• Experiments
• Conclusion

Page 28

Dialogue Representation – Supervised

Visualising the dialogue distribution
• Labelled restaurant dialogue data; train : valid : test = 1000 : 1000 : 1000; dim(ds) = 32
• Analysis using t-SNE on ds
• Two clusters: Successful vs. Failed
• Successful dialogues are short; Failed ones time out
• Highly affected by the training labels

[Figure: t-SNE plot of ds.]
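A minimal sketch of this kind of visualisation with scikit-learn's t-SNE, using random stand-in embeddings and labels in place of the real ds vectors.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project dialogue embeddings (e.g. 32-dim d_s vectors) to 2-D and colour them by success label,
# to see whether successful and failed dialogues separate into clusters.
embeddings = np.random.randn(200, 32)          # stand-in for real d_s vectors
labels = np.random.randint(0, 2, size=200)     # stand-in success (1) / fail (0) labels

points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="coolwarm", s=8)
plt.title("t-SNE of dialogue embeddings (colour = success label)")
plt.show()
```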

Page 29

Dialogue Representation – Unsupervised

Visualising the dialogue distribution
• Unlabelled restaurant dialogue data; train : valid : test = 8565 : 1199 : 650; dim(du) = 64
• Analysis using t-SNE on du
• Colour gradient: short → long dialogues
• Successful dialogues are mostly < 10 turns
• Users don't engage in longer dialogues; length correlates highly with success

[Figure: t-SNE plot of du.]

Page 30

System Setup

• Cambridge restaurant domain:
  - ~100 venues
  - 3 informable slots: area, price range, food
  - 3 requestable slots: addr, phone, postcode
• Reward: −1 per turn; when the dialogue ends, binary success (0/1) × 20
• Crowd-sourced users from Amazon Mechanical Turk

Embed the reward model in the SDS. Compared reward models:
- On-line GP: the proposed method
- Subj: user rating only
- Off-line RNN (Su et al. 2015): an RNN trained with 1K simulated dialogues

[Figure: the SDS (Speech Understanding, Dialogue Policy trained with GPRL, Speech Generation) interacting with the user, with the Reward Model and Embedding Function attached.]

Page 31

On-line Dialogue Reward & Policy Learning

Dialogue policy learning with real users
• The two embeddings give similar performance
• However, the supervised embedding requires additional labels
• The unsupervised method is thus more desirable

Page 32

On-line Dialogue Reward & Policy Learning

Dialogue policy learning with real users
• All systems reached > 85% subjective success after 500 dialogues
• On-line GP is more robust than Subj in the longer run
• On-line GP needs only 150 queries for user ratings

Dialogues   Reward Model   Subjective (%)
400–500     Off-line RNN   89.0 ± 1.8
            Subj           90.7 ± 1.7
            On-line GP     91.7 ± 1.6
500–850     Subj           87.1 ± 1.0
            On-line GP     90.9 ± 0.9 *

Page 33

Outline

• Motivation – Learning from human users
• Proposed Framework
• Experiments
• Conclusion

Page 34

Conclusion

• Proposal: an on-line active reward learning framework
• Unsupervised dialogue embedding: Bi-LSTM Encoder-Decoder
• On-line active reward model: GP classifier with an uncertainty threshold
• Reduces data annotation and mitigates noisy user ratings
• No need for labelled data or a user simulator
• Achieves truly on-line policy learning from real users without task info

Page 35

Discussion

• Extend the reward model to an (ordinal) regression / multi-class task
  - Currently handles only binary classification
• Methods for evaluating the dialogue embedding
  - Mostly measured by downstream tasks

Page 36

Discussion

• Transfer knowledge across domains [1]
• Handle ambiguous meanings in language [2]
• Learn to reply in richer contexts [3]
• Get high-quality data [4]

[1] Gašić et al., "Policy Committee for adaptation in multi-domain spoken dialogue systems", ASRU 2015
[2] Mrkšić et al., "Counter-fitting Word Vectors to Linguistic Constraints", NAACL 2016
[3] Su et al., "Sample-efficient Actor-Critic Reinforcement Learning with Supervised Data for Dialogue Management", SIGDIAL 2017
[4] Wen et al., "A Network-based End-to-End Trainable Task-oriented Dialogue System", EACL 2017

Page 37

Acknowledgement

• Past & present group members:
  - Steve Young (Supervisor)
  - Milica Gasic (Advisor)
  - Dongho Kim
  - Pirros Tsiakoulis
  - Matt Henderson
  - David Vandyke
  - Nikola Mrksic
  - Shawn Wen
  - Lina Rojas Barahona
  - Stefan Ultes
  - Pawel Budzianowski
  - Inigo Casanueva

• Financial support:
  - Taiwan Cambridge PhD Scholarship
  - Funding from the Engineering Department

Page 38

Page 39

References

1. Pei-Hao Su, Milica Gašić, Nikola Mrkšić, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen and Steve Young, "On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems", in Proceedings of ACL 2016.
2. Pei-Hao Su, David Vandyke, Milica Gašić, Dongho Kim, Nikola Mrkšić, Tsung-Hsien Wen and Steve Young, "Learning from Real Users: Rating Dialogue Success with Neural Networks for Reinforcement Learning in Spoken Dialogue Systems", in Proceedings of Interspeech 2015.
3. David Vandyke, Pei-Hao Su, Milica Gašić, Nikola Mrkšić, Tsung-Hsien Wen and Steve Young, "Multi-Domain Dialogue Success Classifiers for Policy Training", in Proceedings of ASRU 2015.

Page 40

References

Chat-based Systems
- Oriol Vinyals, Quoc Le, "A Neural Conversational Model", arXiv:1506.05869.
- Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, Joelle Pineau, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models", in AAAI 2016.
- Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, "Deep Reinforcement Learning for Dialogue Generation", in EMNLP 2016.
- Al-Rfou et al., "Conversational Contextual Cues: The Case of Personalization and History for Response Ranking", arXiv 2016.

Task-oriented Dialogue Systems
- James Henderson, Oliver Lemon, Kallirroi Georgila, "Hybrid Reinforcement/Supervised Learning for Dialogue Policies from COMMUNICATOR Data", in IJCAI Workshop 2005.
- Jason Williams and Steve Young, "Partially Observable Markov Decision Processes for Spoken Dialog Systems", in CSL 2007.
- Antoine Bordes, Y-Lan Boureau, Jason Weston, "Learning End-to-End Goal-Oriented Dialog", in ICLR 2017.
- Wen et al., "A Network-based End-to-End Trainable Task-oriented Dialogue System", in EACL 2017.

Page 41

Questions?

[email protected]
http://mi.eng.cam.ac.uk/~phs26/

Page 42

Page 43

Example Dialogues – Low Noise

Page 44

Example Dialogues – High Noise