Reward Estimation for Dialogue Policy Optimisation · 2018-06-07
TRANSCRIPT
Reward Estimation for Dialogue Policy Optimisation
Pei-Hao (Eddy) Su, DeepHack.Turing, 25 July 2017
Dialogue Systems
• Chat-based agents
  – Hope to talk about everything (open domain)
  – No specific goal; focus on conversation flow
  – Variants of the Seq2Seq model: [Vinyals and Le 2015], [Serban et al. 2016], [Al-Rfou et al. 2016], [Li et al. 2016]
• Task-oriented systems
  – Achieve a certain task (closed domain)
  – Combination of rule-based and statistical components
  – Ground language in a knowledge base (ontology)
• Pipeline dialogue systems [Henderson et al. 2005, Williams and Young 2007]
• End-to-end dialogue systems [Bordes et al. 2017, Wen et al. 2017]

"Why are the Middle Ages called the Dark Ages? Because there were so many knights…"
Task-oriented Dialogue System (with paid subjects)

Task: find a restaurant (Chinese, cheap, west); ask for the phone number and address.

System: Hi, how may I help you?
User: I want a cheap Chinese restaurant.
System: Where in the city would you like?
User: Somewhere in the west, please.
System: Yim Wah is a nice Chinese place.
User: Great, can you give me the address?
System: It is at 2-4 Lensfield Road.
User: Ok, thank you, bye!
System: Thanks, goodbye.

Success evaluation:
– Objective: Fail (no phone given)
– Subjective: Success (the user got everything he asked for)
→ Ambiguity; not practical.
Goal

Define a learning objective (reward) to train a dialogue system on-line from real users.
• Tasks: evaluate the dialogue (reward modelling); deal with unreliable user ratings; learn a dialogue policy
• Models: recurrent neural networks, Gaussian processes
• Methods: reinforcement learning, on-line learning, active learning
Outline
• Motivation – learning from real users
• Proposed Framework
• Experiments
• Conclusion
Pipeline Spoken Dialogue System

[Diagram: User → Speech Understanding (ASR → Semantic Decoder) → Dialogue Manager (Belief State Tracker → Dialogue Policy) → Speech Generation (NLG → TTS) → User]

Example turn:
– User: "Somewhere in the west, please." (ASR hypothesis: "Somewhere in the wet, please.")
– Belief state for Area: East 0.01, West 0.94, …, None 0.05
– System action: Inform(name=Yim Wah, area=west)
– System: "Yim Wah is a nice place in the west."
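The pipeline above can be sketched as function composition. Everything below is a toy stand-in (the real system uses a statistical ASR, a semantic decoder, a POMDP belief tracker, and a trained policy); the component names and lambdas are purely illustrative:

```python
def run_turn(user_utterance, decode, track, belief, policy, generate):
    """One pipeline turn: understanding -> belief tracking -> policy -> generation.
    Each component function is a toy stand-in for the corresponding module."""
    semantics = decode(user_utterance)        # speech understanding
    belief = track(belief, semantics)         # belief state update
    system_act = policy(belief)               # dialogue policy picks an action
    return generate(system_act), belief       # speech generation

# Toy components reproducing the slide's example flow:
decode = lambda u: {"area": "west"} if "west" in u else {}
track = lambda b, s: {**b, **s}
policy = lambda b: ("inform", {"name": "Yim Wah", **b})
generate = lambda act: f"{act[1]['name']} is a nice place in the {act[1].get('area', '')}."

reply, belief = run_turn("Somewhere in the west, please.", decode, track, {}, policy, generate)
```

The point of the sketch is the modularity: each stage can be trained or replaced independently, which is exactly what makes the pipeline architecture attractive for statistical dialogue systems.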
Reinforcement Learning 101

The agent learns to take actions to maximise total reward.

[Diagram: Agent → action → Environment → observation, reward → Agent]

Example: AlphaGo, which beat Go champions in 2016 and 2017; each action is the next move.
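The agent-environment diagram can be made concrete with a generic interaction loop; the `reset`/`step` interface below is a hypothetical gym-style convention, not part of the system described here:

```python
def run_episode(env, agent, max_steps=100):
    """Generic RL loop from the diagram: the agent observes, acts, and
    receives a reward; over many episodes it learns to maximise the
    total (cumulative) reward. `env` and `agent` follow an assumed
    gym-like interface (reset/step, act/observe)."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)               # agent picks an action
        obs, reward, done = env.step(action)  # environment responds
        agent.observe(obs, reward)            # feedback used for learning
        total_reward += reward
        if done:
            break
    return total_reward
```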
Reinforcement Learning 101 (in dialogue)

The agent (the dialogue policy) learns to take actions to maximise total reward; the environment is the user.
– Observation: belief state, e.g. area=north: 0.94, area=east: 0.06
– Action: Inform_area → Inform(area=north)
Dialogue Manager in the RL Framework

[Diagram: Agent (Dialogue Policy) ↔ Environment (User), connected through Speech Understanding and Speech Generation; Action A, Observation O, Reward R]

Correct rewards are a crucial factor in dialogue policy training.
Reward for RL ≅ Evaluation for SDS
• Dialogue is a special RL task: a human is involved in both the interaction and the rating (evaluation) of the dialogue
• Human-in-the-loop framework: humans are troublesome but useful
• Rating dimensions: correctness, appropriateness, and adequacy
  – Expert rating: high quality, high cost
  – User rating: unreliable quality, medium cost
  – Objective rating: checks desired aspects, low cost
The Reinforcement Signal in SDS

Typical reward function:
– per-turn penalty of −1
– large reward at completion if the dialogue was successful

This typically requires prior knowledge of the task:
✔ simulated users
✔ paid users (Amazon Mechanical Turk)
✖ real users
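As a minimal sketch, the typical reward function above can be written directly; the +20 success bonus matches the experimental setup described later in the deck, and the parameter names are illustrative:

```python
def dialogue_reward(num_turns: int, success: bool,
                    turn_penalty: float = -1.0,
                    success_reward: float = 20.0) -> float:
    """Typical task-oriented dialogue reward: a per-turn penalty of -1,
    plus a large terminal bonus (here 20, as in the experiments) if the
    task was completed successfully."""
    return turn_penalty * num_turns + (success_reward if success else 0.0)
```

This shape rewards short, successful dialogues: a 3-turn success scores higher than an 8-turn success, and any failure is strictly negative.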
The Reinforcement Signal in SDS

How to learn a policy from real users?
• Infer success (the reward) directly from dialogues
• Train a reward estimator from data (Su et al. 2015)

[Diagram: turn features f1, f2, f3, f4, …, fT → RNN hidden layers → output 1/0 (success/fail)]
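A sketch of the estimator's forward pass, assuming a plain (vanilla) RNN for readability; the weights and dimensions below are hypothetical stand-ins for the trained model, not the architecture from Su et al. 2015:

```python
import math

def _matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in M]

def rnn_success_prob(turn_features, W_h, W_x, b_h, w_out, b_out):
    """Feed turn-level features f_1..f_T through a vanilla RNN and squash
    the final hidden state to a success probability (the 1/0 output)."""
    h = [0.0] * len(b_h)
    for f in turn_features:  # one recurrent step per dialogue turn
        pre = [a + b + c for a, b, c in
               zip(_matvec(W_h, h), _matvec(W_x, f), b_h)]
        h = [math.tanh(p) for p in pre]
    logit = sum(w * x for w, x in zip(w_out, h)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # P(dialogue successful)
```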
The Reinforcement Signal in SDS

RNN Reward Estimator for Policy Learning

The RNN-based system learns the policy more practically and efficiently than the objective baseline:
– Objective baseline: needs task information; learns only from dialogues where the objective and subjective ratings agree (500 out of ~900)
– RNN system: needs no task information or user feedback; learns from every dialogue (all 500)
The Reinforcement Signal in SDS

How to learn a policy from real users?
• Infer success (the reward) directly from dialogues; train a reward estimator from data (Su et al. 2015)
• User ratings are noisy and difficult/costly to obtain
• Robust user rating model (Su et al. 2016):
  – noisy → Gaussian process with uncertainty
  – difficult/costly → active learning
System Framework

[Diagram: Agent (Dialogue Policy) ↔ Environment (User), through Speech Understanding and Speech Generation; Action A, Observation O; the user's Rating feeds a Reward Model, which supplies the reward to the policy]
System Framework

Reward modelling on the user's binary success rating. The Reward Model has two parts:
A. An embedding function, mapping a dialogue to a representation
B. A success/fail classifier, which supplies the reinforcement signal and may query the user for a rating
A. Dialogue Embedding (Vandyke & Su et al., ASRU 2015)

Maps a dialogue sequence to a fixed-length vector.

Turn-level feature vector:
f_t = [distribution over user intention ; 1-hot system action ; rescaled turn number]

Training data: {f_1, …, f_T} for each dialogue.
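Constructing f_t can be sketched as a simple concatenation; the slot names and dimensions below are illustrative, and the full feature set follows Vandyke & Su et al. 2015:

```python
def turn_feature(belief, action_index, num_actions, turn, max_turns):
    """Build the turn-level feature
    f_t = [belief distribution ; 1-hot system action ; rescaled turn].
    `belief` maps intention values to probabilities (insertion order fixed)."""
    one_hot = [0.0] * num_actions
    one_hot[action_index] = 1.0
    return list(belief.values()) + one_hot + [turn / max_turns]
```

For example, with the Area belief from the pipeline slide, three system actions, and turn 2 of at most 30, the vector has 3 + 3 + 1 = 7 dimensions.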
A. Dialogue Embedding – Supervised

Re-use the supervised RNN success estimator: the last hidden layer serves as the dialogue representation d_s.

[Diagram: f_1, f_2, f_3, f_4, …, f_T → RNN hidden layers → d_s]
A. Dialogue Embedding – Unsupervised

Bi-LSTM encoder-decoder (Seq2Seq):
• Reconstructs inputs of variable length
• The concatenated hidden state h_t = [→h_t ; ←h_t] captures forward and backward information
• The bottleneck d_u is the dialogue representation
• MSE training criterion: E = (1/T) Σ_t ||f_t − f'_t||², where f_t is the input/target and f'_t the prediction
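The MSE reconstruction criterion can be sketched directly, using lists of lists as a stand-in for the turn feature matrices (the encoder-decoder itself is omitted here):

```python
def mse_reconstruction_loss(targets, predictions):
    """MSE training criterion for the sequence autoencoder:
    mean over turns t of ||f_t - f'_t||^2, where f_t is the
    input/target turn feature and f'_t the decoder's prediction."""
    total = 0.0
    for f, f_pred in zip(targets, predictions):
        total += sum((a - b) ** 2 for a, b in zip(f, f_pred))
    return total / len(targets)
```

A perfect reconstruction gives zero loss; the gradient of this loss is what trains the encoder to compress each dialogue into the bottleneck d_u.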
B. Active Reward Learning Model
• Determine the class probability p(y | d, D), given the labelled set D = {(d_i, y_i)}_{i=1}^N, where y ∈ {+1, −1}
• Handle the issue of noisy and costly user ratings
• Gaussian process (GP) with active learning
B. Active Reward Learning Model

Gaussian process classifier for success rating
• GPs have proven useful in policy learning (Gašić '14, Casanueva '15): they learn from few observations and provide a measure of uncertainty
• p(y = 1 | d, D) = φ(f(d | D)), where
  – f is a latent function R^dim(d) → R
  – φ is the probit function R → [0, 1]
• Prior: f(d) ~ GP(m(d), k(d, d'))

[Figure: the latent function f(d) over the dialogue representation d (×: labelled data), squashed through φ into the success probability p(y=1); d* marks a prediction point]
B. Active Reward Learning Model

Gaussian process classifier for success rating
• Prior: f(d) ~ GP(m(d), k(d, d'))
• Predictive distribution: p(y = 1 | d, D) = φ(f(d | D))
• Prediction at a new point d*: p(y* = 1 | d*, D) = φ(m* / √(1 + σ*²))
  (with large uncertainty σ*², m* / √(1 + σ*²) → 0, so φ(·) → 0.5)

[Figure: latent function f(d) over the dialogue representation d (×: labelled data) and the resulting success probability p(y=1); d* marks the prediction point]
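The predictive probability above is a one-line computation once the GP's predictive mean m* and variance σ*² at d* are known; the probit link φ is the standard normal CDF, expressed here via the error function:

```python
import math

def gp_success_prob(m_star: float, var_star: float) -> float:
    """Predictive success probability of the GP classifier at a new
    dialogue representation d*:
        p(y*=1 | d*, D) = phi(m* / sqrt(1 + sigma*^2)),
    where phi is the standard normal CDF (probit link)."""
    z = m_star / math.sqrt(1.0 + var_star)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Note how the variance acts as a confidence damper: the same mean m* yields a probability closer to 0.5 as σ*² grows, which is exactly the behaviour the active-learning rule exploits.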
B. Active Reward Learning Model

Gaussian process classifier for success rating; handling the issue of noisy and costly user ratings:
• Add a noise term to the RBF kernel, e.g. k(d, d') = σ_k² exp(−||d − d'||² / (2l²)) (input correlation) + σ_n² δ(d, d') (user-rating noise); more noise → less certain
• Active learning with a threshold λ on the predictive probability, which decides when to query the user for a rating
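One plausible instantiation of the query rule, sketched here as an assumption rather than the exact rule from Su et al. 2016: ask the user only when the classifier's predictive probability falls within λ of the decision boundary at 0.5.

```python
def should_query_user(p_success: float, lam: float) -> bool:
    """Active-learning decision (a sketch): query the user for a rating
    only when the GP's predictive success probability is uncertain,
    i.e. within `lam` (the threshold lambda) of 0.5. Otherwise trust
    the model's own label and save the query."""
    return abs(p_success - 0.5) < lam
```

Since high GP variance pulls the predictive probability towards 0.5, this rule concentrates the (costly, noisy) user queries on exactly the dialogues the model is unsure about.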
B. Active Reward Learning Model

Categories of active learning under noisy ratings (Settles, Active Learning Literature Survey, 2009).
System Framework: the Active Reward Model in the loop
• The turn features {f_1, …, f_T} of each dialogue are embedded: d* = σ(f_{1:T})
• If d* falls in the uncertain (green) region, query the user; e.g. the user rates "Failed" → reward = −1 × scalar
• The labelled pair is added to the dataset D = {(d, y)}
Dialogue Representation – Supervised

Visualising the dialogue distribution:
• Labelled restaurant dialogue data; train : valid : test = 1000 : 1000 : 1000; dim(d_s) = 32
• Analysis using t-SNE on d_s [t-SNE plot]
• Two clusters emerge: successful vs. failed; successful dialogues are short, failed ones time out
• Highly affected by the training labels
Dialogue Representation – Unsupervised

Visualising the dialogue distribution:
• Unlabelled restaurant dialogue data; train : valid : test = 8565 : 1199 : 650; dim(d_u) = 64
• Analysis using t-SNE on d_u [t-SNE plot]; colour gradient from short to long dialogues
• Successful dialogues take < 10 turns; users don't engage in longer dialogues, so length correlates highly with success
System Setup: embed the reward model in the SDS
• Cambridge restaurant domain: ~100 venues; 3 informable slots (area, price range, food); 3 requestable slots (address, phone, postcode)
• Reward: −1 per turn; when the dialogue ends, binary success (0/1) × 20
• Crowd-sourced users from Amazon Mechanical Turk
• Compared systems:
  – On-line GP: the proposed method
  – Subj: user rating only
  – Off-line RNN (Su et al. 2015): an RNN trained on 1K simulated dialogues

[Diagram: User ↔ Speech Understanding / Speech Generation ↔ Dialogue Policy (GPRL), with the Reward Model (embedding function + GP classifier) supplying the reward]
On-line Dialogue Reward & Policy Learning

Dialogue policy learning with real users:
• The two embeddings give similar performance
• However, the supervised embedding requires additional labels; the unsupervised method is thus more desirable
On-line Dialogue Reward & Policy Learning

Dialogue policy learning with real users:
• All systems reached > 85% subjective success after 500 dialogues
• On-line GP is more robust than Subj in the longer run
• On-line GP needed only 150 user-rating queries

Dialogues   Reward Model   Subjective (%)
400–500     Off-line RNN   89.0 ± 1.8
            Subj           90.7 ± 1.7
            On-line GP     91.7 ± 1.6
500–850     Subj           87.1 ± 1.0
            On-line GP     90.9 ± 0.9 *
Conclusion
• Proposal: an on-line active reward learning framework
• Unsupervised dialogue embedding: Bi-LSTM encoder-decoder
• On-line active reward model: GP classifier with an uncertainty threshold
• Reduces data annotation and mitigates noisy user ratings
• Needs no labelled data and no user simulator
• Achieves truly on-line policy learning from real users without task information
Discussion
• Extend the reward model to an (ordinal) regression / multi-class task; it currently handles only binary classification
• Methods for evaluating the dialogue embedding; it is mostly measured via downstream tasks
Discussion
• Transfer knowledge across domains [1]
• Handle ambiguous meanings in language [2]
• Learn to reply in richer contexts [3]
• Get high-quality data [4]

[1] Gašić et al., "Policy Committee for Adaptation in Multi-domain Spoken Dialogue Systems", ASRU 2015
[2] Mrkšić et al., "Counter-fitting Word Vectors to Linguistic Constraints", NAACL 2016
[3] Su et al., "Sample-efficient Actor-Critic Reinforcement Learning with Supervised Data for Dialogue Management", SIGDIAL 2017
[4] Wen et al., "A Network-based End-to-End Trainable Task-oriented Dialogue System", EACL 2017
Acknowledgement
• Past & present group members: Steve Young (supervisor), Milica Gašić (advisor), Dongho Kim, Pirros Tsiakoulis, Matt Henderson, David Vandyke, Nikola Mrkšić, Shawn Wen, Lina Rojas-Barahona, Stefan Ultes, Paweł Budzianowski, Iñigo Casanueva
• Financial support: Taiwan Cambridge PhD Scholarship; funding from the Engineering Department
References
1. Pei-Hao Su, Milica Gašić, Nikola Mrkšić, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen and Steve Young, "On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems". In Proceedings of ACL 2016.
2. Pei-Hao Su, David Vandyke, Milica Gašić, Dongho Kim, Nikola Mrkšić, Tsung-Hsien Wen and Steve Young, "Learning from Real Users: Rating Dialogue Success with Neural Networks for Reinforcement Learning in Spoken Dialogue Systems". In Proceedings of Interspeech 2015.
3. David Vandyke, Pei-Hao Su, Milica Gašić, Nikola Mrkšić, Tsung-Hsien Wen and Steve Young, "Multi-Domain Dialogue Success Classifiers for Policy Training". In Proceedings of ASRU 2015.
References

Chat-based Systems
– Oriol Vinyals, Quoc Le, "A Neural Conversational Model". arXiv 1506.05869.
– Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, Joelle Pineau, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models". In AAAI 2016.
– Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, "Deep Reinforcement Learning for Dialogue Generation". In EMNLP 2016.
– Al-Rfou et al., "Conversational Contextual Cues: The Case of Personalization and History for Response Ranking". arXiv 2016.

Task-oriented Dialogue Systems
– James Henderson, Oliver Lemon, Kallirroi Georgila, "Hybrid Reinforcement/Supervised Learning for Dialogue Policies from COMMUNICATOR Data". In IJCAI Workshop 2005.
– Jason Williams and Steve Young, "Partially Observable Markov Decision Processes for Spoken Dialog Systems". In CSL 2007.
– Antoine Bordes, Y-Lan Boureau, Jason Weston, "Learning End-to-End Goal-Oriented Dialog". In ICLR 2017.
– Wen et al., "A Network-based End-to-End Trainable Task-oriented Dialogue System". In EACL 2017.
Questions?
[email protected]://mi.eng.cam.ac.uk/~phs26/
Example Dialogues – Low Noise
Example Dialogues – High Noise