Reward Estimation for Dialogue Policy Optimisation · 2018-06-07
TRANSCRIPT
Reward Estimation for Dialogue Policy Optimisation
Pei-Hao (Eddy) Su, DeepHack.Turing, 25 July 2017
Dialogue Systems
• Chat-based agents
  – Hope to talk about everything (open domain)
  – No specific goal; focus on conversation flow
  – Variants of the Seq2Seq model: [Vinyals and Le 2015], [Serban et al. 2016], [Al-Rfou et al. 2016], [Li et al. 2016]
• Task-oriented systems
  – Achieve a certain task (closed domain)
  – Combination of rule-based and statistical components
  – Ground language in a knowledge base (ontology)
• Pipeline dialogue systems [Henderson et al. 2005, Williams and Young 2007]
• End-to-end dialogue systems [Bordes et al. 2017, Wen et al. 2017]

"Why are the Middle Ages called the Dark Ages? Because there were so many knights…"
Task-oriented Dialogue System (with paid subjects)

Task: find a restaurant (Chinese, cheap, west); ask for the phone number and address.

System: Hi, how may I help you?
User: I want a cheap Chinese restaurant.
System: Where in the city would you like?
User: Somewhere in the west, please.
System: Yim Wah is a nice Chinese place.
User: Great, can you give me the address?
System: It is at 2-4 Lensfield Road.
User: Ok, thank you, bye!
System: Thanks, goodbye.

Success evaluation:
– Objective: Fail (no phone given)
– Subjective: Success (the user got everything he asked for)
→ Ambiguity; not practical.
Goal

Define a learning objective (reward) to train a dialogue system on-line from real users.
• Tasks: evaluate the dialogue (reward modelling); deal with unreliable user ratings; learn a dialogue policy
• Models: recurrent neural networks, Gaussian processes
• Methods: reinforcement learning, on-line learning, active learning
Outline
• Motivation – learning from real users
• Proposed Framework
• Experiments
• Conclusion
Pipeline Spoken Dialogue System

[Diagram: User → Speech Understanding (ASR → Semantic Decoder) → Dialogue Manager (Belief State Tracker → Dialogue Policy) → Speech Generation (NLG → TTS) → User]

Example turn:
– User: "Somewhere in the west, please." (ASR hypothesis: "Somewhere in the wet, please.")
– Belief state for Area: East 0.01, West 0.94, …, None 0.05
– System action: Inform(name=Yim Wah, area=west)
– System: "Yim Wah is a nice place in the west."
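The pipeline above can be sketched as function composition. Everything below is a toy stand-in (the real system uses a statistical ASR, a semantic decoder, a POMDP belief tracker, and a trained policy); the component names and lambdas are purely illustrative:

```python
def run_turn(user_utterance, decode, track, belief, policy, generate):
    """One pipeline turn: understanding -> belief tracking -> policy -> generation.
    Each component function is a toy stand-in for the corresponding module."""
    semantics = decode(user_utterance)        # speech understanding
    belief = track(belief, semantics)         # belief state update
    system_act = policy(belief)               # dialogue policy picks an action
    return generate(system_act), belief       # speech generation

# Toy components reproducing the slide's example flow:
decode = lambda u: {"area": "west"} if "west" in u else {}
track = lambda b, s: {**b, **s}
policy = lambda b: ("inform", {"name": "Yim Wah", **b})
generate = lambda act: f"{act[1]['name']} is a nice place in the {act[1].get('area', '')}."

reply, belief = run_turn("Somewhere in the west, please.", decode, track, {}, policy, generate)
```

The point of the sketch is the modularity: each stage can be trained or replaced independently, which is exactly what makes the pipeline architecture attractive for statistical dialogue systems.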
Reinforcement Learning 101

The agent learns to take actions to maximise total reward.

[Diagram: Agent → action → Environment → observation, reward → Agent]

Example: AlphaGo, which beat Go champions in 2016 and 2017; each action is the next move.
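The agent-environment diagram can be made concrete with a generic interaction loop; the `reset`/`step` interface below is a hypothetical gym-style convention, not part of the system described here:

```python
def run_episode(env, agent, max_steps=100):
    """Generic RL loop from the diagram: the agent observes, acts, and
    receives a reward; over many episodes it learns to maximise the
    total (cumulative) reward. `env` and `agent` follow an assumed
    gym-like interface (reset/step, act/observe)."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)               # agent picks an action
        obs, reward, done = env.step(action)  # environment responds
        agent.observe(obs, reward)            # feedback used for learning
        total_reward += reward
        if done:
            break
    return total_reward
```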
Reinforcement Learning 101 (in dialogue)

The agent (the dialogue policy) learns to take actions to maximise total reward; the environment is the user.
– Observation: belief state, e.g. area=north: 0.94, area=east: 0.06
– Action: Inform_area → Inform(area=north)
Dialogue Manager in the RL Framework

[Diagram: Agent (Dialogue Policy) ↔ Environment (User), connected through Speech Understanding and Speech Generation; Action A, Observation O, Reward R]

Correct rewards are a crucial factor in dialogue policy training.
Reward for RL ≅ Evaluation for SDS
• Dialogue is a special RL task: a human is involved in both the interaction and the rating (evaluation) of the dialogue
• Human-in-the-loop framework: humans are troublesome but useful
• Rating dimensions: correctness, appropriateness, and adequacy
  – Expert rating: high quality, high cost
  – User rating: unreliable quality, medium cost
  – Objective rating: checks desired aspects, low cost
The Reinforcement Signal in SDS

Typical reward function:
– per-turn penalty of −1
– large reward at completion if the dialogue was successful

This typically requires prior knowledge of the task:
✔ simulated users
✔ paid users (Amazon Mechanical Turk)
✖ real users
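As a minimal sketch, the typical reward function above can be written directly; the +20 success bonus matches the experimental setup described later in the deck, and the parameter names are illustrative:

```python
def dialogue_reward(num_turns: int, success: bool,
                    turn_penalty: float = -1.0,
                    success_reward: float = 20.0) -> float:
    """Typical task-oriented dialogue reward: a per-turn penalty of -1,
    plus a large terminal bonus (here 20, as in the experiments) if the
    task was completed successfully."""
    return turn_penalty * num_turns + (success_reward if success else 0.0)
```

This shape rewards short, successful dialogues: a 3-turn success scores higher than an 8-turn success, and any failure is strictly negative.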
The Reinforcement Signal in SDS

How to learn a policy from real users?
• Infer success (the reward) directly from dialogues
• Train a reward estimator from data (Su et al. 2015)

[Diagram: turn features f1, f2, f3, f4, …, fT → RNN hidden layers → output 1/0 (success/fail)]
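A sketch of the estimator's forward pass, assuming a plain (vanilla) RNN for readability; the weights and dimensions below are hypothetical stand-ins for the trained model, not the architecture from Su et al. 2015:

```python
import math

def _matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in M]

def rnn_success_prob(turn_features, W_h, W_x, b_h, w_out, b_out):
    """Feed turn-level features f_1..f_T through a vanilla RNN and squash
    the final hidden state to a success probability (the 1/0 output)."""
    h = [0.0] * len(b_h)
    for f in turn_features:  # one recurrent step per dialogue turn
        pre = [a + b + c for a, b, c in
               zip(_matvec(W_h, h), _matvec(W_x, f), b_h)]
        h = [math.tanh(p) for p in pre]
    logit = sum(w * x for w, x in zip(w_out, h)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # P(dialogue successful)
```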
The Reinforcement Signal in SDS

RNN Reward Estimator for Policy Learning

The RNN-based system learns the policy more practically and efficiently than the objective baseline:
– Objective baseline: needs task information; learns only from dialogues where the objective and subjective ratings agree (500 out of ~900)
– RNN system: needs no task information or user feedback; learns from every dialogue (all 500)
The Reinforcement Signal in SDS

How to learn a policy from real users?
• Infer success (the reward) directly from dialogues; train a reward estimator from data (Su et al. 2015)
• User ratings are noisy and difficult/costly to obtain
• Robust user rating model (Su et al. 2016):
  – noisy → Gaussian process with uncertainty
  – difficult/costly → active learning
System Framework

[Diagram: Agent (Dialogue Policy) ↔ Environment (User), through Speech Understanding and Speech Generation; Action A, Observation O; the user's Rating feeds a Reward Model, which supplies the reward to the policy]
System Framework

Reward modelling on the user's binary success rating. The Reward Model has two parts:
A. An embedding function, mapping a dialogue to a representation
B. A success/fail classifier, which supplies the reinforcement signal and may query the user for a rating
A. Dialogue Embedding (Vandyke & Su et al., ASRU 2015)

Maps a dialogue sequence to a fixed-length vector.

Turn-level feature vector:
f_t = [distribution over user intention ; 1-hot system action ; rescaled turn number]

Training data: {f_1, …, f_T} for each dialogue.
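Constructing f_t can be sketched as a simple concatenation; the slot names and dimensions below are illustrative, and the full feature set follows Vandyke & Su et al. 2015:

```python
def turn_feature(belief, action_index, num_actions, turn, max_turns):
    """Build the turn-level feature
    f_t = [belief distribution ; 1-hot system action ; rescaled turn].
    `belief` maps intention values to probabilities (insertion order fixed)."""
    one_hot = [0.0] * num_actions
    one_hot[action_index] = 1.0
    return list(belief.values()) + one_hot + [turn / max_turns]
```

For example, with the Area belief from the pipeline slide, three system actions, and turn 2 of at most 30, the vector has 3 + 3 + 1 = 7 dimensions.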
A. Dialogue Embedding – Supervised

Re-use the supervised RNN success estimator: the last hidden layer serves as the dialogue representation d_s.

[Diagram: f_1, f_2, f_3, f_4, …, f_T → RNN hidden layers → d_s]
A. Dialogue Embedding – Unsupervised

Bi-LSTM encoder-decoder (Seq2Seq):
• Reconstructs inputs of variable length
• The concatenated hidden state h_t = [→h_t ; ←h_t] captures forward and backward information
• The bottleneck d_u is the dialogue representation
• MSE training criterion: E = (1/T) Σ_t ||f_t − f'_t||², where f_t is the input/target and f'_t the prediction
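The MSE reconstruction criterion can be sketched directly, using lists of lists as a stand-in for the turn feature matrices (the encoder-decoder itself is omitted here):

```python
def mse_reconstruction_loss(targets, predictions):
    """MSE training criterion for the sequence autoencoder:
    mean over turns t of ||f_t - f'_t||^2, where f_t is the
    input/target turn feature and f'_t the decoder's prediction."""
    total = 0.0
    for f, f_pred in zip(targets, predictions):
        total += sum((a - b) ** 2 for a, b in zip(f, f_pred))
    return total / len(targets)
```

A perfect reconstruction gives zero loss; the gradient of this loss is what trains the encoder to compress each dialogue into the bottleneck d_u.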
B. Active Reward Learning Model
• Determine the class probability p(y | d, D), given the labelled set D = {(d_i, y_i)}_{i=1}^N, where y ∈ {+1, −1}
• Handle the issue of noisy and costly user ratings
• Gaussian process (GP) with active learning
B. Active Reward Learning Model

Gaussian process classifier for success rating
• GPs have proven useful in policy learning (Gašić '14, Casanueva '15): they learn from few observations and provide a measure of uncertainty
• p(y = 1 | d, D) = φ(f(d | D)), where
  – f is a latent function R^dim(d) → R
  – φ is the probit function R → [0, 1]
• Prior: f(d) ~ GP(m(d), k(d, d'))

[Figure: the latent function f(d) over the dialogue representation d (×: labelled data), squashed through φ into the success probability p(y=1); d* marks a prediction point]
B. Active Reward Learning Model

Gaussian process classifier for success rating
• Prior: f(d) ~ GP(m(d), k(d, d'))
• Predictive distribution: p(y = 1 | d, D) = φ(f(d | D))
• Prediction at a new point d*: p(y* = 1 | d*, D) = φ(m* / √(1 + σ*²))
  (with large uncertainty σ*², m* / √(1 + σ*²) → 0, so φ(·) → 0.5)

[Figure: latent function f(d) over the dialogue representation d (×: labelled data) and the resulting success probability p(y=1); d* marks the prediction point]
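The predictive probability above is a one-line computation once the GP's predictive mean m* and variance σ*² at d* are known; the probit link φ is the standard normal CDF, expressed here via the error function:

```python
import math

def gp_success_prob(m_star: float, var_star: float) -> float:
    """Predictive success probability of the GP classifier at a new
    dialogue representation d*:
        p(y*=1 | d*, D) = phi(m* / sqrt(1 + sigma*^2)),
    where phi is the standard normal CDF (probit link)."""
    z = m_star / math.sqrt(1.0 + var_star)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Note how the variance acts as a confidence damper: the same mean m* yields a probability closer to 0.5 as σ*² grows, which is exactly the behaviour the active-learning rule exploits.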
B. Active Reward Learning Model

Gaussian process classifier for success rating; handling the issue of noisy and costly user ratings:
• Add a noise term to the RBF kernel, e.g. k(d, d') = σ_k² exp(−||d − d'||² / (2l²)) (input correlation) + σ_n² δ(d, d') (user-rating noise); more noise → less certain
• Active learning with a threshold λ on the predictive probability, which decides when to query the user for a rating
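One plausible instantiation of the query rule, sketched here as an assumption rather than the exact rule from Su et al. 2016: ask the user only when the classifier's predictive probability falls within λ of the decision boundary at 0.5.

```python
def should_query_user(p_success: float, lam: float) -> bool:
    """Active-learning decision (a sketch): query the user for a rating
    only when the GP's predictive success probability is uncertain,
    i.e. within `lam` (the threshold lambda) of 0.5. Otherwise trust
    the model's own label and save the query."""
    return abs(p_success - 0.5) < lam
```

Since high GP variance pulls the predictive probability towards 0.5, this rule concentrates the (costly, noisy) user queries on exactly the dialogues the model is unsure about.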
B. Active Reward Learning Model

Categories of active learning under noisy ratings (Settles, Active Learning Literature Survey, 2009).
System Framework: the Active Reward Model in the loop
• The turn features {f_1, …, f_T} of each dialogue are embedded: d* = σ(f_{1:T})
• If d* falls in the uncertain (green) region, query the user; e.g. the user rates "Failed" → reward = −1 × scalar
• The labelled pair is added to the dataset D = {(d, y)}
Dialogue Representation – Supervised

Visualising the dialogue distribution:
• Labelled restaurant dialogue data; train : valid : test = 1000 : 1000 : 1000; dim(d_s) = 32
• Analysis using t-SNE on d_s [t-SNE plot]
• Two clusters emerge: successful vs. failed; successful dialogues are short, failed ones time out
• Highly affected by the training labels
Dialogue Representation – Unsupervised

Visualising the dialogue distribution:
• Unlabelled restaurant dialogue data; train : valid : test = 8565 : 1199 : 650; dim(d_u) = 64
• Analysis using t-SNE on d_u [t-SNE plot]; colour gradient from short to long dialogues
• Successful dialogues take < 10 turns; users don't engage in longer dialogues, so length correlates highly with success
System Setup: embed the reward model in the SDS
• Cambridge restaurant domain: ~100 venues; 3 informable slots (area, price range, food); 3 requestable slots (address, phone, postcode)
• Reward: −1 per turn; when the dialogue ends, binary success (0/1) × 20
• Crowd-sourced users from Amazon Mechanical Turk
• Compared systems:
  – On-line GP: the proposed method
  – Subj: user rating only
  – Off-line RNN (Su et al. 2015): an RNN trained on 1K simulated dialogues

[Diagram: User ↔ Speech Understanding / Speech Generation ↔ Dialogue Policy (GPRL), with the Reward Model (embedding function + GP classifier) supplying the reward]
On-line Dialogue Reward & Policy Learning

Dialogue policy learning with real users:
• The two embeddings give similar performance
• However, the supervised embedding requires additional labels; the unsupervised method is thus more desirable
On-line Dialogue Reward & Policy Learning

Dialogue policy learning with real users:
• All systems reached > 85% subjective success after 500 dialogues
• On-line GP is more robust than Subj in the longer run
• On-line GP needed only 150 user-rating queries

Dialogues   Reward Model   Subjective (%)
400–500     Off-line RNN   89.0 ± 1.8
            Subj           90.7 ± 1.7
            On-line GP     91.7 ± 1.6
500–850     Subj           87.1 ± 1.0
            On-line GP     90.9 ± 0.9 *
Conclusion
• Proposal: an on-line active reward learning framework
• Unsupervised dialogue embedding: Bi-LSTM encoder-decoder
• On-line active reward model: GP classifier with an uncertainty threshold
• Reduces data annotation and mitigates noisy user ratings
• Needs no labelled data and no user simulator
• Achieves truly on-line policy learning from real users without task information
Discussion
• Extend the reward model to an (ordinal) regression / multi-class task; it currently handles only binary classification
• Methods for evaluating the dialogue embedding; it is mostly measured via downstream tasks
Discussion
• Transfer knowledge across domains [1]
• Handle ambiguous meanings in language [2]
• Learn to reply in richer contexts [3]
• Get high-quality data [4]

[1] Gašić et al., "Policy Committee for Adaptation in Multi-domain Spoken Dialogue Systems", ASRU 2015
[2] Mrkšić et al., "Counter-fitting Word Vectors to Linguistic Constraints", NAACL 2016
[3] Su et al., "Sample-efficient Actor-Critic Reinforcement Learning with Supervised Data for Dialogue Management", SIGDIAL 2017
[4] Wen et al., "A Network-based End-to-End Trainable Task-oriented Dialogue System", EACL 2017
Acknowledgement
• Past & present group members: Steve Young (supervisor), Milica Gašić (advisor), Dongho Kim, Pirros Tsiakoulis, Matt Henderson, David Vandyke, Nikola Mrkšić, Shawn Wen, Lina Rojas-Barahona, Stefan Ultes, Paweł Budzianowski, Iñigo Casanueva
• Financial support: Taiwan Cambridge PhD Scholarship; funding from the Engineering Department
References
1. Pei-Hao Su, Milica Gašić, Nikola Mrkšić, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen and Steve Young, "On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems". In Proceedings of ACL 2016.
2. Pei-Hao Su, David Vandyke, Milica Gašić, Dongho Kim, Nikola Mrkšić, Tsung-Hsien Wen and Steve Young, "Learning from Real Users: Rating Dialogue Success with Neural Networks for Reinforcement Learning in Spoken Dialogue Systems". In Proceedings of Interspeech 2015.
3. David Vandyke, Pei-Hao Su, Milica Gašić, Nikola Mrkšić, Tsung-Hsien Wen and Steve Young, "Multi-Domain Dialogue Success Classifiers for Policy Training". In Proceedings of ASRU 2015.
References

Chat-based Systems
– Oriol Vinyals, Quoc Le, "A Neural Conversational Model". arXiv 1506.05869.
– Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, Joelle Pineau, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models". In AAAI 2016.
– Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky, "Deep Reinforcement Learning for Dialogue Generation". In EMNLP 2016.
– Al-Rfou et al., "Conversational Contextual Cues: The Case of Personalization and History for Response Ranking". arXiv 2016.

Task-oriented Dialogue Systems
– James Henderson, Oliver Lemon, Kallirroi Georgila, "Hybrid Reinforcement/Supervised Learning for Dialogue Policies from COMMUNICATOR Data". In IJCAI Workshop 2005.
– Jason Williams and Steve Young, "Partially Observable Markov Decision Processes for Spoken Dialog Systems". In CSL 2007.
– Antoine Bordes, Y-Lan Boureau, Jason Weston, "Learning End-to-End Goal-Oriented Dialog". In ICLR 2017.
– Wen et al., "A Network-based End-to-End Trainable Task-oriented Dialogue System". In EACL 2017.
Questions?
[email protected]://mi.eng.cam.ac.uk/~phs26/
Example Dialogues – Low Noise
Example Dialogues – High Noise