Adapting Discriminative Reranking to Grounded Language Learning

Joohyun Kim and Raymond J. Mooney
Department of Computer Science, The University of Texas at Austin
The 51st Annual Meeting of the Association for Computational Linguistics, August 5, 2013


Discriminative Reranking
- An effective approach for improving the performance of generative models with a secondary discriminative model.
- Applied to various NLP tasks:
  - Syntactic parsing (Collins, ICML 2000; Collins, ACL 2002; Charniak & Johnson, ACL 2005)
  - Semantic parsing (Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006)
  - Part-of-speech tagging (Collins, EMNLP 2002)
  - Semantic role labeling (Toutanova et al., ACL 2005)
  - Named entity recognition (Collins, ACL 2002)
  - Machine translation (Shen et al., NAACL 2004; Fraser & Marcu, ACL 2006)
  - Surface realization in language generation (White & Rajkumar, EMNLP 2009; Konstas & Lapata, ACL 2012)
- Goal: adapt discriminative reranking to grounded language learning.

Discriminative Reranking
- Generative model: the trained model outputs the single best result, the candidate with maximum probability.
  [Figure: Testing Example -> Trained Generative Model -> 1-best candidate with maximum probability]

Discriminative Reranking
- Can we do better?
- A secondary discriminative model picks the best output out of the n-best candidates from the baseline model.
  [Figure: Testing Example -> Trained Baseline Generative Model -> GEN -> n-best candidates (Candidate 1 ... Candidate n) -> Trained Secondary Discriminative Model -> Best prediction -> Output]

Discriminative Reranking
- Training the secondary discriminative model: the baseline generative model produces n-best training candidates, each with a probability.
  [Figure: Training Example -> Trained Baseline Generative Model -> GEN -> n-best training candidates with probabilities]

Discriminative Reranking
- The discriminative model's parameters are updated by comparing the best predicted candidate against the gold-standard reference.
  [Figure: the best prediction is compared with the gold-standard reference, and the comparison drives the parameter update]

Grounded Language Learning
- The process of acquiring the semantics of natural language with respect to relevant perceptual contexts.
- Supervision is ambiguous: it appears as the surrounding perceptual environment.
- Not a typical supervised learning task:
  - Only one or some of the perceptual contexts are relevant.
  - There is no single gold standard per training example.
- No standard discriminative reranking is available!
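For reference, the standard reranking decision described on the slides above boils down to scoring each of the baseline model's n-best candidates with a learned feature-weight vector and returning the highest-scoring one. A minimal sketch in Python, assuming a `features` function that maps a candidate to a sparse feature dictionary; the names are illustrative, not taken from the talk.

```python
def rerank(candidates, weights, features):
    """Pick the best of the baseline model's n-best candidates.

    candidates: list of candidate analyses from the baseline generative model
    weights:    dict mapping feature name -> learned weight
    features:   function mapping a candidate -> dict of feature name -> value
    """
    def score(candidate):
        feats = features(candidate)
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())

    # Return the candidate with the highest discriminative score.
    return max(candidates, key=score)
```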

Navigation Task (Chen & Mooney, 2011)
- Learn to interpret and follow natural-language navigation instructions, e.g., "Go down this hall and make a right when you see an elevator to your left."
- Uses virtual worlds and instructor/follower data from MacMahon et al. (2006).
- No prior linguistic knowledge.
- Infer language semantics by observing how humans follow instructions.

Sample Environment (Chen & Mooney, 2011)
[Map figure with object legend: H = Hat Rack, L = Lamp, E = Easel, S = Sofa, B = Barstool, C = Chair]

Executing Test Instruction

[Map figure: a test instruction is executed as a route from Start to End, passing positions 3 and 4 and the hat rack (H)]

Sample Navigation Instruction
- Instruction: "Take your first left. Go all the way down until you hit a dead end."
  [Map figure: the corresponding route from Start to End]

Sample Navigation Instruction
- Instruction: "Take your first left. Go all the way down until you hit a dead end."
- Observed primitive actions: Forward, Turn Left, Forward, Forward
- Encountered environments: back: BLUE HALLWAY; front: BLUE HALLWAY; left: CONCRETE HALLWAY; right/back/front: YELLOW HALLWAY; front/back: HATRACK; right: CONCRETE HALLWAY; front: WALL; right/left: WALL
  [Map figure: the route from Start to End, passing positions 3 and 4 and the hat rack (H)]


Sample Navigation Instruction
- Several instructions are given for the same route (see the map from Start to End, passing positions 3 and 4 and the hat rack):
  - "Take your first left. Go all the way down until you hit a dead end."
  - "Go towards the coat hanger and turn left at it. Go straight down the hallway and the dead end is position 4."
  - "Walk to the hat rack. Turn left. The carpet should have green octagons. Go to the end of this alley. This is p-4."
  - "Walk forward once. Turn left. Walk forward twice."

Task Objective
- Learn the underlying meanings of instructions by observing human actions taken for those instructions.
- Learn to map instructions (NL) into a correct formal plan of actions (meaning representation, MR).
- Learn from highly ambiguous supervision:
  - Training input is pairs of NL instruction / landmarks plan (Chen & Mooney, 2011).
  - A landmarks plan describes the actions in the environment along with notable objects encountered on the way.
  - It overestimates the meaning of the instruction, including unnecessary details; only a subset of the plan is relevant to the instruction.

Challenge
- Instruction: "at the easel, go left and then take a right onto the blue path at the corner"
- Landmarks plan:
  Travel ( steps: 1 ) ,
  Verify ( at: EASEL , side: CONCRETE HALLWAY ) ,
  Turn ( LEFT ) ,
  Verify ( front: CONCRETE HALLWAY ) ,
  Travel ( steps: 1 ) ,
  Verify ( side: BLUE HALLWAY , front: WALL ) ,
  Turn ( RIGHT ) ,
  Verify ( back: WALL , front: BLUE HALLWAY , front: CHAIR , front: HATRACK , left: WALL , right: EASEL )

Challenge
- The correct plan is only a subset of the landmarks plan, and the relevant subset is not marked in the training data.
- Exponential number of possibilities!
- A combinatorial matching problem between the instruction and the landmarks plan.

Baseline Generative Model
- PCFG induction model for grounded language learning (Kim & Mooney, EMNLP 2012).
- Transforms grounded language learning into a standard PCFG grammar induction task.
- Uses a set of pre-defined PCFG conversion rules to capture the probabilistic relationship between formal meaning representations (MRs) and natural language phrases (NLs).
- Uses a semantic lexicon to help define the generative process: larger semantic concepts (MRs) hierarchically generate smaller concepts and finally NL phrases.

Generative Process
[Figure: the context MR (components such as Turn(LEFT), Travel(steps: 2), Turn(RIGHT), and Verify facts like front: BLUE HALL, front: EASEL, left: HATRACK, at: SOFA, at: CHAIR) selects relevant lexemes (L1, L2), which in turn generate the NL words ("go", "left", "and", "to", "the", "sofa")]

How can we apply discriminative reranking?
- Standard discriminative reranking cannot be applied directly to grounded language learning:
  - There is no single gold-standard reference for each training example.
  - Instead, training provides only weak supervision from the surrounding perceptual context (the landmarks plan).
- Use response feedback from the perceptual world:
  - Evaluate candidate formal meaning representations (MRs) by executing them in simulated worlds.
  - The same execution is used to evaluate the final end task, plan execution.
  - Gives a weak indication of whether a candidate is good or bad.
- Use multiple candidate parses for the parameter update:
  - The response signal is weak and distributed over all candidates.

Reranking Model: Averaged Perceptron (Collins, ICML 2000)
- The parameter weight vector is updated whenever the trained model predicts a wrong candidate.
  [Figure: Training Example -> Trained Baseline Generative Model -> GEN -> n-best candidates; the best prediction is compared with the gold-standard reference and the weight vector is updated with the feature-vector difference]
- Applied to our baseline model for the navigation task (Kim & Mooney, 2012): the candidates are parse trees produced by the baseline model.

Response-based Weight Update
- A single gold-standard reference parse does not exist for each training example.
- Instead, pick a pseudo-gold parse out of all candidates:
  - Evaluate the MR plans composed from the candidate parses.
  - The MARCO execution module (MacMahon et al., AAAI 2006) runs and evaluates each candidate MR in the world; it is also used to evaluate the end goal, plan-execution performance.
  - Record the execution success rate: whether each candidate MR reaches the intended destination. MARCO is nondeterministic, so the rate is averaged over 10 trials.
  - Prefer the candidate with the best success rate during training.
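The standard averaged-perceptron reranking update described on these slides can be sketched in a few lines: when the current weights rank a wrong candidate above the gold-standard reference, the weights move toward the reference's features and away from the prediction's, and the final model averages the weights over all updates. A minimal sketch of the Collins (2000) style update, assuming a `features` function returning sparse dicts; this is an illustration, not the paper's actual code.

```python
from collections import defaultdict

def train_averaged_perceptron(examples, features, epochs=10):
    """Standard reranking perceptron with weight averaging (sketch).

    examples: list of (candidates, gold) pairs, where `gold` is the
              gold-standard reference candidate for that example
    features: function mapping a candidate -> dict of feature name -> value
    """
    weights = defaultdict(float)   # current weight vector
    totals = defaultdict(float)    # running sum of weights, for averaging
    steps = 0

    def score(cand):
        return sum(weights[f] * v for f, v in features(cand).items())

    for _ in range(epochs):
        for candidates, gold in examples:
            prediction = max(candidates, key=score)
            if prediction is not gold:
                # Move weights toward the gold reference, away from the prediction.
                for f, v in features(gold).items():
                    weights[f] += v
                for f, v in features(prediction).items():
                    weights[f] -= v
            # Accumulate the current weights for the averaged perceptron.
            steps += 1
            for f, v in weights.items():
                totals[f] += v

    return {f: v / steps for f, v in totals.items()}
```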

Response-based Update
- Select a pseudo-gold reference based on MARCO execution results.
  [Figure: MRs derived from the n-best candidates are run by the MARCO execution module; the candidate with the highest execution success rate becomes the pseudo-gold reference, and the weights are updated with the feature-vector difference between it and the best prediction]

Weight Update with Multiple Parses
- Candidates other than the pseudo-gold one can also be useful:
  - Multiple parses may share the same maximum execution rate.
  - A low execution rate can still correspond to a correct plan, given the indirect supervision of human follower actions.
  - Such MR plans may be underspecified or have ignorable details attached; they are sometimes inaccurate, but contain correct MR components for reaching the desired goal.
- Weight update with multiple candidate parses:
  - Use every candidate with a higher execution rate than the currently best-predicted candidate.
  - Each feature-difference update is weighted by the difference between the execution rates.
  [Figure: multiple candidates with higher execution rates than the predicted parse each contribute a weighted update - Update (1), Update (2), ...]
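A minimal sketch of the response-based update with multiple parses described above, assuming each candidate's derived MR plan has already been executed (e.g., by MARCO) to obtain a success rate; the function names and this single-example formulation are illustrative assumptions, not the paper's implementation.

```python
def response_based_update(weights, candidates, exec_rate, features):
    """One response-based weight update using multiple parses (sketch).

    weights:    dict of feature name -> weight, updated in place
    candidates: n-best candidate parses for one training example
    exec_rate:  function mapping a candidate -> execution success rate in [0, 1]
                (e.g., averaged over several MARCO runs of its derived MR plan)
    features:   function mapping a candidate -> dict of feature name -> value
    """
    def score(cand):
        return sum(weights.get(f, 0.0) * v for f, v in features(cand).items())

    prediction = max(candidates, key=score)
    pred_rate = exec_rate(prediction)

    # Every candidate that executes better than the current prediction
    # contributes an update, weighted by how much better it executes.
    for cand in candidates:
        margin = exec_rate(cand) - pred_rate
        if margin > 0:
            for f, v in features(cand).items():
                weights[f] = weights.get(f, 0.0) + margin * v
            for f, v in features(prediction).items():
                weights[f] = weights.get(f, 0.0) - margin * v
```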

Features
- Binary indicators of whether a certain composition of nonterminals appears in the parse tree (Collins, EMNLP 2002; Lu et al., EMNLP 2008; Ge & Mooney, ACL 2006).
- Example: NL "Turn left and find the sofa then turn around the corner", with nonterminal MR nodes such as:
  L1: Turn(LEFT), Verify(front: SOFA, back: EASEL), Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L2: Turn(LEFT), Verify(front: SOFA)
  L3: Travel(steps: 2), Verify(at: SOFA), Turn(RIGHT)
  L4: Turn(LEFT)
  L5: Travel(), Verify(at: SOFA)
  L6: Turn()
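A minimal sketch of such indicator features, assuming parse-tree nodes expose a label and their children; the node structure and feature naming are illustrative assumptions, not the paper's code.

```python
def parent_child_features(tree):
    """Collect binary parent-child composition indicators from a parse tree.

    tree: a node with a `label` string and a `children` list of nodes
          (an assumed, illustrative tree representation).
    Returns a dict of feature name -> 1.0 for every composition that appears.
    """
    feats = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        for child in node.children:
            # Binary indicator: this nonterminal composition occurs in the tree.
            feats["compose:%s->%s" % (node.label, child.label)] = 1.0
            stack.append(child)
    return feats
```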

Data
- 3 maps, 6 instructors, 1-15 followers per direction.
- Instructions are segmented into single-sentence steps to make learning easier (Chen & Mooney, 2011), and each single-sentence instruction is aligned with its landmarks plan.
- The single-sentence version is used for training; both the paragraph and single-sentence versions are used for testing.
- Example:
  - Paragraph: "Take the wood path towards the easel. At the easel, go left and then take a right on the blue path at the corner. Follow the blue path towards the chair and at the chair, take a right towards the stool. When you reach the stool, you are at 7."
  - Single sentences: "Take the wood path towards the easel." / "At the easel, go left and then take a right on the blue path at the corner." / ...
  - Observed follower actions: Turn, Forward, Turn left, Forward, Turn right, Forward x 3, Turn right, Forward / Turn, Forward, Turn left, Forward, Turn right

Evaluations
- Leave-one-map-out approach: 2 maps for training, 1 map for testing.
- Parse accuracy: evaluates how good the derived MR is when parsing novel sentences in the test data, using partial parse accuracy as the metric.
- Plan execution accuracy (end goal): tests how well the formal MR plan output reaches the destination; an execution counts as successful only if the final position matches exactly.
- Compared with Kim & Mooney (2012) as the baseline.
- All reranking results use 50-best parses: we select the 50 best distinct composed MR plans (and their parses) out of a sufficiently large 1,000,000-best parse list from the baseline model, since many parse trees differ only insignificantly and lead to the same derived MR plan.
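A minimal sketch of the 50-best distinct-MR selection step described in the Evaluations slide above, assuming a `derive_mr` function that composes the MR plan from a parse and that parses arrive sorted by baseline probability; the names are illustrative assumptions.

```python
def distinct_mr_candidates(parses, derive_mr, k=50):
    """Keep the top-k parses whose derived MR plans are distinct (sketch).

    parses:    parse trees sorted by baseline-model probability, best first
    derive_mr: function composing the formal MR plan from a parse;
               the result must be hashable (e.g., a canonical string)
    """
    selected, seen = [], set()
    for parse in parses:
        mr = derive_mr(parse)
        if mr not in seen:
            seen.add(mr)
            selected.append(parse)
            if len(selected) == k:
                break
    return selected
```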

Response-based Update vs. Baseline

                 Parse Accuracy (F1)   Plan Execution (Single)   Plan Execution (Paragraph)
  Baseline       74.81                 57.22                     20.17
  Gold-Standard  78.26                 52.57                     19.33
  Response       73.32                 59.65                     22.62

- The response-based approach performs better on the final end task, plan execution, because it optimizes the model directly against plan execution.

Response-based vs. Gold-Standard Update
- Gold-standard update: gold-standard parses are available only for evaluation purposes; grounded language learning does not provide them for training.
- The gold-standard update is better in parse accuracy; the response-based approach is better in plan execution.
- The gold-standard update misses some critical MR elements needed for reaching the goal.
- Reranking is therefore possible even when no gold-standard reference exists for the training data: use responses from the perceptual environment (tied to the end task) instead.

Response-based Update with Multiple vs. Single Parses

                 Parse Accuracy (F1)   Plan Execution (Single)   Plan Execution (Paragraph)
  Single         73.32                 59.65                     22.62
  Multi          73.43                 62.81                     26.57

- Using multiple parses is better than using a single parse:
  - The single best pseudo-gold parse provides only weak feedback.
  - Candidates with low execution rates mostly produce underspecified plans or plans with ignorable details, but they capture the gist of the preferred actions.
  - A variety of preferable parses improves the amount and the quality of the weak feedback, yielding a better model.

Conclusion
- Adapted discriminative reranking to grounded language learning, where no single gold-standard parse exists during training.
- Response-based feedback, provided by natural responses from the perceptual world, can serve as an alternative.
- The weak supervision from response feedback can be further improved by using multiple preferable parses.

Thank you for your time! Questions?