Interactive Language Acquisition with One-shot Visual Concept Learning through a Conversational Game
Haichao Zhang, Haonan Yu, and Wei Xu
Baidu Research - Institute of Deep Learning, Sunnyvale, USA
Presenter: Zhexiong Liu
Outline
1. Introduction
2. Related work
3. Conversational game
4. Proposed approach
5. Experimental result
6. Conclusion
Introduction: Task description
Train an intelligent agent that can communicate with, as well as learn from, humans.
Communication
One-shot Learning
Introduction: Motivation from humans
• Humans can learn from the consequences of their responses, in the form of verbal and behavioral feedback
• Humans have shown the ability to learn new concepts from a small amount of data
• Agents should likewise actively seek and memorize information, and develop a one-shot learning ability
Related work: Achievement & limitation
Approach: Supervised Language Learning
• Achievement: captures the statistics of the training data
• Limitation: less flexible for acquiring new knowledge without retraining
Approach: Reinforcement Learning for Sequences
• Achievement: selects actions from a set of candidate sequences
• Limitation: still needs to learn to generate a new sequence action, rather than only selecting from candidates
Related work: Achievement & limitation
Approach: Communication and Emergence of Language
• Achievement: uses a guesser-responder setting to achieve a communication goal
• Limitation: hard to obtain transferable speaking and one-shot abilities
Approach: One-shot Learning and Active Learning
• Achievement: one-shot learning has been investigated in image classification
• Limitation: does not target language learning and one-shot learning via conversational interaction
Related work: Challenge of recent approaches
• Inflexible for acquiring new knowledge without inefficient retraining or catastrophic forgetting
• Applications require rapid learning from small datasets
Conversational Game: Game Rule
• Participants: teacher & learner/agent
• Strategies: teach & learn a concept
Conversational Game: Teacher Speaks
The teacher randomly selects an object and interacts with the learner about that object by:
• posing a question
• saying nothing
• making a statement
Conversational Game: Learner Reward
The learner acts on the teacher's response:
• Raises a question → reward (+1)
• Makes a correct statement → reward (+1)
• Gives an incorrect response → reward (-1)
• Says nothing / stays silent → reward (-1)
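The reward scheme above can be sketched as a simple function (a minimal sketch; the response categories and values follow the slide, while the function and label names are our own):

```python
def reward(response_type: str) -> int:
    """Return the learner's reward for one conversational turn.

    +1 for raising a question or making a correct statement;
    -1 for an incorrect response or for staying silent.
    """
    if response_type in ("question", "correct_statement"):
        return 1
    if response_type in ("incorrect_response", "silence"):
        return -1
    raise ValueError(f"unknown response type: {response_type}")
```

This signal is what the reinforcement component later maximizes as expected future reward.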
Proposed approach: Imitation & Reinforce
Imitation Function:
• The agent perceives sentences and images, and saves the extracted information for later use
Reinforce Function:
• Agent leverages feedback from the teacher to converse adaptively by adjusting the action policy
Proposed approach: Network Structure
Proposed approach: Imitation Learning
Imitation is achieved by predicting the probability of the teacher's next sentence W^t conditioned on the image and the conversation history H^t, which factorizes into the product of the per-word predictions:

p_θ(W^t | H^t) = ∏_i p_θ(w_i^t | w_1^t, …, w_{i-1}^t, H^t)

where the last state of the RNN serves as the summarization of the conversation history H^t.
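Since the imitation objective is a product of per-word probabilities, it is trained as a per-word negative log-likelihood loss. A minimal numpy sketch under that assumption (the function name is ours):

```python
import numpy as np

def imitation_nll(word_probs):
    """Negative log-likelihood of a teacher sentence under the model.

    `word_probs[i]` is the model's predicted probability of the i-th
    teacher word given the preceding words, the image, and the
    conversation history; the sentence probability is their product,
    so the loss is the sum of per-word negative log-probabilities.
    """
    word_probs = np.asarray(word_probs, dtype=float)
    return float(-np.sum(np.log(word_probs)))
```

Minimizing this loss over teacher sentences is what makes the agent's language model imitate the teacher.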
Proposed approach: Imitation Learning
The generation of the next word adaptively combines:
• the predictive distribution of the next word from the RNN
• the information stored in the external memory
where the two terms represent the probability of the next word under the RNN prediction and under the external memory, respectively.
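One simple way to combine the two next-word distributions is a gated convex combination; this is a sketch of the idea under that assumption, not the paper's exact learned fusion:

```python
import numpy as np

def fused_next_word(p_rnn, p_mem, g):
    """Fuse two next-word distributions with a scalar gate g in [0, 1].

    p_rnn: predictive distribution from the RNN decoder.
    p_mem: distribution derived from the external memory.
    g close to 1 favors the memory; g close to 0 favors the RNN.
    """
    p_rnn = np.asarray(p_rnn, dtype=float)
    p_mem = np.asarray(p_mem, dtype=float)
    p = g * p_mem + (1.0 - g) * p_rnn
    return p / p.sum()  # renormalize for numerical safety
```

The gate lets the learner fall back on memorized one-shot content (e.g., a newly taught object name) when the RNN alone is uncertain.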
Proposed approach: Reading Memory
A visual encoder, implemented as a CNN, encodes the visual image into a visual key, which is used to address the memory, where separate memories are maintained for the visual and sentence modalities.
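The memory read can be sketched as standard content-based soft attention: score each stored key against the query (e.g., the CNN visual key), normalize, and return the weighted sum of stored values. A minimal sketch of that mechanism (names are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def read_memory(key, mem_keys, mem_values):
    """Content-based soft read from an external memory.

    key:        query vector, e.g. the visual key from the CNN encoder.
    mem_keys:   (slots, d) array of stored keys.
    mem_values: (slots, v) array of stored values.
    Returns the attention-weighted combination of the stored values.
    """
    scores = mem_keys @ key   # similarity of the query to each slot
    attn = softmax(scores)    # normalized read weights
    return attn @ mem_values  # weighted sum over memory slots
```

With distinct visual and sentence memories, the same read mechanism lets a visual key retrieve the sentence content stored alongside it.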
Proposed approach: Writing memory
Memory writing is similar to reading, but with a content importance gate controlling whether the content should be written into memory,
where the three quantities involved are the memory, the content, and the gate, respectively.
For the visual modality, the content is the visual key. For the sentence modality, the content is an attention-weighted sum of word embeddings, where the attention vector comes from a BiLSTM.
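A gated write can be sketched as an interpolated update of one memory slot; this is a simplified sketch (the paper's write uses learned addressing rather than an explicit slot index):

```python
import numpy as np

def write_memory(memory, content, gate, slot):
    """Write `content` into `memory[slot]`, scaled by a
    content-importance gate in [0, 1].

    gate = 0 leaves the slot unchanged (content judged unimportant);
    gate = 1 overwrites the slot with the new content.
    """
    memory = memory.copy()
    old = memory[slot]
    memory[slot] = (1.0 - gate) * old + gate * np.asarray(content, dtype=float)
    return memory
```

The gate is what prevents uninformative turns (e.g., the teacher saying nothing) from polluting the one-shot memory.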
Proposed approach: Reinforcement Learning
• Generate the agent's response from a distribution over all possible sequences
• Share the parameters of the imitation network, with a modulator, to learn in an adaptive manner
• The policy is adjusted by maximizing the expected future reward
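Adjusting the policy to maximize expected future reward is typically done with a REINFORCE-style policy gradient: weight the gradient of each action's log-probability by its (baseline-subtracted) return. A generic sketch of that estimator, not the paper's exact formulation:

```python
import numpy as np

def reinforce_grad(logprob_grads, rewards, baseline=0.0):
    """REINFORCE-style policy-gradient estimate.

    logprob_grads: per-action gradients of log pi(a | s) w.r.t. the
                   policy parameters (one vector per sampled action).
    rewards:       per-action returns from the conversational game.
    Actions that earned higher reward pull the parameters toward
    making those actions more probable.
    """
    grads = [np.asarray(g, dtype=float) for g in logprob_grads]
    return sum((r - baseline) * g for g, r in zip(grads, rewards))
```

A baseline (e.g., the average reward) reduces the variance of the estimate without biasing it.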
Experiments: Dataset
• Text dataset
• Animal dataset: 40 classes with 408 images in total, about 10 images per class
• Fruit dataset: 16 classes with 48 images in total, 3 images per class
• Train on the Animal dataset, test on the Fruit dataset
Experiments: Setup
The training algorithm is implemented on the deep learning platform PaddlePaddle.
• batch size is 16
• learning rate is 1×10^-5
• weight decay rate is 1.6×10^-3
• word embedding dimension d is 1024
• visual image size is 32×32
Experiments: Baseline model
• Reinforce: the same network structure as the proposed model, trained using RL only
• Imitation: the same structure as the proposed model, trained using imitation only
• Imitation + Gaussian + RL: a joint imitation and reinforcement method using a Gaussian policy
Experiments: Learning on word
• Proposed reaches the highest success rate (97.4%) and average reward (+1.1)
Experiments: Learning with Image Variations
• Objective: test the impact of within-class image variations on one-shot learning
Model trained without (a, c) and with (b, d) image variations on the Animal dataset.
Experiments: Learning on Sentence
• Extract useful information, which could appear at different locations in the sentence
• The learner has to adaptively fuse information from the RNN and the external memory to generate a complete sentence
Experiments: Proposed Approach
Conclusion
• Contribution: the authors presented an effective approach for grounded language acquisition with one-shot visual concept learning through joint imitation and reinforcement learning.
• Limitation: while offering flexibility in training, a synthetic task has a limited amount of variation compared to real-world scenarios with natural language.