Interactive Language Acquisition with One-shot Visual Concept Learning through a Conversational Game
Haichao Zhang, Haonan Yu, and Wei Xu
Baidu Research - Institute of Deep Learning, Sunnyvale, USA
Presenter: Zhexiong Liu
Outline
1. Introduction
2. Related work
3. Conversational game
4. Proposed approach
5. Experimental result
6. Conclusion
Introduction: Task description
Train an intelligent agent that can communicate with, as well as learn from, humans.
Communication
One-shot Learning
Introduction: Motivation from humans
• Humans can learn from the consequences of their responses, in the form of verbal and behavioral feedback
• Humans have shown the ability to learn new concepts from a small amount of data
• Agents should likewise actively seek and memorize information, and develop a one-shot learning ability
Related work: Achievement & limitation
Approach: Supervised Language Learning
• Achievement: captures the statistics of the training data
• Limitation: less flexible for acquiring new knowledge without retraining
Approach: Reinforcement Learning for Sequences
• Achievement: selects actions from a set of candidate sequences
• Limitation: still needs to learn to generate a new sequence action, rather than only selecting from candidates
Related work: Achievement & limitation
Approach: Communication and Emergence of Language
• Achievement: uses a guesser-responder setting to achieve a communication goal
• Limitation: hard to obtain transferable speaking and one-shot abilities
Approach: One-shot Learning and Active Learning
• Achievement: one-shot learning has been investigated in image classification
• Limitation: does not target language learning and one-shot learning via conversational interaction
Related work: Challenge of recent approaches
• Inflexible for acquiring new knowledge without inefficient retraining or catastrophic forgetting
• Applications require rapid learning from small datasets
Conversational Game: Game Rule
• Participants: teacher & learner/agent
• Strategies: teach & learn a concept
Conversational Game: Teacher Speaks
The teacher randomly selects an object and interacts with the learner about that object by:
• posing a question
• saying nothing
• making a statement
Conversational Game: Learner Reward
The learner acts on the teacher's response:
• Raises a question → reward (+1)
• Makes a correct statement → reward (+1)
• Gives an incorrect response → reward (-1)
• Says nothing / stays silent → reward (-1)
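The reward scheme above can be sketched as a simple function (a minimal sketch; the response categories and values follow the slide, while the function and label names are our own):

```python
def reward(response_type: str) -> int:
    """Return the learner's reward for one conversational turn.

    +1 for raising a question or making a correct statement;
    -1 for an incorrect response or for staying silent.
    """
    if response_type in ("question", "correct_statement"):
        return 1
    if response_type in ("incorrect_response", "silence"):
        return -1
    raise ValueError(f"unknown response type: {response_type}")
```

This signal is what the reinforcement component later maximizes as expected future reward.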
Proposed approach: Imitation & Reinforce
Imitation Function:
• The agent perceives sentences and images, and saves the extracted information for later use
Reinforce Function:
• Agent leverages feedback from the teacher to converse adaptively by adjusting the action policy
Proposed approach: Network Structure
Proposed approach: Imitation Learning
Imitation is achieved by predicting the probability of the teacher's next sentence W^t conditioned on the image and the conversation history H^t, which factorizes into the product of the per-word predictions:

p_θ(W^t | H^t) = ∏_i p_θ(w_i^t | w_1^t, …, w_{i-1}^t, H^t)

where the last state of the RNN serves as the summarization of the conversation history H^t.
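Since the imitation objective is a product of per-word probabilities, it is trained as a per-word negative log-likelihood loss. A minimal numpy sketch under that assumption (the function name is ours):

```python
import numpy as np

def imitation_nll(word_probs):
    """Negative log-likelihood of a teacher sentence under the model.

    `word_probs[i]` is the model's predicted probability of the i-th
    teacher word given the preceding words, the image, and the
    conversation history; the sentence probability is their product,
    so the loss is the sum of per-word negative log-probabilities.
    """
    word_probs = np.asarray(word_probs, dtype=float)
    return float(-np.sum(np.log(word_probs)))
```

Minimizing this loss over teacher sentences is what makes the agent's language model imitate the teacher.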
Proposed approach: Imitation Learning
The generation of the next word adaptively combines:
• the predictive distribution of the next word from the RNN
• the information stored in the external memory
where the two terms represent the probability of the next word under the RNN prediction and under the external memory, respectively.
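One simple way to combine the two next-word distributions is a gated convex combination; this is a sketch of the idea under that assumption, not the paper's exact learned fusion:

```python
import numpy as np

def fused_next_word(p_rnn, p_mem, g):
    """Fuse two next-word distributions with a scalar gate g in [0, 1].

    p_rnn: predictive distribution from the RNN decoder.
    p_mem: distribution derived from the external memory.
    g close to 1 favors the memory; g close to 0 favors the RNN.
    """
    p_rnn = np.asarray(p_rnn, dtype=float)
    p_mem = np.asarray(p_mem, dtype=float)
    p = g * p_mem + (1.0 - g) * p_rnn
    return p / p.sum()  # renormalize for numerical safety
```

The gate lets the learner fall back on memorized one-shot content (e.g., a newly taught object name) when the RNN alone is uncertain.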
Proposed approach: Reading Memory
A visual encoder, implemented as a CNN, encodes the visual image into a visual key, which is used to address the memory, where separate memories are maintained for the visual and sentence modalities.
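The memory read can be sketched as standard content-based soft attention: score each stored key against the query (e.g., the CNN visual key), normalize, and return the weighted sum of stored values. A minimal sketch of that mechanism (names are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def read_memory(key, mem_keys, mem_values):
    """Content-based soft read from an external memory.

    key:        query vector, e.g. the visual key from the CNN encoder.
    mem_keys:   (slots, d) array of stored keys.
    mem_values: (slots, v) array of stored values.
    Returns the attention-weighted combination of the stored values.
    """
    scores = mem_keys @ key   # similarity of the query to each slot
    attn = softmax(scores)    # normalized read weights
    return attn @ mem_values  # weighted sum over memory slots
```

With distinct visual and sentence memories, the same read mechanism lets a visual key retrieve the sentence content stored alongside it.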
Proposed approach: Writing memory
Memory writing is similar to reading, but with a content importance gate controlling whether the content should be written into memory,
where the three quantities involved are the memory, the content, and the gate, respectively.
For the visual modality, the content is the visual key. For the sentence modality, the content is an attention-weighted sum of word embeddings, where the attention vector comes from a BiLSTM.
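A gated write can be sketched as an interpolated update of one memory slot; this is a simplified sketch (the paper's write uses learned addressing rather than an explicit slot index):

```python
import numpy as np

def write_memory(memory, content, gate, slot):
    """Write `content` into `memory[slot]`, scaled by a
    content-importance gate in [0, 1].

    gate = 0 leaves the slot unchanged (content judged unimportant);
    gate = 1 overwrites the slot with the new content.
    """
    memory = memory.copy()
    old = memory[slot]
    memory[slot] = (1.0 - gate) * old + gate * np.asarray(content, dtype=float)
    return memory
```

The gate is what prevents uninformative turns (e.g., the teacher saying nothing) from polluting the one-shot memory.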
Proposed approach: Reinforcement Learning
• Generate the agent's response from a distribution over all possible sequences
• Share the parameters of the imitation network, with a modulator, to learn in an adaptive manner
• The policy is adjusted by maximizing the expected future reward
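Adjusting the policy to maximize expected future reward is typically done with a REINFORCE-style policy gradient: weight the gradient of each action's log-probability by its (baseline-subtracted) return. A generic sketch of that estimator, not the paper's exact formulation:

```python
import numpy as np

def reinforce_grad(logprob_grads, rewards, baseline=0.0):
    """REINFORCE-style policy-gradient estimate.

    logprob_grads: per-action gradients of log pi(a | s) w.r.t. the
                   policy parameters (one vector per sampled action).
    rewards:       per-action returns from the conversational game.
    Actions that earned higher reward pull the parameters toward
    making those actions more probable.
    """
    grads = [np.asarray(g, dtype=float) for g in logprob_grads]
    return sum((r - baseline) * g for g, r in zip(grads, rewards))
```

A baseline (e.g., the average reward) reduces the variance of the estimate without biasing it.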
Experiments: Dataset
• Text dataset
• Animal dataset: 40 classes with 408 images in total, about 10 images per class
• Fruit dataset: 16 classes with 48 images in total, 3 images per class
• Train on the Animal dataset, test on the Fruit dataset
Experiments: Setup
The training algorithm is implemented on the deep learning platform PaddlePaddle.
• batch size is 16
• learning rate is 1×10^-5
• weight decay rate is 1.6×10^-3
• word embedding dimension d is 1024
• visual image size is 32×32
Experiments: Baseline model
• Reinforce: the same network structure as the proposed model, trained using RL only
• Imitation: the same structure as the proposed model, trained using imitation only
• Imitation + Gaussian + RL: a joint imitation and reinforcement method using a Gaussian policy
Experiments: Learning on word
• Proposed reaches the highest success rate (97.4%) and average reward (+1.1)
Experiments: Learning with Image Variations
• Objective: test the impact of within-class image variations on one-shot learning
Model trained without (a, c) and with (b, d) image variations on the Animal dataset.
Experiments: Learning on Sentence
• Extract useful information, which could appear at different locations in the sentence
• The learner has to adaptively fuse information from the RNN and the external memory to generate a complete sentence
Experiments: Proposed Approach
Conclusion
• Contribution: the authors presented an effective approach for grounded language acquisition with one-shot visual concept learning through joint imitation and reinforcement learning.
• Limitation: while offering flexibility in training, a synthetic task has a limited amount of variation compared to real-world scenarios with natural language.