
Page 1

Deep Learning for Computer Vision, Spring 2019

http://vllab.ee.ntu.edu.tw/dlcv.html (primary)

https://ceiba.ntu.edu.tw/1072CommE5052 (grade, etc.)

FB: DLCV Spring 2019

Yu-Chiang Frank Wang 王鈺強, Associate Professor

Dept. Electrical Engineering, National Taiwan University

2019/06/12

Page 2

What's to be Covered …

• Learning Beyond Images (Part II)
  • Audio-Visual Event Localization
  • Spatial Audio Generation
  • Decomposing Sounds of Visual Objects
• Few-Shot Learning
  • Slides by Chia-Ching Lin ([email protected])
• About Final Presentation
  • Date/time: 6/25 Tue 1:30pm-5pm

2

Page 3

What's to be Covered …

• Learning Beyond Images (Part II)
  • Audio-Visual Event Localization
  • Spatial Audio Generation
  • Decomposing Sounds of Visual Objects
• Few-Shot Learning
  • Slides by Chia-Ching Lin
• About Final Presentation
  • Date/time: 6/25 Tue 1:30pm-5pm

3

Page 4

Decomposing Sounds of Visual Objects

• Goal
  • Separating mixed sounds into individual sounds corresponding to the associated objects
  • Can be done in a supervised or (preferably) unsupervised way

• References
  • The Sound of Pixels, ECCV 2018
  • The Sound of Motions, arXiv
  • Co-Separating Sounds of Visual Objects, arXiv

4

Page 5

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018

5

Training pipeline: visual inputs + mixed audio sources. No ground-truth separated audio is available; the audio track from each respective video is used as the training target.
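As a rough illustration of this training setup, the Python sketch below mixes the audio of two videos and supervises a hypothetical mask-predicting network sep_net with each video's own spectrogram; it only illustrates the objective, not the paper's actual architecture.

import torch
import torch.nn.functional as F

def mix_and_separate_step(video_feats, audio_specs, sep_net):
    # video_feats: (2, D) visual features; audio_specs: (2, F, T) magnitude spectrograms.
    mix_spec = audio_specs.sum(dim=0, keepdim=True)        # the artificially mixed "recording"
    loss = 0.0
    for i in range(2):
        mask = sep_net(video_feats[i], mix_spec)           # predicted separation mask for video i
        est_spec = mask * mix_spec                         # separated spectrogram for video i
        loss = loss + F.l1_loss(est_spec, audio_specs[i:i + 1])  # each video's own audio is the target
    return loss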

Page 6

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018

6

Page 7

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018

7

(Figure: evaluation with a mixed audio input.)

Page 8

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Evaluation on the MUSIC dataset
  • Project page

8

Page 9

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Evaluation: NSDR, SIR, and SAR (ground-truth and predicted sources are needed)
    • NSDR (Normalized Signal-to-Distortion Ratio)
    • SIR (Signal-to-Interference Ratio)
    • SAR (Signal-to-Artifact Ratio)
  • Higher is better for all three metrics, i.e., less residual distortion, interference, and artifacts

9
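A minimal way to obtain such separation metrics is the mir_eval toolbox (one common choice; the paper's exact evaluation code may differ). NSDR is then typically reported as the SDR of the estimate minus the SDR obtained when the mixture itself is used as the estimate.

import numpy as np
import mir_eval  # pip install mir_eval

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))                   # ground-truth sources (2 sources, 1 s @ 16 kHz)
estimate = reference + 0.1 * rng.standard_normal((2, 16000))  # separated sources from a model (dummy here)
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimate)
print(sdr.mean(), sir.mean(), sar.mean())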

Page 10

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Evaluation

10

Page 11

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Evaluation

11

For each pixel in a video frame, they take vectorized log spectrogram magnitudes, and project them onto 3D RGB space using PCA for visualization purposes.

Page 12

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Demo video

12

Page 13

Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Network architecture
  • Fusion module

13

Appearance features are used to compute attention weights, which are then applied to the trajectory features (from optical flow). The first frame is used as the appearance input.

Page 14

Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Network architecture
  • Fusion module
    • Temporal aligning: f_v and f_s are the visual and sound features; γ and β are simple linear layers

14

Page 15

Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Evaluation: SDR, SIR, and SAR (ground-truth and predicted sources are needed)
  • Dataset: MUSIC-21 (an extension of MUSIC)

15

Experiments on mixtures of 2 videos; experiments on mixtures of 3 and 4 videos.

Page 16

Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Evaluation

16

They project sound features into a low dimensional space, and visualize them in RGB space.

Page 17

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Disentangles sounds in realistic videos, even in cases where an object was not observed individually during training

17

Individual instrument

Page 18

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Network architecture
  • Works with a pre-trained object detector (Faster R-CNN)

18

Training pipeline: visual inputs + mixed audio sources (following The Sound of Pixels, ECCV 2018).

Page 19

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Network architecture
  • Audio-Visual Separator (for separating the sound of each object)

19

The same module as in 2.5D Visual Sound, CVPR 2019 (by the same authors).

Page 20

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Network architecture

20

Page 21

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Training loss
    • Co-separation loss: the distance between the predicted and ground-truth spectrograms
    • Object-consistency loss: each separated spectrogram should be consistent with the category of the visual object it corresponds to (ResNet-18 audio classifier)

21

|V1| and |V2| are the numbers of objects detected in the two videos.
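One plausible reading of these two terms, written as a PyTorch sketch with illustrative shapes and names (the separated spectrograms of the objects detected in a video are summed and compared against that video's own spectrogram, while an audio classifier enforces object consistency):

import torch
import torch.nn.functional as F

def co_separation_losses(pred_specs_v1, pred_specs_v2, gt_spec_v1, gt_spec_v2,
                         audio_logits, object_labels, lam=0.05):
    # pred_specs_v*: (|V*|, F, T) separated spectrograms, one per detected object in each video
    recon_v1 = pred_specs_v1.sum(dim=0)          # objects of video 1 should add back up to its spectrogram
    recon_v2 = pred_specs_v2.sum(dim=0)
    l_cosep = F.l1_loss(recon_v1, gt_spec_v1) + F.l1_loss(recon_v2, gt_spec_v2)
    # audio_logits: (|V1|+|V2|, C) predictions of an audio classifier on the separated spectrograms
    l_consist = F.cross_entropy(audio_logits, object_labels)    # match each visual object's category
    return l_cosep + lam * l_consist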

Page 22

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Evaluation: SDR, SIR, and SAR
  • Dataset: MUSIC
  • Project page (demo video only)

22

single-source videos (solo) only; multi-source videos (solo + duet)

Page 23

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Demo video

23

Page 24

What's to be Covered …

• Learning Beyond Images (Part II)
  • Audio-Visual Event Localization
  • Spatial Audio Generation
  • Decomposing Sounds of Visual Objects
• Few-Shot Learning
  • Slides by Chia-Ching Lin ([email protected])
• About Final Presentation
  • Date/time: 6/25 Tue 1:30pm-5pm

24

Page 25

Outline for Few-Shot Learning

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 26

Motivation of FSL

• Training deep neural networks for visual classification typically requires a large amount of labeled training data (e.g., ImageNet)

• Humans, on the other hand, are generally able to learn novel concepts with very little supervision (even one shot per class works)

• Example: which character below is the same as ?

• This motivates few-shot learning (FSL), in which only few samples would be available for all or selected object categories during learning

Introduction

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

Page 27

FSL Scenarios

• All classes have very few examples per class (e.g., Omniglot* involves 1,623 characters, each with 20 examples)

• Involve a set of base classes 𝒞base (possibly with many examples per class) and a set of novel classes 𝒞novel with few examples per class

Introduction

* https://github.com/brendenlake/omniglot
[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017

(Figure: base classes 𝒞base vs. novel classes 𝒞novel.)

Page 28

FSL Approaches

• Two main levels of approaches to tackle few-shot learning*

  • Data-level (make more examples)
    • Exploitation of external data
    • Augmentation
    • Data hallucination
    • etc.

  • Parameter-level (add constraints on the parameters to alleviate overfitting)
    • Regularization
    • Meta learning (including metric learning)
    • etc.

  • "Data augmentation and regularization techniques alleviate overfitting in low data regimes, but do not solve it." [3]

Introduction

* https://medium.com/sap-machine-learning-research/deep-few-shot-learning-a1caa289f18
[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016

Page 29

Works Related to FSL

• Also called low-shot learning or one-shot learning in the literature

• Related to transfer learning (𝒞base ≈ source; 𝒞novel ≈ target), with the following differences:
  • Not necessarily few-shot in the target domain
  • In many transfer learning problems (e.g., domain adaptation), the source and target domains share the same set of classes, whereas 𝒞base and 𝒞novel are disjoint in FSL

Introduction

(Figure: source domain SVHN, target domain MNIST.)

Page 30

Works Related to FSL

• Different from zero-shot learning (𝒞base ≈ seen; 𝒞novel ≈ unseen)
  • Training
    • Seen classes: images and semantic information (e.g., attributes: [zebra-striped, four-legged, deer-like face, …])
    • Unseen classes: only semantic information
  • Testing

Introduction

[4] X. Wang and Y. Ye, "Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs," CVPR, 2018

(Figure: seen classes such as zebra and deer have images, labels, and attribute vectors, e.g., zebra = [1, 1, 0, …] and deer = [0, 1, 1, …]; the unseen class okapi has only its attribute vector [1, 1, 1, …]. At test time, the model predicts attributes from the image, e.g., f(image) = [0.7, 0.9, 0.8, …], and matches them to the unseen class okapi.)

Page 31

FSL Tasks

• N-class k-shot
  • Only the test example x̂ is presented during testing, with the goal of predicting the correct label of x̂ among all N classes
  • Mainly used in hallucination approaches

• N-way k-shot
  • An additional labeled support set S = {(x_i, y_i)}_{i=1}^{Nk} from N different classes is available for each query example x̂, and the task is to select the most similar image(s) from S for x̂
  • N need not be equal to the total number of classes
  • Mainly used in meta-learning approaches with episode-based training and testing
  • For each episode, further "training" is needed to rapidly adapt to the new concepts in S

Introduction

N-way k-shot example: S = { five support images }, x̂ = a query image, f(x̂, S) = [0.6, 0.1, 0.1, 0.1, 0.1] ∈ ℝ^5
N-class k-shot example: x̂ = a query image, f(x̂) = [0.001, 0.005, …, 0.15, …, 0.001] ∈ ℝ^1623

Page 32

Performance Evaluation in FSL

• Closed-world: evaluates performance on the novel classes 𝒞novel only

• Open-world: evaluates performance on the joint set of classes 𝒞base ∪ 𝒞novel

Introduction

Page 33

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 34

Data Hallucination

• “Many modes of (intra-class) variation (for example camera pose, translation, lighting changes, and even articulation) are shared across categories” [5]

• “As humans, our knowledge of these shared modes of (intra-class) variation may allow us to visualize what a novel object might look like in other poses or surroundings” [5]

• We can thus hallucinate additional examples for novel classes by transferring modes of variation from the base classes

Hallucination Approaches (learning to augment)

[5] Y.-X. Wang et al., "Low-Shot Learning from Imaginary Data," CVPR, 2018

Page 35

Attribute-Guided Augmentation [6] (1/2)

• Idea
  • Images with attribute annotations (e.g., depth) are used to learn multiple mappings to synthesize additional features with desired attribute values (e.g., 1 m → 3 m)
  • The mapping functions are object-agnostic and operate in feature space

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

Page 36

Attribute-Guided Augmentation [6] (1/2)

• Idea
  • Images with attribute annotations (e.g., depth) are used to learn multiple mappings to synthesize additional features with desired attribute values (e.g., 1 m → 3 m)
  • The mapping functions are object-agnostic and operate in feature space

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

Step 1: Train a predictor for each attribute (e.g., depth) in a supervised manner
γ_depth( · ) = 1.2, γ_depth( · ) = 3.5, …

Page 37

Attribute-Guided Augmentation [6] (1/2)

• Idea
  • Images with attribute annotations (e.g., depth) are used to learn multiple mappings to synthesize additional features with desired attribute values (e.g., 1 m → 3 m)
  • The mapping functions are object-agnostic and operate in feature space

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

Step 1: Train a predictor for each attribute (e.g., depth) in a supervised manner
γ_depth( · ) = 1.2, γ_depth( · ) = 3.5, …

Step 2: Train attribute-guided augmenters based on the trained attribute predictor (e.g., γ_depth)
φ_[1,2]^3( · ) = · ,  φ_[3,4]^2( · ) = ·

Page 38

Attribute-Guided Augmentation [6] (1/2)

• Idea
  • Images with attribute annotations (e.g., depth) are used to learn multiple mappings to synthesize additional features with desired attribute values (e.g., 1 m → 3 m)
  • The mapping functions are object-agnostic and operate in feature space

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

Step 1: Train a predictor for each attribute (e.g., depth) in a supervised manner
γ_depth( · ) = 1.2, γ_depth( · ) = 3.5, …

Step 2: Train attribute-guided augmenters based on the trained attribute predictor (e.g., γ_depth)
φ_[1,2]^3( · ) = · ,  φ_[3,4]^2( · ) = ·

Step 3: Apply the trained mapping functions (the φ_i^k's) to augment additional features for the novel classes
φ_[1,2]^3( · ) = · ,  φ_[3,4]^2( · ) = · ,  φ_[1,2]^3( · ) = ·

Page 39

Attribute-Guided Augmentation [6] (2/2)

• Attribute predictors (γ)
  • One for each attribute (e.g., γ_depth and γ_pose), trained in a supervised manner

• Attribute-guided augmenter (φ)
  • For a specific attribute, multiple mapping functions are trained for different original values (divided into I intervals {[l_i, h_i]}_{i=1}^{I}) and target values ({t_k}_{k=1}^{T}), with each φ_i^k implemented similarly to an autoencoder:

• Suppose there are A attributes, I intervals per attribute, and T target values; then we need to train a total of A attribute predictors and roughly A · I · T augmenters

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

L(x, t_k; φ_i^k) = ‖γ(φ_i^k(x)) − t_k‖² + λ‖φ_i^k(x) − x‖²

The first term is the target-attribute mismatch penalty; the second term is a regularizer that preserves the class of the augmented feature. Here x is a feature extracted by Fast R-CNN (FC7-layer activations).
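A compact sketch of this objective in PyTorch (phi and gamma stand for a trained augmenter φ_i^k and attribute predictor γ; the names and the value of λ are illustrative):

def aga_loss(x, t_k, phi, gamma, lam=1.0):
    # x and t_k are torch tensors; phi and gamma are torch modules
    x_aug = phi(x)                                       # synthesized feature with the desired attribute
    attr_mismatch = (gamma(x_aug) - t_k).pow(2).sum()    # ||gamma(phi(x)) - t_k||^2
    preserve = (x_aug - x).pow(2).sum()                  # ||phi(x) - x||^2 keeps the feature close to x
    return attr_mismatch + lam * preserve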

Page 40

Cross-Modal Hallucination [7] (1/2)

• Idea
  • The lack of data in one modality (e.g., images) can be compensated by abundant data in the other modality (e.g., text), through previously learned alignments between the two modalities
  • In this paper, fine-grained images with detailed textual descriptions are used to build a text-conditional GAN for image generation
  • Generated images should be not only realistic but also class-discriminative

Hallucination Approaches (learning to augment)

[7] F. Pahde et al., "Cross-modal Hallucination for Few-shot Fine-grained Recognition," CVPR Workshop, 2018

Page 41

Cross-Modal Hallucination [7] (2/2)

• Discriminative text-conditional GAN (tcGAN)
  • First, train a tcGAN on samples from 𝒞base with the regular objective function:
  • Next, augment ℒ_tcGAN by adding a class-discriminative loss (similar to ACGAN) and fine-tune the tcGAN on the few-shot samples from 𝒞novel with the compound losses:
  • D* = argmax_D ℒ(D) and G* = argmin_G ℒ(G)

Hallucination Approaches (learning to augment)

[7] F. Pahde et al., "Cross-modal Hallucination for Few-shot Fine-grained Recognition," CVPR Workshop, 2018

ℒ_tcGAN(G, D) = E_{I,T}[log D(I, T)] + E_{z,T}[log(1 − D(G(z, T), T))]

ℒ(D) = ℒ_tcGAN(G, D) + E[P(c | I)]
ℒ(G) = ℒ_tcGAN(G, D) − E[P(c | G(z, T))]

Select the top-scored generated images as scored by D*.

T: text embedding; I: image embedding; c: class label

Page 42

Remarks

• The aforementioned hallucination approaches (Attribute-Guided Augmentation (AGA) [6] and Cross-Modal Hallucination (CMH) [7]) leveraged datasets with expensive annotations

• Furthermore, the modes of intra-class variation of generated images or features come from fixed pre-specified rules (AGA: augmenters with fixed target attribute values; CMH: pre-specified instance-level textual descriptions)

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017
[7] F. Pahde et al., "Cross-modal Hallucination for Few-shot Fine-grained Recognition," CVPR Workshop, 2018

Page 43

Data Augmentation GAN [8] (1/2)

• Idea
  • Typical data augmentation techniques use a very limited set of a priori known invariances (e.g., translations, rotations, flips, addition of Gaussian noise, etc.) that are easy to invoke
  • We can learn a model of a larger invariance space by training a conditional GAN in the source domain (𝒞base) and applying it to the target domain (𝒞novel)

Hallucination Approaches (learning to augment)

[8] A. Antoniou et al., "Data Augmentation Generative Adversarial Networks," ICLR Workshop, 2018

(Figure: source domain 𝒞base → target domain 𝒞novel.)

Page 44

Data Augmentation GAN [8] (2/2)

• Data Augmentation GAN

Hallucination Approaches (learning to augment)

[8] A. Antoniou et al., "Data Augmentation Generative Adversarial Networks," ICLR Workshop, 2018

(Left) Generator: r_i = Enc(x_i), z_i ~ N(0, I), x_g = Dec(z_i, r_i)

(Right) Discriminator: D(x_i, x_j) — real pair; D(x_i, x_g) — fake pair

Why not just x_j and x_g? To prevent the generator from simply autoencoding the original image x_i, and thereby improve diversity.
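The pairing logic above can be sketched as follows (enc, dec, and disc are hypothetical modules standing in for the DAGAN encoder, decoder, and discriminator):

import torch

def dagan_forward(x_i, x_j, enc, dec, disc):
    r_i = enc(x_i)                       # lower-dimensional representation of the source image
    z_i = torch.randn_like(r_i)          # z_i ~ N(0, I)
    x_g = dec(z_i, r_i)                  # generated (augmented) image
    d_real = disc(x_i, x_j)              # real pair: two genuine images of the same class
    d_fake = disc(x_i, x_g)              # fake pair: source image plus generated image
    return x_g, d_real, d_fake           # feed into a standard adversarial loss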

Page 45

Hallucination by Analogy [2] (1/2)

• Idea
  • Modern recognition models are trained on large labeled datasets like ImageNet
  • In many realistic scenarios, the trained model may encounter novel classes that it also needs to recognize, but with very few training examples available
  • To deal with the above challenges faced by recognition systems in the wild, the authors proposed an FSL benchmark with two phases:

Hallucination Approaches (learning to augment)

[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017

H: data hallucinator

Page 46

Hallucination by Analogy [2] (2/2)

• Analogy-based data hallucinator
  • Train H using analogy quadruplets (a_1, a_2, b_1, b_2), where (a_1, a_2) belong to one class, (b_1, b_2) belong to another class, and the analogy a_1 : a_2 :: b_1 : b_2 holds
  • Goal:

Hallucination Approaches (learning to augment)

[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017

H(a_1, a_2, b_1) = b̂_2, where (a_1, a_2) is sampled from a base class, b_1 is a novel-class sample, and b̂_2 is the hallucinated sample; the training quadruplets (a_1, a_2, b_1, b_2) are collected from the base classes.
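A small sketch of such a hallucinator as an MLP on concatenated features (illustrative; the paper's H may differ in architecture and training details):

import torch
import torch.nn as nn

class AnalogyHallucinator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, a1, a2, b1):
        # "a1 is to a2 as b1 is to ?": predict the missing fourth element of the analogy
        return self.net(torch.cat([a1, a2, b1], dim=-1))

# Training: minimize || H(a1, a2, b1) - b2 || over quadruplets from the base classes;
# at few-shot time, (a1, a2) comes from a base class and b1 is a novel-class sample.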

Page 47

Jointly Trained Hallucinator [5]

• Idea
  • The hallucinated examples should be useful for the classification task; examples that are merely diverse or realistic may still fail to improve FSL performance
  • The authors proposed to train a conditional-GAN-based data hallucinator G(x, z) jointly with the meta-learning module h in an end-to-end manner

Hallucination Approaches (learning to augment)

back-propagation

forward pass

[5] Y.-X. Wang et al., "Low-Shot Learning from Imaginary Data," CVPR, 2018

Page 48

Semantics-Guided Hallucination [9] (1/5)

• Idea
  • Most existing data-level FSL approaches do not explicitly consider the relationships between the semantic concepts of different object categories during hallucination
  • Object categories with similar semantic concepts should share similar data distributions
  • We can incorporate semantic information into the hallucination process, so that the augmented data exhibit semantics-oriented modes of variation

Hallucination Approaches (learning to augment)

[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 49

Semantics-Guided Hallucination [9] (2/5)

• Two-phase learning framework following [2] and [5]

φ: feature extractor; W: base-class classifier; H: hallucinator; h: meta-learning module; V: FSL (base + novel) classifier

Hallucination Approaches (learning to augment)

[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017
[5] Y.-X. Wang et al., "Low-Shot Learning from Imaginary Data," CVPR, 2018
[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 50

Semantics-Guided Hallucination [9] (3/5)

• The semantics-guided hallucinator (right) involves an additional noise generator (an encoder E followed by a sampler) that produces conditioned noise vectors, as if they were sampled from a semantics-dependent distribution

Hallucination Approaches (learning to augment)

R_S: 300-dim word-embedding vectors of label names, 85-dim attribute vectors for the Animals with Attributes dataset, etc.

[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 51

Semantics-Guided Hallucination [9] (4/5)

• Quantitative results: top-5 accuracy on 𝒞^fin

Evaluation on 𝒞_novel^fin
Method        1-shot (std)   2-shot (std)   5-shot (std)
Baseline      28.96 (3.86)   46.20 (4.17)   60.85 (1.20)
Analogy [2]   38.51 (4.95)   49.76 (2.44)   65.27 (2.31)
Superclass    46.76 (2.56)   57.01 (1.76)   67.86 (1.76)
Meta [5]      47.65 (4.13)   59.65 (2.48)   70.82 (2.04)
Meta* [5]     49.95 (5.15)   58.64 (3.09)   71.15 (1.43)
SGH [9]       55.84 (2.10)   66.28 (2.13)   74.14 (1.40)

Evaluation on 𝒞_novel^fin ∪ 𝒞_base^fin
Method        1-shot (std)   2-shot (std)   5-shot (std)
Baseline      67.36 (1.54)   73.59 (1.55)   80.35 (0.48)
Analogy [2]   71.65 (1.97)   75.83 (0.89)   81.34 (0.92)
Superclass    74.65 (1.04)   78.44 (0.72)   82.30 (0.68)
Meta [5]      70.88 (1.47)   77.07 (1.23)   81.86 (0.85)
Meta* [5]     74.36 (2.02)   77.60 (1.21)   81.67 (0.52)
SGH [9]       76.68 (0.83)   80.18 (0.81)   82.95 (0.54)

Baseline: no hallucination; just repeat the 1, 2, or 5 samples per novel class at hand
Analogy: analogy-based hallucination proposed in [2]
Superclass: considers superclass information using the analogy-based hallucination proposed in [2]
Meta: GAN-based hallucinator trained by meta-learning, proposed in [5]
Meta*: almost the same as Meta, with one more dense layer
SGH: semantics-guided hallucination using embedding vectors of label names, proposed in [9]

Hallucination Approaches (learning to augment)

[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017
[5] Y.-X. Wang et al., "Low-Shot Learning from Imaginary Data," CVPR, 2018
[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 52

Semantics-Guided Hallucination [9] (5/5)

• Qualitative results: visualization of hallucinated features

Our model can generate additional data that possess more semantics-dependent modes of variation: trees (pine and willow) are distributed along a line, while medium-sized mammals (porcupine and raccoon) are distributed in a region that is more like an ellipse.

Hallucination Approaches (learning to augment)

[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 53

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 54

Meta Learning

• A learning-to-learn approach
  • "Aims at training a second network (meta-learner, F) to extract data from a base network (learner or classifier, f), so that classification at a meta level can be performed" [10]
  • "Meta-learning suggests framing the learning problem at two levels. The first is quick acquisition of knowledge within each separate task presented. This process is guided by the second, which involves slower extraction of information learned across all the tasks." [11]

Meta-Learning Approaches

[10] W.-H. Chu et al., "Learning Semantics-Guided Visual Attention for Few-shot Image Classification," ICIP, 2018
[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

The meta-learner F maps a training set to a learner: f = F(D_train); the learner f then makes predictions f(x̂) for x̂ ∈ D_test.

Page 55

Episode

• An episode is an N-way k-shot task consisting of a support set S = {(x_i, y_i)}_{i=1}^{Nk} and a query set Q = {x̂_j}
• Further "training" is needed within each episode
• Episode-based training makes training more faithful to the test environment

Meta-Learning Approaches

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: training episodes (tasks) and testing episodes (tasks) are built from disjoint sets of classes; each episode has a support set S with N = 5 and k = 1, and a query set Q whose examples each belong to one of the N classes in S; a different set of N classes is used across episodes.)
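A minimal sketch of how such an N-way k-shot episode can be sampled from a labeled dataset (pure NumPy; the value of n_query is illustrative):

import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, n_query=15, seed=None):
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for episode_label, c in enumerate(classes):          # labels are re-indexed 0..N-1 within the episode
        idx = rng.permutation(np.where(labels == c)[0])
        support += [(features[i], episode_label) for i in idx[:k_shot]]
        query += [(features[i], episode_label) for i in idx[k_shot:k_shot + n_query]]
    return support, query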

Page 56

Episode

• An episode is an N-way k-shot task consisting of a support set S = {(x_i, y_i)}_{i=1}^{Nk} and a query set Q = {x̂_j}
• Further "training" is needed within each episode
• Episode-based training makes training more faithful to the test environment

Meta-Learning Approaches

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: same episode setup as on the previous slide.)

In some literature, the training episodes are called the meta-training set and the testing episodes the meta-testing set; within each episode, the support set is called the training set and the query set the testing set.

Page 57

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 58

Optimization as a Model [11] (1/2)

• Idea
  • Updating model parameters by gradient descent ≈ updating the cell state in an LSTM:
    • Model parameter θ_t ↔ cell state c_t
    • Learning rate α_t ↔ input gate i_t (allows different learning rates for different dimensions)
    • (New) shrinking rate to forget the previous parameter θ_{t−1} ↔ forget gate f_t
    • Gradient ∇_{θ_{t−1}}ℒ_t ↔ minus the input activation, −c̃_t
  • Thus, the authors proposed training a meta-learner LSTM to learn an update rule for training a learner neural network (e.g., an image classifier)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

θ_t = θ_{t−1} − α_t ∇_{θ_{t−1}}ℒ_t
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
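A toy, coordinate-wise version of this analogy (a sketch only; the gates are what the meta-learner LSTM learns, and the candidate cell state is the negative gradient):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def meta_learner_step(theta_prev, grad, loss, W_i, b_i, W_f, b_f, i_prev, f_prev):
    i_t = sigmoid(W_i @ np.array([grad, loss, theta_prev, i_prev]) + b_i)  # learned "learning rate"
    f_t = sigmoid(W_f @ np.array([grad, loss, theta_prev, f_prev]) + b_f)  # learned shrinking rate
    theta_t = f_t * theta_prev - i_t * grad        # cell-state update with candidate -grad
    return theta_t, i_t, f_t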

Page 59

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

Page 60

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

At each time step, the meta-learner (LSTM) proposes new parameters θ_t for the learner M by (with ∇_t ≡ ∇_{θ_{t−1}}ℒ_t):

i_t = σ(W_I · [∇_t, ℒ_t, θ_{t−1}, i_{t−1}] + b_I)
f_t = σ(W_F · [∇_t, ℒ_t, θ_{t−1}, f_{t−1}] + b_F)
θ_t = f_t ⊙ θ_{t−1} − i_t ⊙ ∇_t

Page 61

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

At each time step, the meta-learner (LSTM) proposes new parameters θ_t for the learner M by (with ∇_t ≡ ∇_{θ_{t−1}}ℒ_t):

i_t = σ(W_I · [∇_t, ℒ_t, θ_{t−1}, i_{t−1}] + b_I)
f_t = σ(W_F · [∇_t, ℒ_t, θ_{t−1}, f_{t−1}] + b_F)
θ_t = f_t ⊙ θ_{t−1} − i_t ⊙ ∇_t

[Note] The LSTM parameters (W_I, b_I, W_F, b_F) are not updated by (∇_t, ℒ_t).

Page 62

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

At each time step, the meta-learner (LSTM) proposes new parameters θ_t for the learner M by (with ∇_t ≡ ∇_{θ_{t−1}}ℒ_t):

i_t = σ(W_I · [∇_t, ℒ_t, θ_{t−1}, i_{t−1}] + b_I)
f_t = σ(W_F · [∇_t, ℒ_t, θ_{t−1}, f_{t−1}] + b_F)
θ_t = f_t ⊙ θ_{t−1} − i_t ⊙ ∇_t

[Note] The LSTM parameters (W_I, b_I, W_F, b_F) are not updated by (∇_t, ℒ_t).

After T steps, the LSTM is updated based on the loss computed with the final parameters of the learner (θ_{T+1}) using (X, Y) ∈ Q.

Trainable LSTM parameters: W_I, b_I, W_F, b_F, and also θ_0.

Page 63

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

• Given a testing task (S′, Q′), just use the trained LSTM and S′ to update the learner M, obtain θ_{T+1}, and predict the labels of X ∈ Q′

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

At each time step, the meta-learner (LSTM) proposes new parameters θ_t for the learner M by (with ∇_t ≡ ∇_{θ_{t−1}}ℒ_t):

i_t = σ(W_I · [∇_t, ℒ_t, θ_{t−1}, i_{t−1}] + b_I)
f_t = σ(W_F · [∇_t, ℒ_t, θ_{t−1}, f_{t−1}] + b_F)
θ_t = f_t ⊙ θ_{t−1} − i_t ⊙ ∇_t

[Note] The LSTM parameters (W_I, b_I, W_F, b_F) are not updated by (∇_t, ℒ_t).

After T steps, the LSTM is updated based on the loss computed with the final parameters of the learner (θ_{T+1}) using (X, Y) ∈ Q.

Trainable LSTM parameters: W_I, b_I, W_F, b_F, and also θ_0.

Page 64

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

Page 65

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

The model adapts to task 𝒯_i using a single gradient update based on S (few-shot).

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

Page 66

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

The model adapts to task 𝒯_i using a single gradient update based on S (few-shot).
The meta-objective function is defined across multiple tasks (evaluated on Q, which has larger amounts of data).

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

Page 67

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

The model adapts to task 𝒯_i using a single gradient update based on S (few-shot).
The meta-objective function is defined across multiple tasks (evaluated on Q, which has larger amounts of data).
θ is then updated by gradient descent based on the meta-objective.

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

Page 68

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

MAML optimizes the model parameters θ such that they can quickly adapt to new tasks:
the model adapts to task 𝒯_i using a single gradient update based on S (few-shot);
the meta-objective function is defined across multiple tasks (evaluated on Q, which has larger amounts of data);
θ is then updated by gradient descent based on the meta-objective.
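The two nested updates can be sketched for a tiny linear regressor as follows (a simplified one-inner-step version for a single batch of tasks; not the authors' implementation):

import torch

def maml_outer_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    # theta: parameter tensor with requires_grad=True; tasks: list of (xs, ys, xq, yq) tensors
    meta_loss = 0.0
    for xs, ys, xq, yq in tasks:
        inner_loss = ((xs @ theta - ys) ** 2).mean()                 # loss on the support set S
        grad, = torch.autograd.grad(inner_loss, theta, create_graph=True)
        theta_i = theta - inner_lr * grad                            # one-step adapted parameters
        meta_loss = meta_loss + ((xq @ theta_i - yq) ** 2).mean()    # evaluated on the query set Q
    outer_grad, = torch.autograd.grad(meta_loss, theta)
    with torch.no_grad():
        theta -= outer_lr * outer_grad                               # update the shared initialization
    return theta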

Page 69

Model-Agnostic Fast Adaptation [12] (2/3)

• Sinusoidal regression: y = A · sin(x + B)
  • Training tasks: random sinusoidal functions with A ∈ [0.1, 5.0] and B ∈ [0, π], each with 10 observed points x sampled uniformly from [−5.0, 5.0]
  • Testing tasks: the ground-truth functions shown in the plots (curves) with K = 5 or 10 observed points (dots)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

(Figure legend: model trained on all training tasks with many steps of gradient descent; MAML trained with one-step gradient descent.)

Page 70

Model-Agnostic Fast Adaptation [12] (3/3)

• Quantitative results of sinusoidal regression (MSE)

• Remark
  • Does not work well for multimodal tasks (e.g., not only sinusoidal functions, but a mix of sinusoidal, linear, and quadratic functions)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

Page 71

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 72

Metric Learning

• "Many non-parametric models allow novel examples to be rapidly assimilated, whilst not suffering from catastrophic forgetting. Some models in this family (e.g., nearest neighbors) do not require any training but performance depends on the chosen metric" [3]
  • Euclidean distance between two feature vectors x and y: (x − y)^T (x − y)
  • Mahalanobis distance between two feature vectors x and y of the same distribution with covariance matrix Σ: (x − y)^T Σ^{−1} (x − y)
  • Learning a metric is effectively learning a linear transformation A of the input space, such that d(x, y) = (x − y)^T A^T A (x − y) = (Ax − Ay)^T (Ax − Ay), on which kNN performs well

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016
[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017
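A quick numerical check of the identity above (the metric induced by A^T A equals the squared Euclidean distance after applying A):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x, y = rng.standard_normal(3), rng.standard_normal(3)
d_metric = (x - y) @ A.T @ A @ (x - y)        # (x - y)^T A^T A (x - y)
d_euclid = np.sum((A @ x - A @ y) ** 2)       # ||Ax - Ay||^2
assert np.isclose(d_metric, d_euclid)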

Page 73

Siamese Networks [1] (1/3)

• Idea
  • Networks that do well at verification (identifying whether an input pair belongs to the same class or to different classes) should generalize to one-shot classification
  • Thus, the authors proposed to train a verification network and apply it to one-shot learning tasks during testing

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

Page 74

• Siamese network: twin networks and a common final layer

Siamese Networks [1] (2/3)

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

Page 75

Siamese Networks [1] (2/3)

• Siamese network: twin networks and a common final layer

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

The twin networks (with shared weights) extract features h_{1,L−1} and h_{2,L−1}.

Page 76

Siamese Networks [1] (2/3)

• Siamese network: twin networks and a common final layer

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

The twin networks (with shared weights) extract features h_{1,L−1} and h_{2,L−1}; the final layer computes the component-wise L1 distance between these highest-level feature representations, followed by a sigmoid with trainable weighting parameters.

Page 77

Siamese Networks [1] (2/3)

• Siamese network: twin networks and a common final layer

• Training: supervised learning with a regularized (negative) cross-entropy objective; the target is 1 for positive pairs and 0 for negative pairs

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

The twin networks (with shared weights) extract features h_{1,L−1} and h_{2,L−1}; the final layer computes the component-wise L1 distance between these highest-level feature representations, followed by a sigmoid with trainable weighting parameters.

Page 78

Siamese Networks [1] (3/3)

• Siamese network: twin networks and a common final layer

• Testing: given a testing task (S, Q) with support set S = {(x_i, y_i)}_{i=1}^{k}, predict the label of the query example x̂ ∈ Q as y_m, where x_m is the support example with the highest similarity score: m = argmax_i p(x_i, x̂)

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

p(x_i, x̂) = σ(Σ_j α_j |h_{i,L−1}^{(j)} − ĥ_{L−1}^{(j)}|), where h_{i,L−1} is the feature of the support example x_i ∈ S = {(x_i, y_i)}_{i=1}^{k} and ĥ_{L−1} is the feature of the query example x̂.
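Test-time prediction with the trained verification network can then be sketched as follows (illustrative tensor shapes; alpha stands for the learned final-layer weights):

import torch

def predict_one_shot(query_feat, support_feats, support_labels, alpha):
    # query_feat: (D,); support_feats: (k, D); alpha: (D,)
    scores = torch.sigmoid(((support_feats - query_feat).abs() * alpha).sum(dim=1))
    return support_labels[scores.argmax()]    # the most similar support example gives the label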

Page 79

Prototypical Networks [13]

• Idea
  • Assumption: there exists an embedding space in which features belonging to the same class cluster together around a single prototype for that class
  • Thus, the authors proposed to learn a non-linear mapping f_φ and take the mean representation c_k of each class as its prototype in the embedding space:

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017

(Figure: the support set S = {(x_i, y_i)}_{i=1}^{k} is mapped through f_φ into the embedding space.)

Page 80

Prototypical Networks [13]

• Idea
  • Assumption: there exists an embedding space in which features belonging to the same class cluster together around a single prototype for that class
  • Thus, the authors proposed to learn a non-linear mapping f_φ and take the mean representation c_k of each class as its prototype in the embedding space:

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017

c_k = (1 / |S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i), where S_k ⊂ S is the subset of the support set S with class k

(Figure: class prototypes c_1, c_2, c_3 in the embedding space; similar to the centers in center-loss methods.)

Page 81: VISION & LEARNING LABvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w16.pdfOutline for Few-Shot Learning • Introduction • Hallucination Approaches (learning to augment) •

Prototypical Networks [13]

• Idea
  • Assumption: there exists an embedding space in which features of the same class cluster together around a single prototype for that class
  • Thus, the authors proposed to learn a non-linear mapping $f_\phi$ and take the mean representation $\mathbf{c}_k$ of each class as its prototype in the embedding space:

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017

Classification of the query example $\hat{x}$: $p_\phi(y = k \mid \hat{x}) = \frac{\exp(-d(f_\phi(\hat{x}), \mathbf{c}_k))}{\sum_{k'} \exp(-d(f_\phi(\hat{x}), \mathbf{c}_{k'}))}$, where $d(\cdot, \cdot)$ is the Euclidean distance

[Figure: the query example $\hat{x}$ is embedded by $f_\phi$ and assigned to the nearest prototype among $\mathbf{c}_1$, $\mathbf{c}_2$, $\mathbf{c}_3$ (similar to clustering)]
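A minimal PyTorch sketch of the two steps above, mean prototypes followed by a softmax over negative Euclidean distances; the embedding network $f_\phi$ is omitted, random embeddings stand in for its outputs, and the shapes and helper name prototypical_predict are assumptions.

```python
import torch
import torch.nn.functional as F

def prototypical_predict(support_emb, support_labels, query_emb, num_classes):
    """Prototypical Networks classification (minimal sketch).

    support_emb   : (n_support, d) embedded support features f_phi(x_i)
    support_labels: (n_support,)   class indices in [0, num_classes)
    query_emb     : (n_query, d)   embedded query features f_phi(x_hat)
    """
    # prototype c_k = mean embedding of the support examples of class k
    prototypes = torch.stack([support_emb[support_labels == k].mean(dim=0)
                              for k in range(num_classes)])   # (num_classes, d)
    # Euclidean distance between each query embedding and each prototype
    dists = torch.cdist(query_emb, prototypes)                 # (n_query, num_classes)
    # class posterior: softmax over the negative distances
    return F.softmax(-dists, dim=1)

# toy usage: a 3-way 5-shot episode with random embeddings (f_phi omitted)
emb = torch.randn(15, 64)
labels = torch.arange(3).repeat_interleave(5)
queries = torch.randn(4, 64)
print(prototypical_predict(emb, labels, queries, num_classes=3).shape)  # (4, 3)
```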

Page 82

Matching Networks [3] (1/3)

• Idea
  • Inspired by recent advances in attention mechanisms, which access an augmented memory containing information useful for the task at hand
  • Thus, the authors proposed a weighted nearest-neighbor classifier that uses an attention mechanism over a learned embedding of the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ to predict the label of the query example $\hat{x}$ as:

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016

$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$, with $a(\hat{x}, x_i) = \frac{\exp(c(f(\hat{x}), g(x_i)))}{\sum_{j=1}^{k} \exp(c(f(\hat{x}), g(x_j)))}$ and $c(\cdot, \cdot)$ the cosine similarity

[Figure: attention over the embedded support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ given the query example $\hat{x}$]
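A minimal PyTorch sketch of this attention-weighted nearest-neighbor rule in its simple form $g = f$ (discussed on the next slide); random embeddings replace the learned embedding functions, and the helper name matching_predict is an assumption.

```python
import torch
import torch.nn.functional as F

def matching_predict(support_emb, support_onehot, query_emb):
    """Matching Networks prediction (minimal sketch, simple form g = f).

    support_emb   : (k, d) embedded support examples g(x_i)
    support_onehot: (k, C) one-hot labels y_i
    query_emb     : (d,)   embedded query example f(x_hat)
    """
    # attention kernel: softmax over the cosine similarities c(f(x_hat), g(x_i))
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_emb, dim=1)  # (k,)
    attention = F.softmax(sims, dim=0)                                      # (k,)
    # weighted nearest neighbor: y_hat = sum_i a(x_hat, x_i) * y_i
    return attention @ support_onehot                                       # (C,) class probabilities

# toy 5-way 1-shot episode with random embeddings
support_emb = torch.randn(5, 64)
support_onehot = torch.eye(5)
query_emb = torch.randn(64)
print(matching_predict(support_emb, support_onehot, query_emb))
```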

Page 83

Matching Networks [3] (2/3)

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015
[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017
[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016

• Simple form: $g = f$
  • Similar to the Siamese network [1]
  • Also similar to the prototypical network [13] in the one-shot setting

[Figure: embedded support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ and query example $\hat{x}$]

Page 84

Matching Networks [3] (3/3)

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016

• Full context embedding (FCE)
  • Each element in $S$ should not be embedded independently of the other elements
  • $g(x_i) \rightarrow g(x_i, S)$: a bidirectional LSTM that treats the whole support set $S$ as a sequence
  • Also, $S$ should be able to modify how we embed $\hat{x}$
  • $f(\hat{x}) \rightarrow f(\hat{x}, S)$: an LSTM with read-attention over $g(S)$, i.e., $\mathrm{attLSTM}(f'(\hat{x}), g(S), K)$, where $f'(\hat{x})$ is the (fixed) CNN feature and $K$ is the number of unrolling steps (a minimal sketch of the bidirectional-LSTM part follows below)

• Experiment results on miniImageNet

[Figure: the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ is embedded with a bidirectional LSTM, and the query example $\hat{x}$ with an attLSTM]
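A minimal PyTorch sketch of the bidirectional-LSTM part of FCE, producing $g(x_i, S)$ as the per-example feature $g'(x_i)$ plus the forward and backward LSTM outputs over the support set; the feature size and random support features are assumptions, and the attLSTM read-attention over $g(S)$ is omitted.

```python
import torch
import torch.nn as nn

# bidirectional LSTM over the support set treated as a sequence (sizes are illustrative assumptions)
feat_dim = 64
bilstm = nn.LSTM(input_size=feat_dim, hidden_size=feat_dim, bidirectional=True, batch_first=True)

def full_context_embedding(support_feats):
    """support_feats: (k, feat_dim) per-example features g'(x_i) of the whole support set S."""
    out, _ = bilstm(support_feats.unsqueeze(0))           # (1, k, 2 * feat_dim)
    fwd, bwd = out[0, :, :feat_dim], out[0, :, feat_dim:]  # forward / backward hidden states
    return support_feats + fwd + bwd                       # g(x_i, S): each element conditioned on all of S

print(full_context_embedding(torch.randn(5, feat_dim)).shape)   # (5, 64)
```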

Page 85

Relation Networks [14]

• Idea
  • Other metric-learning approaches focus on learning a transferable embedding function with a fixed metric (e.g., Euclidean distance, cosine similarity, …)
  • The authors proposed to train a Relation Network (RN) to explicitly learn a transferable deep distance metric that compares the relation between images

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[14] F. Sung et al., "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018

[Figure: the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ and the query example $\hat{x}$ pass through an embedding module and a relation module; features are concatenated and a relation score is computed]

Page 86

Relation Networks [14]

• Idea
  • Other metric-learning approaches focus on learning a transferable embedding function with a fixed metric (e.g., Euclidean distance, cosine similarity, …)
  • The authors proposed to train a Relation Network (RN) to explicitly learn a transferable deep distance metric that compares the relation between images

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[14] F. Sung et al., "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018

[Figure: the embedding module maps $x_i \rightarrow f_\varphi(x_i)$ and $\hat{x} \rightarrow f_\varphi(\hat{x})$ for the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ and the query example $\hat{x}$; features are concatenated and a relation score is computed by the relation module]

Page 87

Relation Networks [14]

• Idea
  • Other metric-learning approaches focus on learning a transferable embedding function with a fixed metric (e.g., Euclidean distance, cosine similarity, …)
  • The authors proposed to train a Relation Network (RN) to explicitly learn a transferable deep distance metric that compares the relation between images

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[14] F. Sung et al., "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018

[Figure: the embedded features of the support example $x_i \in S = \{(x_i, y_i)\}_{i=1}^{k}$ and the query example $\hat{x}$ are concatenated, $\mathcal{C}(f_\varphi(x_i), f_\varphi(\hat{x}))$, before entering the relation module]

Page 88

Relation Networks [14]

• Idea
  • Other metric-learning approaches focus on learning a transferable embedding function with a fixed metric (e.g., Euclidean distance, cosine similarity, …)
  • The authors proposed to train a Relation Network (RN) to explicitly learn a transferable deep distance metric that compares the relation between images

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[14] F. Sung et al., "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018

[Figure: the relation module $g_\phi$ computes the relation score $r = g_\phi(\mathcal{C}(f_\varphi(x_i), f_\varphi(\hat{x})))$ from the concatenated embeddings of the support example $x_i \in S = \{(x_i, y_i)\}_{i=1}^{k}$ and the query example $\hat{x}$]
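A minimal PyTorch sketch of the relation-score computation $r = g_\phi(\mathcal{C}(f_\varphi(x_i), f_\varphi(\hat{x})))$: the concatenation operator $\mathcal{C}$ followed by a small learned relation module. The two-layer MLP, its sizes, and the random embeddings standing in for $f_\varphi$ are illustrative assumptions (the paper's modules are convolutional).

```python
import torch
import torch.nn as nn

# relation module g_phi: a learned deep distance metric producing a score in [0, 1]
# (sizes and the MLP form are illustrative assumptions; f_phi is omitted)
feat_dim = 64
g_phi = nn.Sequential(
    nn.Linear(2 * feat_dim, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid())

def relation_score(support_feat, query_feat):
    """r = g_phi( C( f_phi(x_i), f_phi(x_hat) ) ), with C = feature concatenation."""
    pair = torch.cat([support_feat, query_feat], dim=-1)   # operator C: concatenate the two embeddings
    return g_phi(pair)

# toy usage with random embeddings in place of f_phi outputs
support_feat = torch.randn(feat_dim)
query_feat = torch.randn(feat_dim)
print(relation_score(support_feat, query_feat))   # relation score between x_i and x_hat
```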

Page 89

Remarks

• Some works can be extended to zero-shot learning: the support set contains a semantic embedding vector ($\mathbf{v}_k$) for each of the training classes, instead of few-shot images

• In this case, we can use a second heterogeneous embedding function to embed the semantic embedding vectors

• Prototypical Networks: $\mathbf{c}_k = g_\vartheta(\mathbf{v}_k)$, i.e., the prototype of class $k$ is computed from its semantic embedding vector instead of the mean of the image embeddings $f_\phi(\mathbf{x}_i)$

• Relation Networks: $r = g_\phi(\mathcal{C}(f_\varphi(x_i), f_\varphi(\hat{x}))) \;\rightarrow\; r = g_\phi(\mathcal{C}(f_{\varphi_2}(\mathbf{v}_k), f_{\varphi_1}(\hat{x})))$

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017

[Figure: a second embedding function $g_\vartheta$ maps the semantic vectors $\mathbf{v}_k$ to prototypes $\mathbf{c}_k = g_\vartheta(\mathbf{v}_k)$, which are compared with the image embeddings $f_\phi(\mathbf{x}_i)$]
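A minimal PyTorch sketch of the zero-shot variant of Prototypical Networks described above: prototypes come from an embedding of the class semantic vectors rather than from support images. The linear embedding for $g_\vartheta$, the dimensions, and the random semantic vectors are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

sem_dim, feat_dim, num_classes = 300, 64, 5
g_theta = nn.Linear(sem_dim, feat_dim)        # second, heterogeneous embedding function g_theta

v = torch.randn(num_classes, sem_dim)         # semantic embedding vectors v_k (e.g., word vectors)
prototypes = g_theta(v)                       # c_k = g_theta(v_k): one prototype per class
query_emb = torch.randn(3, feat_dim)          # image embeddings f_phi(x_hat) of three query examples

dists = torch.cdist(query_emb, prototypes)    # Euclidean distance to each class prototype
print(F.softmax(-dists, dim=1))               # zero-shot class posteriors
```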

Page 90

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantic-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Meta Networks
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 91

A Closer Look at FSL [15]

• Idea
  • The authors presented a consistent comparative analysis of several representative FSL methods and found that
    • Deeper backbones significantly reduce the performance gap across methods when domain differences are limited
    • A slightly modified baseline method can surprisingly achieve competitive performance (a minimal adaptation sketch follows below)
    • As the domain difference grows larger, adaptation based on a few novel-class instances becomes more important, and the simple baseline outperforms representative FSL methods

Meta-Learning Approaches – Recent Topics

[15] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, "A Closer Look at Few-shot Classification," ICLR, 2019
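A minimal PyTorch sketch of the simple baseline's adaptation step: a backbone pre-trained on the base classes is frozen, and only a new linear classifier is fit on the few labelled novel-class examples. The identity stand-in for the backbone, the feature size, the learning rate, and the number of update steps are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_dim, n_way = 512, 5
backbone = nn.Identity()                      # stand-in for a frozen feature extractor pre-trained on base classes
classifier = nn.Linear(feat_dim, n_way)       # the only trainable part at novel-class adaptation time
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

support_x = torch.randn(n_way * 5, feat_dim)              # 5-way 5-shot support set (random stand-ins)
support_y = torch.arange(n_way).repeat_interleave(5)

for _ in range(100):                                       # adapt on the novel support set only
    optimizer.zero_grad()
    loss = loss_fn(classifier(backbone(support_x)), support_y)
    loss.backward()
    optimizer.step()
print(loss.item())
```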

Page 92

Multimodal MAML [16] (1/3)

• Idea
  • MAML seeks a single initialization shared across the entire task distribution (e.g., sinusoidal regression)
  • In multimodal regression problems, different modes (sinusoidal, linear, or quadratic) may require substantially different parameters
  • The authors proposed a multimodal MAML algorithm that identifies the mode of the task and then modulates its meta-learned prior accordingly

Meta-Learning Approaches – Recent Topics

[12] C. Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks," ICML, 2017
[16] R. Vuorio et al., "Toward Multimodal Model-Agnostic Meta-Learning," NIPS Workshop, 2018

[Figure: a task (support set $S$) is sampled; the original MAML [12] adapts to task-specific parameters, while the proposed method additionally modulates the meta-learned prior parameters (the $\theta_i$'s) to fit the task, e.g., by activating or deactivating certain neurons of a FC layer (a minimal modulation sketch follows below)]
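A minimal PyTorch sketch of one way such modulation can be realized: a task embedding (e.g., produced by a task encoder over the support set) is mapped to a sigmoid gate that scales, and can effectively switch off, individual neurons of a meta-learned FC layer. The layer sizes, the gating scheme, and the random task embedding are illustrative assumptions rather than the paper's exact modulation network.

```python
import torch
import torch.nn as nn

hidden_dim, task_emb_dim = 40, 32
fc = nn.Linear(1, hidden_dim)                 # meta-learned prior parameters (one FC layer of the regressor)
gate = nn.Linear(task_emb_dim, hidden_dim)    # task-conditioned modulation network

def modulated_forward(x, task_embedding):
    """Scale each neuron of the FC layer by a task-dependent gate in [0, 1],
    effectively activating or deactivating neurons for the identified task mode."""
    tau = torch.sigmoid(gate(task_embedding))   # (hidden_dim,) gating vector
    return fc(x) * tau                          # modulated pre-activation

x = torch.randn(8, 1)                          # a batch of inputs from the sampled task
task_embedding = torch.randn(task_emb_dim)     # stand-in for a task encoder's output over the support set
print(modulated_forward(x, task_embedding).shape)   # (8, hidden_dim)
```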

Page 93

Multimodal MAML [16] (2/3)

• Multimodal regression (see the task-sampling sketch below)
  • Sinusoidal: $y = A \cdot \sin(\omega \cdot x) + b$
  • Linear: $y = A \cdot x + b$
  • Quadratic: $y = A \cdot (x - c)^2 + b$

Meta-Learning Approaches – Recent Topics

[16] R. Vuorio et al., "Toward Multimodal Model-Agnostic Meta-Learning," NIPS Workshop, 2018

[Figure: regression results compared with mode-specific MAML, without any gradient update and after five steps of gradient updates]
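A minimal NumPy sketch of sampling tasks from such a multimodal regression distribution; the parameter ranges, the shot count, and the helper name sample_task are illustrative assumptions.

```python
import numpy as np

def sample_task(rng, k_shot=10):
    """Sample one few-shot regression task from a multimodal task distribution (toy sketch)."""
    mode = rng.choice(["sinusoidal", "linear", "quadratic"])
    A, b, c = rng.uniform(0.1, 5.0), rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)
    omega = rng.uniform(0.5, 2.0)
    if mode == "sinusoidal":
        f = lambda x: A * np.sin(omega * x) + b      # y = A * sin(w * x) + b
    elif mode == "linear":
        f = lambda x: A * x + b                      # y = A * x + b
    else:
        f = lambda x: A * (x - c) ** 2 + b           # y = A * (x - c)^2 + b
    x = rng.uniform(-5.0, 5.0, size=(k_shot, 1))     # support inputs of the sampled task
    return mode, x, f(x)

rng = np.random.default_rng(0)
mode, x, y = sample_task(rng)
print(mode, x.shape, y.shape)
```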

Page 94

Multimodal MAML [16] (3/3)

• Quantitative results (MSE)

• t-SNE visualization of task embeddings

Meta-Learning Approaches – Recent Topics

[16] R. Vuorio et al., "Toward Multimodal Model-Agnostic Meta-Learning," NIPS Workshop, 2018

[Table/Figure: comparison against a mode-specific baseline and across different modulation schemes]

Page 95

References
[1] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015
[2] B. Hariharan and R. Girshick, "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017
[3] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra, "Matching Networks for One Shot Learning," NIPS, 2016
[4] X. Wang and Y. Ye, "Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs," CVPR, 2018
[5] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan, "Low-Shot Learning from Imaginary Data," CVPR, 2018
[6] M. Dixit, R. Kwitt, M. Niethammer, and N. Vasconcelos, "AGA: Attribute-Guided Augmentation," CVPR, 2017
[7] F. Pahde, P. Jahnichen, T. Klein, and M. Nabi, "Cross-modal Hallucination for Few-shot Fine-grained Recognition," CVPR Workshop, 2018
[8] A. Antoniou, A. Storkey, and H. Edwards, "Data Augmentation Generative Adversarial Networks," ICLR Workshop, 2018
[9] C.-C. Lin, Y.-C. F. Wang, C.-L. Lei, and K.-T. Chen, "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019
[10] W.-H. Chu and Y.-C. F. Wang, "Learning Semantics-Guided Visual Attention for Few-shot Image Classification," ICIP, 2018
[11] S. Ravi and H. Larochelle, "Optimization as a Model for Few-Shot Learning," ICLR, 2017
[12] C. Finn, P. Abbeel, and S. Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks," ICML, 2017
[13] J. Snell, K. Swersky, and R. Zemel, "Prototypical Networks for Few-Shot Learning," NIPS, 2017
[14] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018
[15] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, "A Closer Look at Few-shot Classification," ICLR, 2019
[16] R. Vuorio, S.-H. Sun, H. Hu, and J. J. Lim, "Toward Multimodal Model-Agnostic Meta-Learning," NIPS Workshop, 2018

Page 96

What We’ve Learned This Semester…

• Outline
  • ML101, Image Representation & Recognition
  • Intro to NNs & CNNs
  • CNN for Classification, Detection, & Segmentation
  • Visualization of NN/CNN
  • Generative Adversarial Networks
  • Transfer Learning & Representation Disentanglement
  • Recurrent Neural Net & Its Applications
  • Learning Beyond Images / Learning from Audio-Visual Data
  • Few-Shot Learning

• Remarks
