
Page 1

Deep Learning for Computer Vision, Spring 2019

http://vllab.ee.ntu.edu.tw/dlcv.html (primary)

https://ceiba.ntu.edu.tw/1072CommE5052 (grade, etc.)

FB: DLCV Spring 2019

Yu-Chiang Frank Wang 王鈺強, Associate Professor

Dept. Electrical Engineering, National Taiwan University

2019/06/12

Page 2

What's to be Covered …

• Learning Beyond Images (Part II)
  • Audio-Visual Event Localization
  • Spatial Audio Generation
  • Decomposing Sounds of Visual Objects
• Few-Shot Learning
  • Slides by Chia-Ching Lin ([email protected])
• About Final Presentation
  • Date/time: 6/25 Tue 1:30pm-5pm

2

Page 3

What's to be Covered …

• Learning Beyond Images (Part II)
  • Audio-Visual Event Localization
  • Spatial Audio Generation
  • Decomposing Sounds of Visual Objects
• Few-Shot Learning
  • Slides by Chia-Ching Lin
• About Final Presentation
  • Date/time: 6/25 Tue 1:30pm-5pm

3

Page 4

Decomposing Sounds of Visual Objects

• Goal
  • Separating mixed sounds into individual sounds corresponding to the associated objects
  • Can be done in a supervised or (preferably) unsupervised way

• References
  • The Sound of Pixels, ECCV 2018
  • The Sound of Motions, arXiv
  • Co-Separating Sounds of Visual Objects, arXiv

4

Page 5

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018

5

Training pipeline: visual inputs + mixed audio sources. No ground-truth separated audio is available; the audio track from each respective video is used as the training target.
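As a rough illustration of this training setup, the Python sketch below mixes the audio of two videos and supervises a hypothetical mask-predicting network sep_net with each video's own spectrogram; it only illustrates the objective, not the paper's actual architecture.

import torch
import torch.nn.functional as F

def mix_and_separate_step(video_feats, audio_specs, sep_net):
    # video_feats: (2, D) visual features; audio_specs: (2, F, T) magnitude spectrograms.
    mix_spec = audio_specs.sum(dim=0, keepdim=True)        # the artificially mixed "recording"
    loss = 0.0
    for i in range(2):
        mask = sep_net(video_feats[i], mix_spec)           # predicted separation mask for video i
        est_spec = mask * mix_spec                         # separated spectrogram for video i
        loss = loss + F.l1_loss(est_spec, audio_specs[i:i + 1])  # each video's own audio is the target
    return loss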

Page 6

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018

6

Page 7

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018

7

(Figure: evaluation with a mixed audio input.)

Page 8

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Evaluation on the MUSIC dataset
  • Project page

8

Page 9

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Evaluation: NSDR, SIR, and SAR (ground-truth and predicted sources are needed)
    • NSDR (Normalized Signal-to-Distortion Ratio)
    • SIR (Signal-to-Interference Ratio)
    • SAR (Signal-to-Artifact Ratio)
  • Higher is better for all three metrics, i.e., less residual distortion, interference, and artifacts

9
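A minimal way to obtain such separation metrics is the mir_eval toolbox (one common choice; the paper's exact evaluation code may differ). NSDR is then typically reported as the SDR of the estimate minus the SDR obtained when the mixture itself is used as the estimate.

import numpy as np
import mir_eval  # pip install mir_eval

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))                   # ground-truth sources (2 sources, 1 s @ 16 kHz)
estimate = reference + 0.1 * rng.standard_normal((2, 16000))  # separated sources from a model (dummy here)
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimate)
print(sdr.mean(), sir.mean(), sar.mean())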

Page 10

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Evaluation

10

Page 11

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Evaluation

11

For each pixel in a video frame, they take vectorized log spectrogram magnitudes, and project them onto 3D RGB space using PCA for visualization purposes.

Page 12

Decomposing Sounds of Visual Objects

• The Sound of Pixels, ECCV 2018
  • Demo video

12

Page 13

Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Network architecture
  • Fusion module

13

Appearance features are used to compute attention weights, which are then applied to the trajectory features (from optical flow). The first frame is used as the appearance input.

Page 14

Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Network architecture
  • Fusion module
    • Temporal aligning: f_v and f_s are the visual and sound features; γ and β are simple linear layers

14

Page 15

Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Evaluation: SDR, SIR, and SAR (ground-truth and predicted sources are needed)
  • Dataset: MUSIC-21 (an extension of MUSIC)

15

Experiments on mixtures of 2 videos; experiments on mixtures of 3 and 4 videos.

Page 16

Decomposing Sounds of Visual Objects

• The Sound of Motions (Zhao, Gan, Ma, & Torralba), arXiv
  • Evaluation

16

They project sound features into a low dimensional space, and visualize them in RGB space.

Page 17

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Disentangles sounds in realistic videos, even in cases where an object was not observed individually during training

17

Individual instrument

Page 18

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Network architecture
  • Works with a pre-trained object detector (Faster R-CNN)

18

Training pipeline: visual inputs + mixed audio sources (following The Sound of Pixels, ECCV 2018).

Page 19

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Network architecture
  • Audio-Visual Separator (for separating the sound of each object)

19

The same module as in 2.5D Visual Sound, CVPR 2019 (by the same authors).

Page 20

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Network architecture

20

Page 21

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Training loss
    • Co-separation loss: the distance between the predicted and ground-truth spectrograms
    • Object-consistency loss: each separated spectrogram should be consistent with the category of the visual object it corresponds to (ResNet-18 audio classifier)

21

|V1| and |V2| are the numbers of objects detected in the two videos.
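One plausible reading of these two terms, written as a PyTorch sketch with illustrative shapes and names (the separated spectrograms of the objects detected in a video are summed and compared against that video's own spectrogram, while an audio classifier enforces object consistency):

import torch
import torch.nn.functional as F

def co_separation_losses(pred_specs_v1, pred_specs_v2, gt_spec_v1, gt_spec_v2,
                         audio_logits, object_labels, lam=0.05):
    # pred_specs_v*: (|V*|, F, T) separated spectrograms, one per detected object in each video
    recon_v1 = pred_specs_v1.sum(dim=0)          # objects of video 1 should add back up to its spectrogram
    recon_v2 = pred_specs_v2.sum(dim=0)
    l_cosep = F.l1_loss(recon_v1, gt_spec_v1) + F.l1_loss(recon_v2, gt_spec_v2)
    # audio_logits: (|V1|+|V2|, C) predictions of an audio classifier on the separated spectrograms
    l_consist = F.cross_entropy(audio_logits, object_labels)    # match each visual object's category
    return l_cosep + lam * l_consist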

Page 22

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Evaluation: SDR, SIR, and SAR
  • Dataset: MUSIC
  • Project page (demo video only)

22

single-source videos (solo) only; multi-source videos (solo + duet)

Page 23

Decomposing Sounds of Visual Objects

• Co-Separating Sounds of Visual Objects (Gao & Grauman), arXiv
  • Demo video

23

Page 24

What's to be Covered …

• Learning Beyond Images (Part II)
  • Audio-Visual Event Localization
  • Spatial Audio Generation
  • Decomposing Sounds of Visual Objects
• Few-Shot Learning
  • Slides by Chia-Ching Lin ([email protected])
• About Final Presentation
  • Date/time: 6/25 Tue 1:30pm-5pm

24

Page 25

Outline for Few-Shot Learning

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 26

Motivation of FSL

• Training deep neural networks for visual classification typically requires a large amount of labeled training data (e.g., ImageNet)

• Humans, on the other hand, are generally able to learn novel concepts with very little supervision (even one shot per class works)

• Example: which character below is the same as ?

• This motivates few-shot learning (FSL), in which only few samples would be available for all or selected object categories during learning

Introduction

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

Page 27

FSL Scenarios

• All classes have very few examples per class (e.g., Omniglot* involves 1,623 characters, each with 20 examples)

• Involve a set of base classes 𝒞base (possibly with many examples per class) and a set of novel classes 𝒞novel with few examples per class

Introduction

* https://github.com/brendenlake/omniglot
[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017

(Figure: base classes 𝒞base vs. novel classes 𝒞novel.)

Page 28

FSL Approaches

• Two main levels of approaches to tackle few-shot learning*

  • Data-level (make more examples)
    • Exploitation of external data
    • Augmentation
    • Data hallucination
    • etc.

  • Parameter-level (add constraints on the parameters to alleviate overfitting)
    • Regularization
    • Meta learning (including metric learning)
    • etc.

  • "Data augmentation and regularization techniques alleviate overfitting in low data regimes, but do not solve it." [3]

Introduction

* https://medium.com/sap-machine-learning-research/deep-few-shot-learning-a1caa289f18
[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016

Page 29

Works Related to FSL

• Also called low-shot learning or one-shot learning in the literature

• Related to transfer learning (𝒞base ≈ source; 𝒞novel ≈ target), with the following differences:
  • Not necessarily few-shot in the target domain
  • In many transfer learning problems (e.g., domain adaptation), the source and target domains share the same set of classes, whereas 𝒞base and 𝒞novel are disjoint in FSL

Introduction

(Figure: source domain SVHN, target domain MNIST.)

Page 30

Works Related to FSL

• Different from zero-shot learning (𝒞base ≈ seen; 𝒞novel ≈ unseen)
  • Training
    • Seen classes: images and semantic information (e.g., attributes: [zebra-striped, four-legged, deer-like face, …])
    • Unseen classes: only semantic information
  • Testing

Introduction

[4] X. Wang and Y. Ye, "Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs," CVPR, 2018

(Figure: seen classes such as zebra and deer have images, labels, and attribute vectors, e.g., zebra = [1, 1, 0, …] and deer = [0, 1, 1, …]; the unseen class okapi has only its attribute vector [1, 1, 1, …]. At test time, the model predicts attributes from the image, e.g., f(image) = [0.7, 0.9, 0.8, …], and matches them to the unseen class okapi.)

Page 31

FSL Tasks

• N-class k-shot
  • Only the test example x̂ is presented during testing, with the goal of predicting the correct label of x̂ among all N classes
  • Mainly used in hallucination approaches

• N-way k-shot
  • An additional labeled support set S = {(x_i, y_i)}_{i=1}^{Nk} from N different classes is available for each query example x̂, and the task is to select the most similar image(s) from S for x̂
  • N need not be equal to the total number of classes
  • Mainly used in meta-learning approaches with episode-based training and testing
  • For each episode, further "training" is needed to rapidly adapt to the new concepts in S

Introduction

N-way k-shot example: S = { five support images }, x̂ = a query image, f(x̂, S) = [0.6, 0.1, 0.1, 0.1, 0.1] ∈ ℝ^5
N-class k-shot example: x̂ = a query image, f(x̂) = [0.001, 0.005, …, 0.15, …, 0.001] ∈ ℝ^1623

Page 32

Performance Evaluation in FSL

• Closed-world: evaluates performance on the novel classes 𝒞novel only

• Open-world: evaluates performance on the joint set of classes 𝒞base ∪ 𝒞novel

Introduction

Page 33

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 34

Data Hallucination

• “Many modes of (intra-class) variation (for example camera pose, translation, lighting changes, and even articulation) are shared across categories” [5]

• “As humans, our knowledge of these shared modes of (intra-class) variation may allow us to visualize what a novel object might look like in other poses or surroundings” [5]

• We can thus hallucinate additional examples for novel classes by transferring modes of variation from the base classes

Hallucination Approaches (learning to augment)

[5] Y.-X. Wang et al., "Low-Shot Learning from Imaginary Data," CVPR, 2018

Page 35

Attribute-Guided Augmentation [6] (1/2)

• Idea
  • Images with attribute annotations (e.g., depth) are used to learn multiple mappings to synthesize additional features with desired attribute values (e.g., 1 m → 3 m)
  • The mapping functions are object-agnostic and operate in feature space

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

Page 36

Attribute-Guided Augmentation [6] (1/2)

• Idea
  • Images with attribute annotations (e.g., depth) are used to learn multiple mappings to synthesize additional features with desired attribute values (e.g., 1 m → 3 m)
  • The mapping functions are object-agnostic and operate in feature space

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

Step 1: Train a predictor for each attribute (e.g., depth) in a supervised manner
γ_depth( · ) = 1.2, γ_depth( · ) = 3.5, …

Page 37

Attribute-Guided Augmentation [6] (1/2)

• Idea
  • Images with attribute annotations (e.g., depth) are used to learn multiple mappings to synthesize additional features with desired attribute values (e.g., 1 m → 3 m)
  • The mapping functions are object-agnostic and operate in feature space

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

Step 1: Train a predictor for each attribute (e.g., depth) in a supervised manner
γ_depth( · ) = 1.2, γ_depth( · ) = 3.5, …

Step 2: Train attribute-guided augmenters based on the trained attribute predictor (e.g., γ_depth)
φ_[1,2]^3( · ) = · ,  φ_[3,4]^2( · ) = ·

Page 38

Attribute-Guided Augmentation [6] (1/2)

• Idea
  • Images with attribute annotations (e.g., depth) are used to learn multiple mappings to synthesize additional features with desired attribute values (e.g., 1 m → 3 m)
  • The mapping functions are object-agnostic and operate in feature space

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

Step 1: Train a predictor for each attribute (e.g., depth) in a supervised manner
γ_depth( · ) = 1.2, γ_depth( · ) = 3.5, …

Step 2: Train attribute-guided augmenters based on the trained attribute predictor (e.g., γ_depth)
φ_[1,2]^3( · ) = · ,  φ_[3,4]^2( · ) = ·

Step 3: Apply the trained mapping functions (the φ_i^k's) to augment additional features for the novel classes
φ_[1,2]^3( · ) = · ,  φ_[3,4]^2( · ) = · ,  φ_[1,2]^3( · ) = ·

Page 39

Attribute-Guided Augmentation [6] (2/2)

• Attribute predictors (γ)
  • One for each attribute (e.g., γ_depth and γ_pose), trained in a supervised manner

• Attribute-guided augmenter (φ)
  • For a specific attribute, multiple mapping functions are trained for different original values (divided into I intervals {[l_i, h_i]}_{i=1}^{I}) and target values ({t_k}_{k=1}^{T}), with each φ_i^k implemented similarly to an autoencoder:

• Suppose there are A attributes, I intervals per attribute, and T target values; then we need to train a total of A attribute predictors and roughly A · I · T augmenters

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017

L(x, t_k; φ_i^k) = ‖γ(φ_i^k(x)) − t_k‖² + λ‖φ_i^k(x) − x‖²

The first term is the target-attribute mismatch penalty; the second term is a regularizer that preserves the class of the augmented feature. Here x is a feature extracted by Fast R-CNN (FC7-layer activations).
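A compact sketch of this objective in PyTorch (phi and gamma stand for a trained augmenter φ_i^k and attribute predictor γ; the names and the value of λ are illustrative):

def aga_loss(x, t_k, phi, gamma, lam=1.0):
    # x and t_k are torch tensors; phi and gamma are torch modules
    x_aug = phi(x)                                       # synthesized feature with the desired attribute
    attr_mismatch = (gamma(x_aug) - t_k).pow(2).sum()    # ||gamma(phi(x)) - t_k||^2
    preserve = (x_aug - x).pow(2).sum()                  # ||phi(x) - x||^2 keeps the feature close to x
    return attr_mismatch + lam * preserve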

Page 40

Cross-Modal Hallucination [7] (1/2)

• Idea
  • The lack of data in one modality (e.g., images) can be compensated by abundant data in the other modality (e.g., text), through previously learned alignments between the two modalities
  • In this paper, fine-grained images with detailed textual descriptions are used to build a text-conditional GAN for image generation
  • Generated images should be not only realistic but also class-discriminative

Hallucination Approaches (learning to augment)

[7] F. Pahde et al., "Cross-modal Hallucination for Few-shot Fine-grained Recognition," CVPR Workshop, 2018

Page 41

Cross-Modal Hallucination [7] (2/2)

• Discriminative text-conditional GAN (tcGAN)
  • First, train a tcGAN on samples from 𝒞base with the regular objective function:
  • Next, augment ℒ_tcGAN by adding a class-discriminative loss (similar to ACGAN) and fine-tune the tcGAN on the few-shot samples from 𝒞novel with the compound losses:
  • D* = argmax_D ℒ(D) and G* = argmin_G ℒ(G)

Hallucination Approaches (learning to augment)

[7] F. Pahde et al., "Cross-modal Hallucination for Few-shot Fine-grained Recognition," CVPR Workshop, 2018

ℒ_tcGAN(G, D) = E_{I,T}[log D(I, T)] + E_{z,T}[log(1 − D(G(z, T), T))]

ℒ(D) = ℒ_tcGAN(G, D) + E[P(c | I)]
ℒ(G) = ℒ_tcGAN(G, D) − E[P(c | G(z, T))]

Select the top-scored generated images as scored by D*.

T: text embedding; I: image embedding; c: class label

Page 42

Remarks

• The aforementioned hallucination approaches (Attribute-Guided Augmentation (AGA) [6] and Cross-Modal Hallucination (CMH) [7]) leveraged datasets with expensive annotations

• Furthermore, the modes of intra-class variation of generated images or features come from fixed pre-specified rules (AGA: augmenters with fixed target attribute values; CMH: pre-specified instance-level textual descriptions)

Hallucination Approaches (learning to augment)

[6] M. Dixit et al., "AGA: Attribute-Guided Augmentation," CVPR, 2017
[7] F. Pahde et al., "Cross-modal Hallucination for Few-shot Fine-grained Recognition," CVPR Workshop, 2018

Page 43

Data Augmentation GAN [8] (1/2)

• Idea
  • Typical data augmentation techniques use a very limited set of a priori known invariances (e.g., translations, rotations, flips, addition of Gaussian noise, etc.) that are easy to invoke
  • We can learn a model of a larger invariance space by training a conditional GAN in the source domain (𝒞base) and applying it to the target domain (𝒞novel)

Hallucination Approaches (learning to augment)

[8] A. Antoniou et al., "Data Augmentation Generative Adversarial Networks," ICLR Workshop, 2018

(Figure: source domain 𝒞base → target domain 𝒞novel.)

Page 44

Data Augmentation GAN [8] (2/2)

• Data Augmentation GAN

Hallucination Approaches (learning to augment)

[8] A. Antoniou et al., "Data Augmentation Generative Adversarial Networks," ICLR Workshop, 2018

(Left) Generator: r_i = Enc(x_i), z_i ~ N(0, I), x_g = Dec(z_i, r_i)

(Right) Discriminator: D(x_i, x_j) — real pair; D(x_i, x_g) — fake pair

Why not just x_j and x_g? To prevent the generator from simply autoencoding the original image x_i, and thereby improve diversity.
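The pairing logic above can be sketched as follows (enc, dec, and disc are hypothetical modules standing in for the DAGAN encoder, decoder, and discriminator):

import torch

def dagan_forward(x_i, x_j, enc, dec, disc):
    r_i = enc(x_i)                       # lower-dimensional representation of the source image
    z_i = torch.randn_like(r_i)          # z_i ~ N(0, I)
    x_g = dec(z_i, r_i)                  # generated (augmented) image
    d_real = disc(x_i, x_j)              # real pair: two genuine images of the same class
    d_fake = disc(x_i, x_g)              # fake pair: source image plus generated image
    return x_g, d_real, d_fake           # feed into a standard adversarial loss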

Page 45

Hallucination by Analogy [2] (1/2)

• Idea
  • Modern recognition models are trained on large labeled datasets like ImageNet
  • In many realistic scenarios, the trained model may encounter novel classes that it also needs to recognize, but with very few training examples available
  • To deal with the above challenges faced by recognition systems in the wild, the authors proposed an FSL benchmark with two phases:

Hallucination Approaches (learning to augment)

[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017

H: data hallucinator

Page 46

Hallucination by Analogy [2] (2/2)

• Analogy-based data hallucinator
  • Train H using analogy quadruplets (a_1, a_2, b_1, b_2), where (a_1, a_2) belong to one class, (b_1, b_2) belong to another class, and the analogy a_1 : a_2 :: b_1 : b_2 holds
  • Goal:

Hallucination Approaches (learning to augment)

[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017

H(a_1, a_2, b_1) = b̂_2, where (a_1, a_2) is sampled from a base class, b_1 is a novel-class sample, and b̂_2 is the hallucinated sample; the training quadruplets (a_1, a_2, b_1, b_2) are collected from the base classes.
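A small sketch of such a hallucinator as an MLP on concatenated features (illustrative; the paper's H may differ in architecture and training details):

import torch
import torch.nn as nn

class AnalogyHallucinator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, a1, a2, b1):
        # "a1 is to a2 as b1 is to ?": predict the missing fourth element of the analogy
        return self.net(torch.cat([a1, a2, b1], dim=-1))

# Training: minimize || H(a1, a2, b1) - b2 || over quadruplets from the base classes;
# at few-shot time, (a1, a2) comes from a base class and b1 is a novel-class sample.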

Page 47

Jointly Trained Hallucinator [5]

• Idea
  • The hallucinated examples should be useful for the classification task; examples that are merely diverse or realistic may still fail to improve FSL performance
  • The authors proposed to train a conditional-GAN-based data hallucinator G(x, z) jointly with the meta-learning module h in an end-to-end manner

Hallucination Approaches (learning to augment)

back-propagation

forward pass

[5] Y.-X. Wang et al., "Low-Shot Learning from Imaginary Data," CVPR, 2018

Page 48

Semantics-Guided Hallucination [9] (1/5)

• Idea
  • Most existing data-level FSL approaches do not explicitly consider the relationships between the semantic concepts of different object categories during hallucination
  • Object categories with similar semantic concepts should share similar data distributions
  • We can incorporate semantic information into the hallucination process, so that the augmented data exhibit semantics-oriented modes of variation

Hallucination Approaches (learning to augment)

[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 49

Semantics-Guided Hallucination [9] (2/5)

• Two-phase learning framework following [2] and [5]

φ: feature extractor; W: base-class classifier; H: hallucinator; h: meta-learning module; V: FSL (base + novel) classifier

Hallucination Approaches (learning to augment)

[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017
[5] Y.-X. Wang et al., "Low-Shot Learning from Imaginary Data," CVPR, 2018
[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 50

Semantics-Guided Hallucination [9] (3/5)

• The semantics-guided hallucinator (right) involves an additional noise generator (an encoder E followed by a sampler) that produces conditioned noise vectors, as if they were sampled from a semantics-dependent distribution

Hallucination Approaches (learning to augment)

R_S: 300-dim word-embedding vectors of label names, 85-dim attribute vectors for the Animals with Attributes dataset, etc.

[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 51

Semantics-Guided Hallucination [9] (4/5)

• Quantitative results: top-5 accuracy on 𝒞^fin

Evaluation on 𝒞_novel^fin
Method        1-shot (std)   2-shot (std)   5-shot (std)
Baseline      28.96 (3.86)   46.20 (4.17)   60.85 (1.20)
Analogy [2]   38.51 (4.95)   49.76 (2.44)   65.27 (2.31)
Superclass    46.76 (2.56)   57.01 (1.76)   67.86 (1.76)
Meta [5]      47.65 (4.13)   59.65 (2.48)   70.82 (2.04)
Meta* [5]     49.95 (5.15)   58.64 (3.09)   71.15 (1.43)
SGH [9]       55.84 (2.10)   66.28 (2.13)   74.14 (1.40)

Evaluation on 𝒞_novel^fin ∪ 𝒞_base^fin
Method        1-shot (std)   2-shot (std)   5-shot (std)
Baseline      67.36 (1.54)   73.59 (1.55)   80.35 (0.48)
Analogy [2]   71.65 (1.97)   75.83 (0.89)   81.34 (0.92)
Superclass    74.65 (1.04)   78.44 (0.72)   82.30 (0.68)
Meta [5]      70.88 (1.47)   77.07 (1.23)   81.86 (0.85)
Meta* [5]     74.36 (2.02)   77.60 (1.21)   81.67 (0.52)
SGH [9]       76.68 (0.83)   80.18 (0.81)   82.95 (0.54)

Baseline: no hallucination; just repeat the 1, 2, or 5 samples per novel class at hand
Analogy: analogy-based hallucination proposed in [2]
Superclass: considers superclass information using the analogy-based hallucination proposed in [2]
Meta: GAN-based hallucinator trained by meta-learning, proposed in [5]
Meta*: almost the same as Meta, with one more dense layer
SGH: semantics-guided hallucination using embedding vectors of label names, proposed in [9]

Hallucination Approaches (learning to augment)

[2] B. Hariharan et al., "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017
[5] Y.-X. Wang et al., "Low-Shot Learning from Imaginary Data," CVPR, 2018
[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 52

Semantics-Guided Hallucination [9] (5/5)

• Qualitative results: visualization of hallucinated features

Our model can generate additional data that possess more semantics-dependent modes of variation: trees (pine and willow) are distributed along a line, while medium-sized mammals (porcupine and raccoon) are distributed in a region that is more like an ellipse.

Hallucination Approaches (learning to augment)

[9] C.-C. Lin et al., "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019

Page 53

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 54

Meta Learning

• A learning-to-learn approach
  • "Aims at training a second network (meta-learner, F) to extract data from a base network (learner or classifier, f), so that classification at a meta level can be performed" [10]
  • "Meta-learning suggests framing the learning problem at two levels. The first is quick acquisition of knowledge within each separate task presented. This process is guided by the second, which involves slower extraction of information learned across all the tasks." [11]

Meta-Learning Approaches

[10] W.-H. Chu et al., "Learning Semantics-Guided Visual Attention for Few-shot Image Classification," ICIP, 2018
[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

The meta-learner F maps a training set to a learner: f = F(D_train); the learner f then makes predictions f(x̂) for x̂ ∈ D_test.

Page 55

Episode

• An episode is an N-way k-shot task consisting of a support set S = {(x_i, y_i)}_{i=1}^{Nk} and a query set Q = {x̂_j}
• Further "training" is needed within each episode
• Episode-based training makes training more faithful to the test environment

Meta-Learning Approaches

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: training episodes (tasks) and testing episodes (tasks) are built from disjoint sets of classes; each episode has a support set S with N = 5 and k = 1, and a query set Q whose examples each belong to one of the N classes in S; a different set of N classes is used across episodes.)
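A minimal sketch of how such an N-way k-shot episode can be sampled from a labeled dataset (pure NumPy; the value of n_query is illustrative):

import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, n_query=15, seed=None):
    rng = np.random.default_rng(seed)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for episode_label, c in enumerate(classes):          # labels are re-indexed 0..N-1 within the episode
        idx = rng.permutation(np.where(labels == c)[0])
        support += [(features[i], episode_label) for i in idx[:k_shot]]
        query += [(features[i], episode_label) for i in idx[k_shot:k_shot + n_query]]
    return support, query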

Page 56

Episode

• An episode is an N-way k-shot task consisting of a support set S = {(x_i, y_i)}_{i=1}^{Nk} and a query set Q = {x̂_j}
• Further "training" is needed within each episode
• Episode-based training makes training more faithful to the test environment

Meta-Learning Approaches

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: same episode setup as on the previous slide.)

In some literature, the training episodes are called the meta-training set and the testing episodes the meta-testing set; within each episode, the support set is called the training set and the query set the testing set.

Page 57

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 58

Optimization as a Model [11] (1/2)

• Idea
  • Updating model parameters by gradient descent ≈ updating the cell state in an LSTM:
    • Model parameter θ_t ↔ cell state c_t
    • Learning rate α_t ↔ input gate i_t (allows different learning rates for different dimensions)
    • (New) shrinking rate to forget the previous parameter θ_{t−1} ↔ forget gate f_t
    • Gradient ∇_{θ_{t−1}}ℒ_t ↔ minus the input activation, −c̃_t
  • Thus, the authors proposed training a meta-learner LSTM to learn an update rule for training a learner neural network (e.g., an image classifier)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

θ_t = θ_{t−1} − α_t ∇_{θ_{t−1}}ℒ_t
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
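A toy, coordinate-wise version of this analogy (a sketch only; the gates are what the meta-learner LSTM learns, and the candidate cell state is the negative gradient):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def meta_learner_step(theta_prev, grad, loss, W_i, b_i, W_f, b_f, i_prev, f_prev):
    i_t = sigmoid(W_i @ np.array([grad, loss, theta_prev, i_prev]) + b_i)  # learned "learning rate"
    f_t = sigmoid(W_f @ np.array([grad, loss, theta_prev, f_prev]) + b_f)  # learned shrinking rate
    theta_t = f_t * theta_prev - i_t * grad        # cell-state update with candidate -grad
    return theta_t, i_t, f_t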

Page 59

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

Page 60

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

At each time step, the meta-learner (LSTM) proposes new parameters θ_t for the learner M by (with ∇_t ≡ ∇_{θ_{t−1}}ℒ_t):

i_t = σ(W_I · [∇_t, ℒ_t, θ_{t−1}, i_{t−1}] + b_I)
f_t = σ(W_F · [∇_t, ℒ_t, θ_{t−1}, f_{t−1}] + b_F)
θ_t = f_t ⊙ θ_{t−1} − i_t ⊙ ∇_t

Page 61

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

At each time step, the meta-learner (LSTM) proposes new parameters θ_t for the learner M by (with ∇_t ≡ ∇_{θ_{t−1}}ℒ_t):

i_t = σ(W_I · [∇_t, ℒ_t, θ_{t−1}, i_{t−1}] + b_I)
f_t = σ(W_F · [∇_t, ℒ_t, θ_{t−1}, f_{t−1}] + b_F)
θ_t = f_t ⊙ θ_{t−1} − i_t ⊙ ∇_t

[Note] The LSTM parameters (W_I, b_I, W_F, b_F) are not updated by (∇_t, ℒ_t).

Page 62

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

At each time step, the meta-learner (LSTM) proposes new parameters θ_t for the learner M by (with ∇_t ≡ ∇_{θ_{t−1}}ℒ_t):

i_t = σ(W_I · [∇_t, ℒ_t, θ_{t−1}, i_{t−1}] + b_I)
f_t = σ(W_F · [∇_t, ℒ_t, θ_{t−1}, f_{t−1}] + b_F)
θ_t = f_t ⊙ θ_{t−1} − i_t ⊙ ∇_t

[Note] The LSTM parameters (W_I, b_I, W_F, b_F) are not updated by (∇_t, ℒ_t).

After T steps, the LSTM is updated based on the loss computed with the final parameters of the learner (θ_{T+1}) using (X, Y) ∈ Q.

Trainable LSTM parameters: W_I, b_I, W_F, b_F, and also θ_0.

Page 63

Optimization as a Model [11] (2/2)

• Given a training task (S, Q)

• Given a testing task (S′, Q′), just use the trained LSTM and S′ to update the learner M, obtain θ_{T+1}, and predict the labels of X ∈ Q′

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[11] S. Ravi et al., "Optimization as a Model for Few-Shot Learning," ICLR, 2017

(Figure: the meta-learner LSTM proposes parameters for the learner M; the initial cell state is θ_0; inputs are drawn first from S and then from Q.)

At each time step, the meta-learner (LSTM) proposes new parameters θ_t for the learner M by (with ∇_t ≡ ∇_{θ_{t−1}}ℒ_t):

i_t = σ(W_I · [∇_t, ℒ_t, θ_{t−1}, i_{t−1}] + b_I)
f_t = σ(W_F · [∇_t, ℒ_t, θ_{t−1}, f_{t−1}] + b_F)
θ_t = f_t ⊙ θ_{t−1} − i_t ⊙ ∇_t

[Note] The LSTM parameters (W_I, b_I, W_F, b_F) are not updated by (∇_t, ℒ_t).

After T steps, the LSTM is updated based on the loss computed with the final parameters of the learner (θ_{T+1}) using (X, Y) ∈ Q.

Trainable LSTM parameters: W_I, b_I, W_F, b_F, and also θ_0.

Page 64

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

Page 65

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

The model adapts to task 𝒯_i using a single gradient update based on S (few-shot).

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

Page 66

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

The model adapts to task 𝒯_i using a single gradient update based on S (few-shot).
The meta-objective function is defined across multiple tasks (evaluated on Q, which has larger amounts of data).

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

Page 67

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

The model adapts to task 𝒯_i using a single gradient update based on S (few-shot).
The meta-objective function is defined across multiple tasks (evaluated on Q, which has larger amounts of data).
θ is then updated by gradient descent based on the meta-objective.

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

Page 68

Model-Agnostic Fast Adaptation [12] (1/3)

• Idea
  • Fast adaptation: learn a good weight initialization that can be fine-tuned efficiently
    • Feature-learning standpoint: make the representation suitable for many tasks, such that slightly fine-tuning the parameters can produce good results
    • Dynamical-systems standpoint: make the parameters lie in a region with high sensitivity to the loss functions of new tasks, such that small changes in the parameters can lead to large improvements
  • Model-agnostic: compatible with any model trained with gradient descent (classification, regression, reinforcement learning, etc.)

• Model-agnostic meta learning (MAML)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

(Figure: from the initialization θ, one gradient step per task yields θ_1′, θ_2′, θ_3′, which minimize ℒ_𝒯1(f_θ1′), ℒ_𝒯2(f_θ2′), ℒ_𝒯3(f_θ3′).)

MAML optimizes the model parameters θ such that they can quickly adapt to new tasks:
the model adapts to task 𝒯_i using a single gradient update based on S (few-shot);
the meta-objective function is defined across multiple tasks (evaluated on Q, which has larger amounts of data);
θ is then updated by gradient descent based on the meta-objective.
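The two nested updates can be sketched for a tiny linear regressor as follows (a simplified one-inner-step version for a single batch of tasks; not the authors' implementation):

import torch

def maml_outer_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    # theta: parameter tensor with requires_grad=True; tasks: list of (xs, ys, xq, yq) tensors
    meta_loss = 0.0
    for xs, ys, xq, yq in tasks:
        inner_loss = ((xs @ theta - ys) ** 2).mean()                 # loss on the support set S
        grad, = torch.autograd.grad(inner_loss, theta, create_graph=True)
        theta_i = theta - inner_lr * grad                            # one-step adapted parameters
        meta_loss = meta_loss + ((xq @ theta_i - yq) ** 2).mean()    # evaluated on the query set Q
    outer_grad, = torch.autograd.grad(meta_loss, theta)
    with torch.no_grad():
        theta -= outer_lr * outer_grad                               # update the shared initialization
    return theta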

Page 69

Model-Agnostic Fast Adaptation [12] (2/3)

• Sinusoidal regression: y = A · sin(x + B)
  • Training tasks: random sinusoidal functions with A ∈ [0.1, 5.0] and B ∈ [0, π], each with 10 observed points x sampled uniformly from [−5.0, 5.0]
  • Testing tasks: the ground-truth functions shown in the plots (curves) with K = 5 or 10 observed points (dots)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

(Figure legend: model trained on all training tasks with many steps of gradient descent; MAML trained with one-step gradient descent.)

Page 70

Model-Agnostic Fast Adaptation [12] (3/3)

• Quantitative results of sinusoidal regression (MSE)

• Remark
  • Does not work well for multimodal tasks (e.g., not only sinusoidal functions, but a mix of sinusoidal, linear, and quadratic functions)

Meta-Learning Approaches – Initialization-based methods (learning to fine-tune)

[12] C. Finn et al., "Model-Agnostic Meta Learning for Fast Adaptation of Deep Networks," ICML 2017

Page 71

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantics-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 72

Metric Learning

• "Many non-parametric models allow novel examples to be rapidly assimilated, whilst not suffering from catastrophic forgetting. Some models in this family (e.g., nearest neighbors) do not require any training but performance depends on the chosen metric" [3]
  • Euclidean distance between two feature vectors x and y: (x − y)^T (x − y)
  • Mahalanobis distance between two feature vectors x and y of the same distribution with covariance matrix Σ: (x − y)^T Σ^{−1} (x − y)
  • Learning a metric is effectively learning a linear transformation A of the input space, such that d(x, y) = (x − y)^T A^T A (x − y) = (Ax − Ay)^T (Ax − Ay), on which kNN performs well

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016
[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017
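A quick numerical check of the identity above (the metric induced by A^T A equals the squared Euclidean distance after applying A):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x, y = rng.standard_normal(3), rng.standard_normal(3)
d_metric = (x - y) @ A.T @ A @ (x - y)        # (x - y)^T A^T A (x - y)
d_euclid = np.sum((A @ x - A @ y) ** 2)       # ||Ax - Ay||^2
assert np.isclose(d_metric, d_euclid)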

Page 73

Siamese Networks [1] (1/3)

• Idea
  • Networks that do well at verification (identifying whether an input pair belongs to the same class or to different classes) should generalize to one-shot classification
  • Thus, the authors proposed to train a verification network and apply it to one-shot learning tasks during testing

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

Page 74

• Siamese network: twin networks and a common final layer

Siamese Networks [1] (2/3)

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

Page 75

Siamese Networks [1] (2/3)

• Siamese network: twin networks and a common final layer

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

The twin networks (with shared weights) extract features h_{1,L−1} and h_{2,L−1}.

Page 76

Siamese Networks [1] (2/3)

• Siamese network: twin networks and a common final layer

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

The twin networks (with shared weights) extract features h_{1,L−1} and h_{2,L−1}; the final layer computes the component-wise L1 distance between these highest-level feature representations, followed by a sigmoid with trainable weighting parameters.

Page 77

Siamese Networks [1] (2/3)

• Siamese network: twin networks and a common final layer

• Training: supervised learning with a regularized (negative) cross-entropy objective; the target is 1 for positive pairs and 0 for negative pairs

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

The twin networks (with shared weights) extract features h_{1,L−1} and h_{2,L−1}; the final layer computes the component-wise L1 distance between these highest-level feature representations, followed by a sigmoid with trainable weighting parameters.

Page 78

Siamese Networks [1] (3/3)

• Siamese network: twin networks and a common final layer

• Testing: given a testing task (S, Q) with support set S = {(x_i, y_i)}_{i=1}^{k}, predict the label of the query example x̂ ∈ Q as y_m, where x_m is the support example with the highest similarity score: m = argmax_i p(x_i, x̂)

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015

p(x_i, x̂) = σ(Σ_j α_j |h_{i,L−1}^{(j)} − ĥ_{L−1}^{(j)}|), where h_{i,L−1} is the feature of the support example x_i ∈ S = {(x_i, y_i)}_{i=1}^{k} and ĥ_{L−1} is the feature of the query example x̂.
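Test-time prediction with the trained verification network can then be sketched as follows (illustrative tensor shapes; alpha stands for the learned final-layer weights):

import torch

def predict_one_shot(query_feat, support_feats, support_labels, alpha):
    # query_feat: (D,); support_feats: (k, D); alpha: (D,)
    scores = torch.sigmoid(((support_feats - query_feat).abs() * alpha).sum(dim=1))
    return support_labels[scores.argmax()]    # the most similar support example gives the label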

Page 79

Prototypical Networks [13]

• Idea
  • Assumption: there exists an embedding space in which features belonging to the same class cluster together around a single prototype for that class
  • Thus, the authors proposed to learn a non-linear mapping f_φ and take the mean representation c_k of each class as its prototype in the embedding space:

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017

(Figure: the support set S = {(x_i, y_i)}_{i=1}^{k} is mapped through f_φ into the embedding space.)

Page 80

Prototypical Networks [13]

• Idea
  • Assumption: there exists an embedding space in which features belonging to the same class cluster together around a single prototype for that class
  • Thus, the authors proposed to learn a non-linear mapping f_φ and take the mean representation c_k of each class as its prototype in the embedding space:

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017

c_k = (1 / |S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i), where S_k ⊂ S is the subset of the support set S with class k

(Figure: class prototypes c_1, c_2, c_3 in the embedding space; similar to the centers in center-loss methods.)

Page 81: VISION & LEARNING LABvllab.ee.ntu.edu.tw/uploads/1/1/1/6/111696467/dlcv_w16.pdfOutline for Few-Shot Learning • Introduction • Hallucination Approaches (learning to augment) •

Prototypical Networks [13]

• Idea
  • Assumption: there exists an embedding space in which features of the same class cluster together around a single prototype for that class
  • Thus, the authors proposed to learn a non-linear mapping $f_\phi$ and take the mean representation $\mathbf{c}_k$ of each class as its prototype in the embedding space:

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017

Classification of the query example $\hat{x}$: $p_\phi(y = k \mid \hat{x}) = \frac{\exp(-d(f_\phi(\hat{x}), \mathbf{c}_k))}{\sum_{k'} \exp(-d(f_\phi(\hat{x}), \mathbf{c}_{k'}))}$, where $d(\cdot, \cdot)$ is the Euclidean distance

[Figure: the query example $\hat{x}$ is embedded by $f_\phi$ and assigned to the nearest prototype among $\mathbf{c}_1$, $\mathbf{c}_2$, $\mathbf{c}_3$ (similar to clustering)]
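A minimal PyTorch sketch of the two steps above, mean prototypes followed by a softmax over negative Euclidean distances; the embedding network $f_\phi$ is omitted, random embeddings stand in for its outputs, and the shapes and helper name prototypical_predict are assumptions.

```python
import torch
import torch.nn.functional as F

def prototypical_predict(support_emb, support_labels, query_emb, num_classes):
    """Prototypical Networks classification (minimal sketch).

    support_emb   : (n_support, d) embedded support features f_phi(x_i)
    support_labels: (n_support,)   class indices in [0, num_classes)
    query_emb     : (n_query, d)   embedded query features f_phi(x_hat)
    """
    # prototype c_k = mean embedding of the support examples of class k
    prototypes = torch.stack([support_emb[support_labels == k].mean(dim=0)
                              for k in range(num_classes)])   # (num_classes, d)
    # Euclidean distance between each query embedding and each prototype
    dists = torch.cdist(query_emb, prototypes)                 # (n_query, num_classes)
    # class posterior: softmax over the negative distances
    return F.softmax(-dists, dim=1)

# toy usage: a 3-way 5-shot episode with random embeddings (f_phi omitted)
emb = torch.randn(15, 64)
labels = torch.arange(3).repeat_interleave(5)
queries = torch.randn(4, 64)
print(prototypical_predict(emb, labels, queries, num_classes=3).shape)  # (4, 3)
```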

Page 82

Matching Networks [3] (1/3)

• Idea
  • Inspired by recent advances in attention mechanisms, which access an augmented memory containing information useful for the task at hand
  • Thus, the authors proposed a weighted nearest-neighbor classifier that uses an attention mechanism over a learned embedding of the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ to predict the label of the query example $\hat{x}$ as:

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016

$\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$, with $a(\hat{x}, x_i) = \frac{\exp(c(f(\hat{x}), g(x_i)))}{\sum_{j=1}^{k} \exp(c(f(\hat{x}), g(x_j)))}$ and $c(\cdot, \cdot)$ the cosine similarity

[Figure: attention over the embedded support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ given the query example $\hat{x}$]
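A minimal PyTorch sketch of this attention-weighted nearest-neighbor rule in its simple form $g = f$ (discussed on the next slide); random embeddings replace the learned embedding functions, and the helper name matching_predict is an assumption.

```python
import torch
import torch.nn.functional as F

def matching_predict(support_emb, support_onehot, query_emb):
    """Matching Networks prediction (minimal sketch, simple form g = f).

    support_emb   : (k, d) embedded support examples g(x_i)
    support_onehot: (k, C) one-hot labels y_i
    query_emb     : (d,)   embedded query example f(x_hat)
    """
    # attention kernel: softmax over the cosine similarities c(f(x_hat), g(x_i))
    sims = F.cosine_similarity(query_emb.unsqueeze(0), support_emb, dim=1)  # (k,)
    attention = F.softmax(sims, dim=0)                                      # (k,)
    # weighted nearest neighbor: y_hat = sum_i a(x_hat, x_i) * y_i
    return attention @ support_onehot                                       # (C,) class probabilities

# toy 5-way 1-shot episode with random embeddings
support_emb = torch.randn(5, 64)
support_onehot = torch.eye(5)
query_emb = torch.randn(64)
print(matching_predict(support_emb, support_onehot, query_emb))
```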

Page 83

Matching Networks [3] (2/3)

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[1] G. Koch et al., "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015
[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017
[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016

• Simple form: $g = f$
  • Similar to the Siamese network [1]
  • Also similar to the prototypical network [13] in the one-shot setting

[Figure: embedded support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ and query example $\hat{x}$]

Page 84

Matching Networks [3] (3/3)

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[3] O. Vinyals et al., "Matching Networks for One Shot Learning," NIPS, 2016

• Full context embedding (FCE)
  • Each element in $S$ should not be embedded independently of the other elements
  • $g(x_i) \rightarrow g(x_i, S)$: a bidirectional LSTM that treats the whole support set $S$ as a sequence
  • Also, $S$ should be able to modify how we embed $\hat{x}$
  • $f(\hat{x}) \rightarrow f(\hat{x}, S)$: an LSTM with read-attention over $g(S)$, i.e., $\mathrm{attLSTM}(f'(\hat{x}), g(S), K)$, where $f'(\hat{x})$ is the (fixed) CNN feature and $K$ is the number of unrolling steps (a minimal sketch of the bidirectional-LSTM part follows below)

• Experiment results on miniImageNet

[Figure: the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ is embedded with a bidirectional LSTM, and the query example $\hat{x}$ with an attLSTM]
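A minimal PyTorch sketch of the bidirectional-LSTM part of FCE, producing $g(x_i, S)$ as the per-example feature $g'(x_i)$ plus the forward and backward LSTM outputs over the support set; the feature size and random support features are assumptions, and the attLSTM read-attention over $g(S)$ is omitted.

```python
import torch
import torch.nn as nn

# bidirectional LSTM over the support set treated as a sequence (sizes are illustrative assumptions)
feat_dim = 64
bilstm = nn.LSTM(input_size=feat_dim, hidden_size=feat_dim, bidirectional=True, batch_first=True)

def full_context_embedding(support_feats):
    """support_feats: (k, feat_dim) per-example features g'(x_i) of the whole support set S."""
    out, _ = bilstm(support_feats.unsqueeze(0))           # (1, k, 2 * feat_dim)
    fwd, bwd = out[0, :, :feat_dim], out[0, :, feat_dim:]  # forward / backward hidden states
    return support_feats + fwd + bwd                       # g(x_i, S): each element conditioned on all of S

print(full_context_embedding(torch.randn(5, feat_dim)).shape)   # (5, 64)
```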

Page 85

Relation Networks [14]

• Idea
  • Other metric-learning approaches focus on learning a transferable embedding function with a fixed metric (e.g., Euclidean distance, cosine similarity, …)
  • The authors proposed to train a Relation Network (RN) to explicitly learn a transferable deep distance metric that compares the relation between images

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[14] F. Sung et al., "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018

[Figure: the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ and the query example $\hat{x}$ pass through an embedding module and a relation module; features are concatenated and a relation score is computed]

Page 86

Relation Networks [14]

• Idea
  • Other metric-learning approaches focus on learning a transferable embedding function with a fixed metric (e.g., Euclidean distance, cosine similarity, …)
  • The authors proposed to train a Relation Network (RN) to explicitly learn a transferable deep distance metric that compares the relation between images

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[14] F. Sung et al., "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018

[Figure: the embedding module maps $x_i \rightarrow f_\varphi(x_i)$ and $\hat{x} \rightarrow f_\varphi(\hat{x})$ for the support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ and the query example $\hat{x}$; features are concatenated and a relation score is computed by the relation module]

Page 87

Relation Networks [14]

• Idea
  • Other metric-learning approaches focus on learning a transferable embedding function with a fixed metric (e.g., Euclidean distance, cosine similarity, …)
  • The authors proposed to train a Relation Network (RN) to explicitly learn a transferable deep distance metric that compares the relation between images

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[14] F. Sung et al., "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018

[Figure: the embedded features of the support example $x_i \in S = \{(x_i, y_i)\}_{i=1}^{k}$ and the query example $\hat{x}$ are concatenated, $\mathcal{C}(f_\varphi(x_i), f_\varphi(\hat{x}))$, before entering the relation module]

Page 88

Relation Networks [14]

• Idea
  • Other metric-learning approaches focus on learning a transferable embedding function with a fixed metric (e.g., Euclidean distance, cosine similarity, …)
  • The authors proposed to train a Relation Network (RN) to explicitly learn a transferable deep distance metric that compares the relation between images

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[14] F. Sung et al., "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018

[Figure: the relation module $g_\phi$ computes the relation score $r = g_\phi(\mathcal{C}(f_\varphi(x_i), f_\varphi(\hat{x})))$ from the concatenated embeddings of the support example $x_i \in S = \{(x_i, y_i)\}_{i=1}^{k}$ and the query example $\hat{x}$]
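A minimal PyTorch sketch of the relation-score computation $r = g_\phi(\mathcal{C}(f_\varphi(x_i), f_\varphi(\hat{x})))$: the concatenation operator $\mathcal{C}$ followed by a small learned relation module. The two-layer MLP, its sizes, and the random embeddings standing in for $f_\varphi$ are illustrative assumptions (the paper's modules are convolutional).

```python
import torch
import torch.nn as nn

# relation module g_phi: a learned deep distance metric producing a score in [0, 1]
# (sizes and the MLP form are illustrative assumptions; f_phi is omitted)
feat_dim = 64
g_phi = nn.Sequential(
    nn.Linear(2 * feat_dim, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid())

def relation_score(support_feat, query_feat):
    """r = g_phi( C( f_phi(x_i), f_phi(x_hat) ) ), with C = feature concatenation."""
    pair = torch.cat([support_feat, query_feat], dim=-1)   # operator C: concatenate the two embeddings
    return g_phi(pair)

# toy usage with random embeddings in place of f_phi outputs
support_feat = torch.randn(feat_dim)
query_feat = torch.randn(feat_dim)
print(relation_score(support_feat, query_feat))   # relation score between x_i and x_hat
```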

Page 89

Remarks

• Some works can be extended to zero-shot learning: the support set contains a semantic embedding vector ($\mathbf{v}_k$) for each of the training classes, instead of few-shot images

• In this case, we can use a second heterogeneous embedding function to embed the semantic embedding vectors

• Prototypical Networks: $\mathbf{c}_k = g_\vartheta(\mathbf{v}_k)$, i.e., the prototype of class $k$ is computed from its semantic embedding vector instead of the mean of the image embeddings $f_\phi(\mathbf{x}_i)$

• Relation Networks: $r = g_\phi(\mathcal{C}(f_\varphi(x_i), f_\varphi(\hat{x}))) \;\rightarrow\; r = g_\phi(\mathcal{C}(f_{\varphi_2}(\mathbf{v}_k), f_{\varphi_1}(\hat{x})))$

Meta-Learning Approaches – Metric-learning methods (learning to compare)

[13] J. Snell et al., "Prototypical Networks for Few-Shot Learning," NIPS, 2017

[Figure: a second embedding function $g_\vartheta$ maps the semantic vectors $\mathbf{v}_k$ to prototypes $\mathbf{c}_k = g_\vartheta(\mathbf{v}_k)$, which are compared with the image embeddings $f_\phi(\mathbf{x}_i)$]
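A minimal PyTorch sketch of the zero-shot variant of Prototypical Networks described above: prototypes come from an embedding of the class semantic vectors rather than from support images. The linear embedding for $g_\vartheta$, the dimensions, and the random semantic vectors are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

sem_dim, feat_dim, num_classes = 300, 64, 5
g_theta = nn.Linear(sem_dim, feat_dim)        # second, heterogeneous embedding function g_theta

v = torch.randn(num_classes, sem_dim)         # semantic embedding vectors v_k (e.g., word vectors)
prototypes = g_theta(v)                       # c_k = g_theta(v_k): one prototype per class
query_emb = torch.randn(3, feat_dim)          # image embeddings f_phi(x_hat) of three query examples

dists = torch.cdist(query_emb, prototypes)    # Euclidean distance to each class prototype
print(F.softmax(-dists, dim=1))               # zero-shot class posteriors
```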

Page 90

Outline

• Introduction
• Hallucination Approaches (learning to augment)
  • Attribute-Guided Augmentation
  • GAN-based hallucination
  • Data Augmentation GAN
  • Hallucinating by Analogy
  • Jointly Trained Hallucinator
  • Semantic-Guided Hallucination
• Meta-Learning Approaches
  • Initialization-based methods (learning to fine-tune)
    • Optimization as a Model
    • Meta Networks
    • Model-Agnostic Meta-Learning
  • Metric-learning methods (learning to compare)
    • Siamese Networks
    • Prototypical Networks
    • Matching Networks
    • Relation Networks
• Recent Topics

Page 91

A Closer Look at FSL [15]

• Idea
  • The authors presented a consistent comparative analysis of several representative FSL methods and found that
    • Deeper backbones significantly reduce the performance gap across methods when domain differences are limited
    • A slightly modified baseline method can surprisingly achieve competitive performance (a minimal adaptation sketch follows below)
    • As the domain difference grows larger, adaptation based on a few novel-class instances becomes more important, and the simple baseline outperforms representative FSL methods

Meta-Learning Approaches – Recent Topics

[15] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, "A Closer Look at Few-shot Classification," ICLR, 2019
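A minimal PyTorch sketch of the simple baseline's adaptation step: a backbone pre-trained on the base classes is frozen, and only a new linear classifier is fit on the few labelled novel-class examples. The identity stand-in for the backbone, the feature size, the learning rate, and the number of update steps are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_dim, n_way = 512, 5
backbone = nn.Identity()                      # stand-in for a frozen feature extractor pre-trained on base classes
classifier = nn.Linear(feat_dim, n_way)       # the only trainable part at novel-class adaptation time
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

support_x = torch.randn(n_way * 5, feat_dim)              # 5-way 5-shot support set (random stand-ins)
support_y = torch.arange(n_way).repeat_interleave(5)

for _ in range(100):                                       # adapt on the novel support set only
    optimizer.zero_grad()
    loss = loss_fn(classifier(backbone(support_x)), support_y)
    loss.backward()
    optimizer.step()
print(loss.item())
```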

Page 92

Multimodal MAML [16] (1/3)

• Idea
  • MAML seeks a single initialization shared across the entire task distribution (e.g., sinusoidal regression)
  • In multimodal regression problems, different modes (sinusoidal, linear, or quadratic) may require substantially different parameters
  • The authors proposed a multimodal MAML algorithm that identifies the mode of the task and then modulates its meta-learned prior accordingly

Meta-Learning Approaches – Recent Topics

[12] C. Finn et al., "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks," ICML, 2017
[16] R. Vuorio et al., "Toward Multimodal Model-Agnostic Meta-Learning," NIPS Workshop, 2018

[Figure: a task (support set $S$) is sampled; the original MAML [12] adapts to task-specific parameters, while the proposed method additionally modulates the meta-learned prior parameters (the $\theta_i$'s) to fit the task, e.g., by activating or deactivating certain neurons of a FC layer (a minimal modulation sketch follows below)]
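A minimal PyTorch sketch of one way such modulation can be realized: a task embedding (e.g., produced by a task encoder over the support set) is mapped to a sigmoid gate that scales, and can effectively switch off, individual neurons of a meta-learned FC layer. The layer sizes, the gating scheme, and the random task embedding are illustrative assumptions rather than the paper's exact modulation network.

```python
import torch
import torch.nn as nn

hidden_dim, task_emb_dim = 40, 32
fc = nn.Linear(1, hidden_dim)                 # meta-learned prior parameters (one FC layer of the regressor)
gate = nn.Linear(task_emb_dim, hidden_dim)    # task-conditioned modulation network

def modulated_forward(x, task_embedding):
    """Scale each neuron of the FC layer by a task-dependent gate in [0, 1],
    effectively activating or deactivating neurons for the identified task mode."""
    tau = torch.sigmoid(gate(task_embedding))   # (hidden_dim,) gating vector
    return fc(x) * tau                          # modulated pre-activation

x = torch.randn(8, 1)                          # a batch of inputs from the sampled task
task_embedding = torch.randn(task_emb_dim)     # stand-in for a task encoder's output over the support set
print(modulated_forward(x, task_embedding).shape)   # (8, hidden_dim)
```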

Page 93

Multimodal MAML [16] (2/3)

• Multimodal regression (see the task-sampling sketch below)
  • Sinusoidal: $y = A \cdot \sin(\omega \cdot x) + b$
  • Linear: $y = A \cdot x + b$
  • Quadratic: $y = A \cdot (x - c)^2 + b$

Meta-Learning Approaches – Recent Topics

[16] R. Vuorio et al., "Toward Multimodal Model-Agnostic Meta-Learning," NIPS Workshop, 2018

[Figure: regression results compared with mode-specific MAML, without any gradient update and after five steps of gradient updates]
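A minimal NumPy sketch of sampling tasks from such a multimodal regression distribution; the parameter ranges, the shot count, and the helper name sample_task are illustrative assumptions.

```python
import numpy as np

def sample_task(rng, k_shot=10):
    """Sample one few-shot regression task from a multimodal task distribution (toy sketch)."""
    mode = rng.choice(["sinusoidal", "linear", "quadratic"])
    A, b, c = rng.uniform(0.1, 5.0), rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)
    omega = rng.uniform(0.5, 2.0)
    if mode == "sinusoidal":
        f = lambda x: A * np.sin(omega * x) + b      # y = A * sin(w * x) + b
    elif mode == "linear":
        f = lambda x: A * x + b                      # y = A * x + b
    else:
        f = lambda x: A * (x - c) ** 2 + b           # y = A * (x - c)^2 + b
    x = rng.uniform(-5.0, 5.0, size=(k_shot, 1))     # support inputs of the sampled task
    return mode, x, f(x)

rng = np.random.default_rng(0)
mode, x, y = sample_task(rng)
print(mode, x.shape, y.shape)
```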

Page 94

Multimodal MAML [16] (3/3)

• Quantitative results (MSE)

• t-SNE visualization of task embeddings

Meta-Learning Approaches – Recent Topics

[16] R. Vuorio et al., "Toward Multimodal Model-Agnostic Meta-Learning," NIPS Workshop, 2018

[Table/Figure: comparison against a mode-specific baseline and across different modulation schemes]

Page 95

References
[1] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese Neural Networks for One-Shot Image Recognition," ICML Deep Learning Workshop, 2015
[2] B. Hariharan and R. Girshick, "Low-shot Visual Recognition by Shrinking and Hallucinating Features," ICCV, 2017
[3] O. Vinyals, C. Blundell, T. P. Lillicrap, K. Kavukcuoglu, and D. Wierstra, "Matching Networks for One Shot Learning," NIPS, 2016
[4] X. Wang and Y. Ye, "Zero-shot Recognition via Semantic Embeddings and Knowledge Graphs," CVPR, 2018
[5] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan, "Low-Shot Learning from Imaginary Data," CVPR, 2018
[6] M. Dixit, R. Kwitt, M. Niethammer, and N. Vasconcelos, "AGA: Attribute-Guided Augmentation," CVPR, 2017
[7] F. Pahde, P. Jahnichen, T. Klein, and M. Nabi, "Cross-modal Hallucination for Few-shot Fine-grained Recognition," CVPR Workshop, 2018
[8] A. Antoniou, A. Storkey, and H. Edwards, "Data Augmentation Generative Adversarial Networks," ICLR Workshop, 2018
[9] C.-C. Lin, Y.-C. F. Wang, C.-L. Lei, and K.-T. Chen, "Semantics-Guided Data Hallucination for Few-Shot Visual Classification," ICIP, 2019
[10] W.-H. Chu and Y.-C. F. Wang, "Learning Semantics-Guided Visual Attention for Few-shot Image Classification," ICIP, 2018
[11] S. Ravi and H. Larochelle, "Optimization as a Model for Few-Shot Learning," ICLR, 2017
[12] C. Finn, P. Abbeel, and S. Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks," ICML, 2017
[13] J. Snell, K. Swersky, and R. Zemel, "Prototypical Networks for Few-Shot Learning," NIPS, 2017
[14] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales, "Learning to Compare: Relation Network for Few-Shot Learning," CVPR, 2018
[15] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, "A Closer Look at Few-shot Classification," ICLR, 2019
[16] R. Vuorio, S.-H. Sun, H. Hu, and J. J. Lim, "Toward Multimodal Model-Agnostic Meta-Learning," NIPS Workshop, 2018

Page 96

What We’ve Learned This Semester…

• Outline
  • ML101, Image Representation & Recognition
  • Intro to NNs & CNNs
  • CNN for Classification, Detection, & Segmentation
  • Visualization of NN/CNN
  • Generative Adversarial Networks
  • Transfer Learning & Representation Disentanglement
  • Recurrent Neural Net & Its Applications
  • Learning Beyond Images / Learning from Audio-Visual Data
  • Few-Shot Learning

• Remarks
