Matching Networks for One Shot Learning


Page 1: Matching networks for one shot learning

Copyright (C) DeNA Co., Ltd. All Rights Reserved.

AI System Dept. System Management Unit

Kazuki Fujikawa

Matching Networks for One Shot Learning
https://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning

Paper introduction

NIPS2016 reading group @ Preferred Networks, 2017/01/19

Page 2: Matching networks for one shot learning

Abstract

n One-shot learning with attention and memory
⁃ Learn a concept from one or only a few training examples
⁃ Train a fully end-to-end nearest neighbor classifier, incorporating the best characteristics of both parametric and non-parametric models
⁃ Improved one-shot accuracy on ImageNet from 87.6% to 93.2% and on Omniglot from 88.0% to 93.8% compared to competing approaches

[Screenshot from the paper: Figure 1 (Matching Networks architecture) and the opening of Section 2 (Model), which states that given a small support set S the model defines a classifier c_S, i.e. a mapping S → c_S(·), with P(y|x, S) parameterised by a neural network, together with a training strategy tailored to one-shot learning from S.]

Page 3: Matching networks for one shot learning

AGENDA

n Introduction
n Related work
⁃ One-shot learning
⁃ Attention mechanisms
n Matching Networks
n Experiments
⁃ Omniglot
⁃ ImageNet
⁃ Penn Treebank

Page 4: Matching networks for one shot learning

Supervised Learning

n Learn a correspondence between training data and labels
⁃ Requires a large labeled dataset for training (e.g. CIFAR-10 [Krizhevsky+, 2009]: 6,000 images per class)
⁃ It is hard for classifiers to learn new concepts from little data

[Figure: CIFAR-10 example. Training phase: a classifier is trained on labelled examples of {airplane, automobile, bird, cat, deer}. Predicting phase: the classifier assigns one of those same labels to new examples.]

https://www.cs.toronto.edu/~kriz/cifar.html

Page 5: Matching networks for one shot learning

One-shot Learning

n Learn a concept from one or only a few training examples
⁃ A classifier can be (pre-)trained on datasets whose labels are not used in the predicting phase

[Figure: in the (pre-)training phase a classifier is trained on one set of CIFAR-10 classes; in the predicting phase (one-shot learning phase) a classifier must handle a disjoint set of new classes from only a few labelled examples.]

https://www.cs.toronto.edu/~kriz/cifar.html

Page 6: Matching networks for one shot learning


One-shot Learning

n  Task: N-way k-shot learning

[Figure: the CIFAR-10 labels are split into two disjoint sets, one forming the training task T and the other the testing task T'.]

• Separate labels for training and testing
• None of the labels used in the testing phase (one-shot learning phase) are used in the training phase

https://www.cs.toronto.edu/~kriz/cifar.html

Page 7: Matching networks for one shot learning


One-shot Learning

n  Task: N-way k-shot learning

[Figure: same T / T' label-split diagram as the previous slide.]

• T' is used for one-shot learning
• T can be used freely for training (e.g. multiclass classification)

https://www.cs.toronto.edu/~kriz/cifar.html

Page 8: Matching networks for one shot learning


One-shot Learning

n  Task: N-way k-shot learning

[Figure: same T / T' diagram, now with L': a label set drawn by sampling N labels from T'. Here L' = {automobile, cat, deer}.]

• In this figure, L' has 3 classes, thus "3-way k-shot learning"

https://www.cs.toronto.edu/~kriz/cifar.html

Page 9: Matching networks for one shot learning


One-shot Learning

n  Task: N-way k-shot learning

[Figure: same T / T' diagram, now showing L' = {automobile, cat, deer} (sampling N labels from T'), a support set S' (sampling k examples from L'), and a query x (sampling 1 example from L').]

• Task: classify the query x into the 3 classes {automobile, cat, deer} using the support set

https://www.cs.toronto.edu/~kriz/cifar.html
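The episode construction described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the `dataset` structure (a dict from class name to a list of examples) and the convention of holding out one extra example per sampled class as the query pool are assumptions made for the example, not something the slide specifies.

```python
import random

def sample_episode(dataset, n_way=3, k_shot=1):
    """dataset: dict mapping a class name to a list of examples (assumed structure)."""
    # L': sample N labels from the task's label pool (e.g. the classes of T')
    label_set = random.sample(sorted(dataset), n_way)
    support_set, query_pool = [], []
    for label in label_set:
        idx = random.sample(range(len(dataset[label])), k_shot + 1)
        # S': k labelled examples per sampled class (one common convention)
        support_set += [(dataset[label][i], label) for i in idx[:k_shot]]
        # hold one extra example out as a candidate query for this class
        query_pool.append((dataset[label][idx[k_shot]], label))
    # x: a single query drawn from one of the N sampled classes
    query = random.choice(query_pool)
    return label_set, support_set, query

# e.g.: L, S, (x, y) = sample_episode(test_classes, n_way=3, k_shot=1)
```

With n_way=3 and k_shot=1 this reproduces the 3-way 1-shot setting illustrated above.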

Page 10: Matching networks for one shot learning


Related Work (One-shot Learning)

n Convolutional Siamese Network [Koch+, 2015]
⁃ Learn an image representation with a siamese neural network
⁃ Reuse features from the network for one-shot learning


[Figure: two CNNs with shared weights embed a pair of images; the model predicts whether the pair shows the same class ("Same?").]

Page 11: Matching networks for one shot learning


Related Work (One-shot Learning)

n Memory-Augmented Neural Networks (MANN) [Santoro+, 2016]
⁃ Quickly encode and retrieve new information using an external memory, inspired by the Neural Turing Machine

Page 12: Matching networks for one shot learning


Related Work (One-shot Learning)

n Siamese Learnet [Bertinetto+, NIPS2016]
⁃ Learn the parameters of a network that incorporates domain-specific information from a few examples

[Figure 1 from Bertinetto+ 2016: the proposed architectures predict the parameters of a network from a single exemplar, replacing static convolutions with dynamic convolutions. The siamese learnet predicts the parameters of an embedding function applied to both inputs, whereas the single-stream learnet predicts the parameters of a function applied to the other input.]

Page 13: Matching networks for one shot learning


Related Work (Attention Mechanism)

n Sequence to Sequence with Attention [Bahdanau+, 2014]
⁃ Attend to the word in the source sentence that is relevant to generating the next target word

[Screenshot from Bahdanau+ 2014, Sec. 3.1: the decoder defines p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i), with a distinct context vector c_i for each target word and s_i = f(s_{i-1}, y_{i-1}, c_i). The context vector is a weighted sum of the encoder annotations, c_i = Σ_j α_ij h_j, where α_ij = exp(e_ij) / Σ_k exp(e_ik) and e_ij = a(s_{i-1}, h_j) is an alignment model scoring how well the inputs around position j match the output at position i. Figure 3 of the paper visualizes the resulting soft alignments α_ij between source (English) and target (French) words.]
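As a rough illustration of the attention step summarized above, here is a minimal NumPy sketch of Eqs. (5)-(6). The additive form v^T tanh(W1 s + W2 h) used for the alignment model a, and all names and dimensions, are assumptions made for the example rather than the paper's exact parameterization.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(s_prev, H, W1, W2, v):
    """s_prev: decoder state s_{i-1}; H: encoder annotations h_1..h_Tx, shape (Tx, d)."""
    # e_ij = a(s_{i-1}, h_j), here an assumed additive form v^T tanh(W1 s + W2 h)
    e = np.array([v @ np.tanh(W1 @ s_prev + W2 @ h) for h in H])
    alpha = softmax(e)   # Eq. (6): attention weights over source positions
    c = alpha @ H        # Eq. (5): context vector c_i as a weighted sum of annotations
    return c, alpha

# toy usage with made-up dimensions
rng = np.random.default_rng(0)
d = 4
W1, W2, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
c_i, alpha_i = attend(rng.normal(size=d), rng.normal(size=(6, d)), W1, W2, v)
```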

Page 14: Matching networks for one shot learning


Related Work (Attention Mechanism)

n Pointer Networks [Vinyals+, 2015]
⁃ Generate the output sequence using a distribution over the dictionary of inputs

[Screenshot from Vinyals+ 2015: Figure 1 contrasts sequence-to-sequence (fixed output dictionary) with Ptr-Net, where a content-based attention over the encoder states is used directly as the output distribution: u^i_j = v^T tanh(W1 e_j + W2 d_i) and p(C_i | C_1, ..., C_{i-1}, P) = softmax(u^i), so the "pointer" softmax has dictionary size equal to the length of the input.]
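A hedged sketch of the Ptr-Net output step follows; it differs from the previous attention sketch in that the softmax over input positions is returned directly as the output distribution p(C_i | ...) instead of being used to blend a context vector. Names and shapes are assumptions for the example.

```python
import numpy as np

def pointer_step(d_i, E, W1, W2, v):
    """d_i: decoder state; E: encoder states e_1..e_n, shape (n, d)."""
    # u^i_j = v^T tanh(W1 e_j + W2 d_i), one score per input position
    u = np.array([v @ np.tanh(W1 @ e_j + W2 @ d_i) for e_j in E])
    p = np.exp(u - u.max())
    # the softmax itself is the prediction: a "pointer" over the n input elements
    return p / p.sum()
```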

Page 15: Matching networks for one shot learning


Related Work (Attention Mechanism)

n Sequence to Sequence for Sets [Vinyals+, ICLR2016]
⁃ Handle input sets using an extension of the seq2seq framework: the Read-Process-and-Write model

[Screenshot from Vinyals+ ICLR2016, Sec. 4.2-4.3 (the Read-Process-and-Write model): a reading block embeds each set element x_i into a memory vector m_i; a process block, an LSTM without inputs, repeatedly attends over the memories: q_t = LSTM(q*_{t-1}), e_{i,t} = f(m_i, q_t), a_{i,t} = exp(e_{i,t}) / Σ_j exp(e_{j,t}), r_t = Σ_i a_{i,t} m_i, q*_t = [q_t, r_t]. Because the readout is a content-based attention over the set, permuting the m_i has no effect, so the final state q*_T is a permutation-invariant embedding of the input set.]
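A toy sketch of the process block is given below to make the permutation invariance concrete. The plain tanh recurrence standing in for the input-less LSTM and the dot-product choice for f(m_i, q_t) are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def process_block(M, W, T=3):
    """M: memory vectors m_i for the set elements, shape (n, d); W: recurrence weights, shape (d, 2d)."""
    n, d = M.shape
    q_star = np.zeros(2 * d)                   # q*_0
    for _ in range(T):
        q = np.tanh(W @ q_star)                # Eq. (3), with a toy recurrence instead of an LSTM
        e = M @ q                              # Eq. (4): f(m_i, q_t) chosen here as a dot product
        a = np.exp(e - e.max())
        a /= a.sum()                           # Eq. (5): attention over the set elements
        r = a @ M                              # Eq. (6): content-based readout r_t
        q_star = np.concatenate([q, r])        # Eq. (7): q*_t = [q_t, r_t]
    return q_star                              # invariant to permutations of the rows of M
```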

Page 16: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n Motivation
⁃ For one-shot learning it is important to learn rapidly from new examples while retaining performance on common examples
•  Simple parametric models such as deep classifiers need to be re-optimized to handle new examples
•  Non-parametric models such as k-nearest neighbors require no optimization, but performance depends on the chosen metric
⁃ It could therefore be effective to train an end-to-end nearest-neighbor-based classifier

Page 17: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n Train the classifier through one-shot learning episodes, mirroring the test-time setup

[Figure: same T / T' diagram. For training, a label set L = {dog, horse, ship} is drawn by sampling N labels from T; a support set S (sampling k examples from L) and a batch B (sampling b examples from L) are then drawn.]

https://www.cs.toronto.edu/~kriz/cifar.html

Page 18: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n System Overview
⁃ Embedding functions f and g are parameterized as a simple CNN (e.g. VGG or Inception) or as the fully conditional embedding functions described later

[Figure 1: Matching Networks architecture. The query x is embedded by f (conditioned on the support set, f(x, S)), each support example x_i is embedded by g(x_i), and an attention mechanism a over these embeddings produces the output distribution.]

Our model in its simplest form computes a probability over y as follows:

P(y|x, S) = Σ_{i=1}^{k} a(x, x_i) y_i    (1)

where x_i, y_i are the inputs and corresponding label distributions from the support set S = {(x_i, y_i)}_{i=1}^{k}, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on X × X, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest x_i from x according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to 'k − b'-nearest neighbours (although this requires an extension to the attention mechanism that we describe in Section 2.1.2). Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention mechanism and the y_i act as values bound to the corresponding keys x_i, much like a hash table. In this case we can understand this as a particular kind of associative memory where, given an input, we "point" to the corresponding example in the support set, retrieving its label. Hence the functional form defined by the classifier c_S(x) is very flexible and can adapt easily to any new support set.

2.1.1 The Attention Kernel

Equation 1 relies on choosing a(·,·), the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with common attention models and kernel functions) is to use the softmax over the cosine distance c, i.e.,

a(x, x_i) = e^{c(f(x), g(x_i))} / Σ_{j=1}^{k} e^{c(f(x), g(x_j))}

with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed x and x_i. In our experiments we shall see examples where f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG [22] or Inception [24]) or a simple form word embedding for language tasks (see Section 4).

We note that, though related to metric learning, the classifier defined by Equation 1 is discriminative. For a given support set S and sample to classify x, it is enough for x to be sufficiently aligned with pairs (x', y') ∈ S such that y' = y and misaligned with the rest. This kind of loss is also related to methods such as Neighborhood Component Analysis (NCA) [18], triplet loss [9] or large margin nearest neighbor [28].

However, the objective that we are trying to optimize is precisely aligned with multi-way, one-shot classification, and thus we expect it to perform better than its counterparts. Additionally, the loss is simple and differentiable so that one can find the optimal parameters in an "end-to-end" fashion.

2.1.2 Full Context Embeddings

The main novelty of our model lies in reinterpreting a well studied framework (neural networks with external memories) to do one-shot learning. Closely related to metric learning, the embedding functions f and g act as a lift to feature space X to achieve maximum accuracy through the classification function described in eq. 1.

Despite the fact that the classification strategy is fully conditioned on the whole support set through P(·|x, S), the embeddings on which we apply the cosine similarity to "attend", "point" or simply compute the nearest neighbor are myopic in the sense that each element x_i gets embedded by g(x_i) independently of other elements in the support set S. Furthermore, S should be able to modify how we embed the test image x through f.

We propose embedding the elements of the set through a function which takes as input the full set S in addition to x_i, i.e. g becomes g(x_i, S). Thus, as a function of the whole support set S, g can modify how to embed x_i. This could be useful when some element x_j is very close to x_i, in which case it may be beneficial to change the function with which we embed x_i; some evidence of this is discussed in Section 4. We use a bidirectional Long-Short Term Memory (LSTM) [8] to encode x_i in the context of the support set S, considered as a sequence (see Appendix).

The second issue, making f depend on x and S, can be fixed via an LSTM with read-attention over the whole set S, whose inputs are equal to f'(x) (f' is an embedding function, e.g. a CNN).
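As a concrete illustration, a minimal sketch of Eq. (1) with the cosine-softmax attention kernel of Section 2.1.1 might look as follows; `f_embed` and `g_embed` stand in for the embedding networks, and representing labels as integer class ids is an assumption made for the example.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def matching_net_predict(x, support, f_embed, g_embed, n_classes):
    """support: list of (x_i, y_i) pairs with integer labels y_i; returns P(y | x, S)."""
    fx = f_embed(x)
    # cosine similarities c(f(x), g(x_i)) between the query and each support example
    scores = np.array([cosine(fx, g_embed(xi)) for xi, _ in support])
    a = np.exp(scores - scores.max())
    a /= a.sum()                      # attention a(x, x_i): softmax over the support set
    P = np.zeros(n_classes)
    for (_, yi), ai in zip(support, a):
        P[yi] += ai                   # Eq. (1): P(y|x, S) = sum_i a(x, x_i) * y_i
    return P

# e.g. with toy identity "embeddings":
# P = matching_net_predict(q, [(v0, 0), (v1, 1), (v2, 2)], lambda z: z, lambda z: z, 3)
```

Because the attention weights are differentiable in the embedding parameters, the whole classifier can be trained end-to-end with the episodic procedure described on the previous slides.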


Figure 1: Matching Networks architecture


train it by showing only a few examples per class, switching the task from minibatch to minibatch, much like how it will be tested when presented with a few examples of a new task.

Besides our contributions in defining a model and training criterion amenable for one-shot learning, we contribute by the definition of two new tasks that can be used to benchmark other approaches on both ImageNet and small scale language modeling. We hope that our results will encourage others to work on this challenging problem.

We organized the paper by first defining and explaining our model whilst linking its several components to related work. Then in the following section we briefly elaborate on some of the related work to the task and our model. In Section 4 we describe both our general setup and the experiments we performed, demonstrating strong results on one-shot learning on a variety of tasks and setups.

2 Model

Our non-parametric approach to solving one-shot learning is based on two components which we describe in the following subsections. First, our model architecture follows recent advances in neural networks augmented with memory (as discussed in Section 3). Given a (small) support set S, our model defines a function c_S (or classifier) for each S, i.e. a mapping S → c_S(·). Second, we employ a training strategy which is tailored for one-shot learning from the support set S.

2.1 Model Architecture

In recent years, many groups have investigated ways to augment neural network architectures with external memories and other components that make them more "computer-like". We draw inspiration from models such as sequence to sequence (seq2seq) with attention [2], memory networks [29] and pointer networks [27].

In all these models, a neural attention mechanism, often fully differentiable, is defined to access (or read) a memory matrix which stores useful information to solve the task at hand. Typical uses of this include machine translation, speech recognition, or question answering. More generally, these architectures model P(B|A) where A and/or B can be a sequence (like in seq2seq models), or, more interestingly for us, a set [26].

Our contribution is to cast the problem of one-shot learning within the set-to-set framework [26]. The key point is that when trained, Matching Networks are able to produce sensible test labels for unobserved classes without any changes to the network. More precisely, we wish to map from a (small) support set of k examples of input-label pairs S = {(x_i, y_i)}_{i=1}^{k} to a classifier c_S(x) which, given a test example x, defines a probability distribution over outputs y. Here, x could be an image, and y a distribution over possible visual classes. We define the mapping S → c_S(x) to be P(y|x, S), where P is parameterised by a neural network. Thus, when given a new support set of examples S' from which to one-shot learn, we simply use the parametric neural network defined by P to make predictions about the appropriate label distribution y for each test example x: P(y|x, S').

[Figure 1: Matching Networks architecture. Support set examples x_i (with labels y_i) are embedded by g; the query x is embedded by f.]

Page 19: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n  The Attention Kernel
⁃  Calculate a softmax over the cosine distance between f(x) and g(x_i)
•  Similar to a nearest-neighbour calculation
⁃  Train the network using a cross-entropy loss

19

Figure 1: Matching Networks architecture


[Figure annotations: query x → f(x, S); support set examples x_i → g(x_i); attention a; c: cosine distance]

Our model in its simplest form computes a probability over y as follows:

P(y|x, S) = \sum_{i=1}^{k} a(x, x_i) y_i    (1)

where x_i, y_i are the inputs and corresponding label distributions from the support set S = {(x_i, y_i)}_{i=1}^{k}, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on X × X, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest x_i from x according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to 'k − b'-nearest neighbours (although this requires an extension to the attention mechanism that we describe in Section 2.1.2). Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention mechanism and the y_i act as values bound to the corresponding keys x_i, much like a hash table. In this case we can understand this as a particular kind of associative memory where, given an input, we "point" to the corresponding example in the support set, retrieving its label. Hence the functional form defined by the classifier c_S(x) is very flexible and can adapt easily to any new support set.
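To make eq. 1 concrete, here is a minimal sketch (not from the paper) that assumes the attention weights a(x, x_i) have already been computed; it just shows that the prediction is a convex combination of the one-hot support labels. The names `attention` and `support_labels` and the episode sizes are illustrative assumptions.

```python
import torch

k, n_classes = 25, 5                                        # assumed episode sizes
attention = torch.softmax(torch.randn(k), dim=0)            # a(x, x_i): non-negative, sums to 1
support_labels = torch.eye(n_classes)[torch.randint(0, n_classes, (k,))]  # y_i as one-hot rows

# Eq. (1): P(y|x, S) = sum_i a(x, x_i) y_i, a linear combination of the support labels.
p_y = attention @ support_labels                             # shape (n_classes,), sums to 1
print(p_y)
```

With a set to a constant on the b nearest neighbours and zero elsewhere, the same line reduces to the 'k − b'-nearest-neighbour rule mentioned above.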

2.1.1 The Attention Kernel

Equation 1 relies on choosing a(·, ·), the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with common attention models and kernel functions) is to use the softmax over the cosine distance c, i.e., a(x, x_i) = e^{c(f(x), g(x_i))} / \sum_{j=1}^{k} e^{c(f(x), g(x_j))} with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed x and x_i. In our experiments we shall see examples where f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG [22] or Inception [24]) or a simple form word embedding for language tasks (see Section 4).

We note that, though related to metric learning, the classifier defined by Equation 1 is discriminative. For a given support set S and sample to classify x, it is enough for x to be sufficiently aligned with pairs (x', y') ∈ S such that y' = y and misaligned with the rest. This kind of loss is also related to methods such as Neighborhood Component Analysis (NCA) [18], triplet loss [9] or large margin nearest neighbor [28].

However, the objective that we are trying to optimize is precisely aligned with multi-way, one-shot classification, and thus we expect it to perform better than its counterparts. Additionally, the loss is simple and differentiable so that one can find the optimal parameters in an "end-to-end" fashion.
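The following is a minimal sketch of this attention kernel, a hedged illustration rather than the authors' code: `embed_f` and `embed_g` are linear layers standing in for the embedding networks f and g, and the dimensions, variable names and random data are assumptions.

```python
import torch
import torch.nn.functional as F

d_in, d_emb, k, n_classes = 64, 32, 25, 5

embed_f = torch.nn.Linear(d_in, d_emb)   # stand-in for f (a CNN or word embedding in the paper)
embed_g = torch.nn.Linear(d_in, d_emb)   # stand-in for g (potentially f = g)

x = torch.randn(d_in)                                       # query example
support_x = torch.randn(k, d_in)                            # support inputs x_i
support_y = torch.eye(n_classes)[torch.randint(0, n_classes, (k,))]  # one-hot labels y_i
target = torch.randint(0, n_classes, (1,))                  # true class of the query

# a(x, x_i): softmax over the cosine similarity c(f(x), g(x_i))
f_x = embed_f(x).unsqueeze(0).expand(k, -1)                 # (k, d_emb)
cos = F.cosine_similarity(f_x, embed_g(support_x), dim=1)   # (k,)
a = torch.softmax(cos, dim=0)

# Eq. (1) followed by a cross-entropy loss on the query, trainable end-to-end.
p_y = a @ support_y                                         # (n_classes,)
loss = F.nll_loss(torch.log(p_y + 1e-8).unsqueeze(0), target)
loss.backward()
```

Swapping the linear layers for the CNN or word-embedding encoders used in the paper leaves the attention and loss computation unchanged.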

2.1.2 Full Context Embeddings

The main novelty of our model lies in reinterpreting a well studied framework (neural networks with external memories) to do one-shot learning. Closely related to metric learning, the embedding functions f and g act as a lift to feature space X to achieve maximum accuracy through the classification function described in eq. 1.

Despite the fact that the classification strategy is fully conditioned on the whole support set through P(·|x, S), the embeddings on which we apply the cosine similarity to "attend", "point" or simply compute the nearest neighbor are myopic in the sense that each element x_i gets embedded by g(x_i) independently of other elements in the support set S. Furthermore, S should be able to modify how we embed the test image x through f.

We propose embedding the elements of the set through a function which takes as input the full set S in addition to x_i, i.e. g becomes g(x_i, S). Thus, as a function of the whole support set S, g can modify how to embed x_i. This could be useful when some element x_j is very close to x_i, in which case it may be beneficial to change the function with which we embed x_i; some evidence of this is discussed in Section 4. We use a bidirectional Long-Short Term Memory (LSTM) [8] to encode x_i in the context of the support set S, considered as a sequence (see Appendix).

The second issue, to make f depend on x and S, can be fixed via an LSTM with read-attention over the whole set S, whose inputs are equal to f'(x) (f' is an embedding function, e.g. a CNN).



To do so, we define the following recurrence over "processing" steps k, following work from [26]:

\hat{h}_k, c_k = LSTM(f'(x), [h_{k-1}, r_{k-1}], c_{k-1})    (2)

h_k = \hat{h}_k + f'(x)    (3)

r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i)) \, g(x_i)    (4)

a(h_{k-1}, g(x_i)) = e^{h_{k-1}^T g(x_i)} / \sum_{j=1}^{|S|} e^{h_{k-1}^T g(x_j)}    (5)

noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as "content" based attention. We do K steps of "reads", so f(x, S) = h_K where h_k is as described in eq. 3.
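Below is a minimal sketch of this read-attention LSTM (eqs. 2–5) written with torch.nn.LSTMCell; the dimensions, the zero initial states, and the choice to keep the concatenation [h, r] as the cell's hidden state (taking the first half of ĥ as h_k, a size detail the paper leaves open) are all assumptions of this illustration, not the authors' code.

```python
import torch

d, K, S_size = 32, 5, 25                       # embedding size, processing steps, |S| (assumed)

f_prime_x = torch.randn(d)                     # f'(x), e.g. CNN features of the query
g_S = torch.randn(S_size, d)                   # g(x_i) for every support element

# LSTM whose recurrent state carries the concatenation [h_{k-1}, r_{k-1}] (eq. 2).
cell = torch.nn.LSTMCell(input_size=d, hidden_size=2 * d)

h = torch.zeros(d)                             # h_0
r = torch.zeros(d)                             # r_0
c = torch.zeros(1, 2 * d)                      # cell state c_0

for _ in range(K):
    state = torch.cat([h, r]).unsqueeze(0)                     # [h_{k-1}, r_{k-1}]
    h_hat, c = cell(f_prime_x.unsqueeze(0), (state, c))        # eq. 2
    h = h_hat[0, :d] + f_prime_x                               # eq. 3: skip connection
    attn = torch.softmax(g_S @ h, dim=0)                       # eq. 5: content-based attention
    r = attn @ g_S                                             # eq. 4: read-out over g(S)

f_x_S = h                                      # f(x, S) = h_K
print(f_x_S.shape)
```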

2.2 Training Strategy

In the previous subsection we described Matching Networks which map a support set to a classification function, S → c(x). We achieve this via a modification of the set-to-set paradigm augmented with attention, with the resulting mapping being of the form P_θ(·|x, S), noting that θ are the parameters of the model (i.e. of the embedding functions f and g described previously).

The training procedure has to be chosen carefully so as to match inference at test time. Our model has to perform well with support sets S' which contain classes never seen during training.

More specifically, let us define a task T as a distribution over possible label sets L. Typically we consider T to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few examples per class (e.g., up to 5). In this case, a label set L sampled from a task T, L ∼ T, will typically have 5 to 25 examples.

To form an "episode" to compute gradients and update our model, we first sample L from T (e.g., L could be the label set {cats, dogs}). We then use L to sample the support set S and a batch B (i.e., both S and B are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error predicting the labels in the batch B conditioned on the support set S. This is a form of meta-learning since the training procedure explicitly learns to learn from a given support set to minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:

\theta = \arg\max_\theta E_{L \sim T} \left[ E_{S \sim L, B \sim L} \left[ \sum_{(x,y) \in B} \log P_\theta(y|x, S) \right] \right]    (6)

Training θ with eq. 6 yields a model which works well when sampling S' ∼ T' from a different distribution of novel labels. Crucially, our model does not need any fine tuning on the classes it has never seen due to its non-parametric nature. Obviously, as T' diverges far from the T from which we sampled to learn θ, the model will not work; we belabor this point further in Section 4.1.2.
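As an illustration of this episodic objective, here is a hedged sketch of one training step. `data_by_class` (a dict mapping each class to a list of example tensors), `matching_net_log_prob` (a function computing log P_θ(y|x, S) via eqs. 1–5) and `optimizer` are hypothetical stand-ins, not part of the paper.

```python
import random

def train_episode(data_by_class, matching_net_log_prob, optimizer,
                  n_way=5, k_shot=1, queries_per_class=2):
    """One episode of eq. 6: sample L ~ T, then a support set S and a batch B from L,
    and minimise the negative log-likelihood of B conditioned on S."""
    label_set = random.sample(sorted(data_by_class), n_way)        # L ~ T, e.g. {cats, dogs, ...}

    support, batch = [], []
    for cls_idx, cls in enumerate(label_set):
        examples = random.sample(data_by_class[cls], k_shot + queries_per_class)
        support += [(x, cls_idx) for x in examples[:k_shot]]        # S
        batch += [(x, cls_idx) for x in examples[k_shot:]]          # B

    # Cross-entropy over the batch, conditioned on the support set (meta-learning step).
    loss = -sum(matching_net_log_prob(x, y, support) for x, y in batch) / len(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```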

3 Related Work

3.1 Memory Augmented Neural Networks

A recent surge of models which go beyond "static" classification of fixed vectors onto their classes has reshaped current research and industrial applications alike. This is most notable in the massive adoption of LSTMs [8] in a variety of tasks such as speech [7], translation [23, 2] or learning programs [4, 27]. A key component which allowed for more expressive models was the introduction of "content" based attention in [2], and "computer-like" architectures such as the Neural Turing Machine [4] or Memory Networks [29]. Our work takes the metalearning paradigm of [21], where an LSTM learnt to learn quickly from data presented sequentially, but we treat the data as a set. The one-shot learning task we defined on the Penn Treebank [15] relates to evaluation techniques and models presented in [6], and we discuss this in Section 4.


Page 20: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n  The Fully Conditional Embedding g
⁃  Embed x_i in consideration of the support set S

[Diagram: each support example x_i is embedded by g'(x_i) (g': a neural network, e.g., VGG or Inception), the sequence of embeddings is run through a bidirectional LSTM, and a skip connection (+) adds g'(x_i) back to produce g(x_i, S).]


Appendix

A Model Description

In this section we fully specify the models which condition the embedding functions f and g on the whole support set S. Much previous work has fully described similar mechanisms, which is why we left the precise details for this appendix.

A.1 The Fully Conditional Embedding f

As described in section 2.1.2, the embedding function for an example x in the batch B is as follows:

f(x, S) = attLSTM(f'(x), g(S), K)

where f' is a neural network (e.g., VGG or Inception, as described in the main text). We define K to be the number of "processing" steps following work from [26] from their "Process" block. g(S) represents the embedding function g applied to each element x_i from the set S.

Thus, the state after k processing steps is as follows:

\hat{h}_k, c_k = LSTM(f'(x), [h_{k-1}, r_{k-1}], c_{k-1})    (3)

h_k = \hat{h}_k + f'(x)    (4)

r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i)) \, g(x_i)    (5)

a(h_{k-1}, g(x_i)) = softmax(h_{k-1}^T g(x_i))    (6)

noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as "content" based attention, and the softmax in eq. 6 normalizes w.r.t. g(x_i). The read-out r_{k-1} from g(S) is concatenated to h_{k-1}. Since we do K steps of "reads", attLSTM(f'(x), g(S), K) = h_K where h_k is as described in eq. 3.

A.2 The Fully Conditional Embedding g

In section 2.1.2 we described the encoding function for the elements in the support set S, g(x_i, S), as a bidirectional LSTM. More precisely, let g'(x_i) be a neural network (similar to f' above, e.g. a VGG or Inception model). Then we define g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i) with:

\overrightarrow{h}_i, \overrightarrow{c}_i = LSTM(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})

\overleftarrow{h}_i, \overleftarrow{c}_i = LSTM(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})

where, as above, LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. Note that the recursion for \overleftarrow{h} starts from i = |S|. As in eq. 3, we add a skip connection between input and outputs.
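Below is a minimal sketch of this fully conditional embedding g, an illustration under assumed dimensions rather than the authors' code: the support embeddings g'(x_i) are treated as a sequence, run through a bidirectional LSTM, and added back through the skip connection.

```python
import torch

d, S_size = 32, 25                               # embedding size and |S| (assumed)
g_prime = torch.randn(S_size, d)                 # g'(x_i) for each support element (e.g. CNN features)

# Bidirectional LSTM over the support set viewed as a sequence.
bilstm = torch.nn.LSTM(input_size=d, hidden_size=d, bidirectional=True, batch_first=True)
out, _ = bilstm(g_prime.unsqueeze(0))            # out: (1, S_size, 2*d), forward and backward states

h_fwd, h_bwd = out[0, :, :d], out[0, :, d:]
g_xi_S = h_fwd + h_bwd + g_prime                 # g(x_i, S): forward + backward states + skip connection
print(g_xi_S.shape)                              # (S_size, d)
```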

B ImageNet Class Splits

Here we define the two class splits used in our full ImageNet experiments; these classes were excluded for training during our one-shot experiments described in section 4.1.2.

L_rand = n01498041, n01537544, n01580077, n01592084, n01632777, n01644373, n01665541, n01675722, n01688243, n01729977, n01775062, n01818515, n01843383, n01883070, n01950731, n02002724, n02013706, n02092339, n02093256, n02095314, n02097130, n02097298,


Page 21: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n  The Fully Conditional Embedding g
⁃  Embed x_i in consideration of the support set S

g’

Figure 1: Matching Networks architecture

train it by showing only a few examples per class, switching the task from minibatch to minibatch,much like how it will be tested when presented with a few examples of a new task.

Besides our contributions in defining a model and training criterion amenable for one-shot learning,we contribute by the definition of tasks that can be used to benchmark other approaches on bothImageNet and small scale language modeling. We hope that our results will encourage others to workon this challenging problem.

We organized the paper by first defining and explaining our model whilst linking its several compo-nents to related work. Then in the following section we briefly elaborate on some of the related workto the task and our model. In Section 4 we describe both our general setup and the experiments weperformed, demonstrating strong results on one-shot learning on a variety of tasks and setups.

2 Model

Our non-parametric approach to solving one-shot learning is based on two components which wedescribe in the following subsections. First, our model architecture follows recent advances in neuralnetworks augmented with memory (as discussed in Section 3). Given a (small) support set S, ourmodel defines a function c

S

(or classifier) for each S, i.e. a mapping S ! c

S

(.). Second, we employa training strategy which is tailored for one-shot learning from the support set S.

2.1 Model Architecture

In recent years, many groups have investigated ways to augment neural network architectures withexternal memories and other components that make them more “computer-like”. We draw inspirationfrom models such as sequence to sequence (seq2seq) with attention [2], memory networks [29] andpointer networks [27].

In all these models, a neural attention mechanism, often fully differentiable, is defined to access (orread) a memory matrix which stores useful information to solve the task at hand. Typical uses ofthis include machine translation, speech recognition, or question answering. More generally, thesearchitectures model P (B|A) where A and/or B can be a sequence (like in seq2seq models), or, moreinterestingly for us, a set [26].

Our contribution is to cast the problem of one-shot learning within the set-to-set framework [26].The key point is that when trained, Matching Networks are able to produce sensible test labels forunobserved classes without any changes to the network. More precisely, we wish to map from a(small) support set of k examples of image-label pairs S = {(x

i

, y

i

)}ki=1 to a classifier c

S

(x) which,given a test example x, defines a probability distribution over outputs y. We define the mappingS ! c

S

(x) to be P (y|x, S) where P is parameterised by a neural network. Thus, when given a

2

Figure 1: Matching Networks architecture


Appendix

A Model Description

In this section we fully specify the models which condition the embedding functions f and g on the whole support set S. Much previous work has fully described similar mechanisms, which is why we left the precise details for this appendix.

A.1 The Fully Conditional Embedding f

As described in section 2.1.2, the embedding function for an example x in the batch B is as follows:

f(x, S) = attLSTM(f'(x), g(S), K)

where f' is a neural network (e.g., VGG or Inception, as described in the main text). We define K to be the number of “processing” steps following work from [26] from their “Process” block. g(S) represents the embedding function g applied to each element x_i from the set S.

Thus, the state after k processing steps is as follows:

\hat{h}_k, c_k = \mathrm{LSTM}(f'(x), [h_{k-1}, r_{k-1}], c_{k-1})    (3)
h_k = \hat{h}_k + f'(x)    (4)
r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i)) \, g(x_i)    (5)
a(h_{k-1}, g(x_i)) = \mathrm{softmax}(h_{k-1}^{T} g(x_i))    (6)

noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as “content” based attention, and the softmax in eq. 6 normalizes w.r.t. g(x_i). The read-out r_{k-1} from g(S) is concatenated to h_{k-1}. Since we do K steps of “reads”, attLSTM(f'(x), g(S), K) = h_K where h_k is as described in eq. 3.
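As a reading aid (not the authors' code), the following is a minimal PyTorch-style sketch of this attLSTM read loop. f_prime_x stands for f'(x), g_support for the stacked g(x_i) of the support set, and the concatenation [h_{k-1}, r_{k-1}] of eq. 3 is folded in additively so that a standard LSTMCell can be reused; all names and sizes are assumptions.

import torch
import torch.nn.functional as F
from torch import nn


class FullyConditionalF(nn.Module):
    """Sketch of f(x, S) = attLSTM(f'(x), g(S), K), eqs. (3)-(6)."""

    def __init__(self, dim: int, K: int):
        super().__init__()
        self.K = K
        # one LSTM cell reused for the K "processing" steps
        self.cell = nn.LSTMCell(input_size=dim, hidden_size=dim)

    def forward(self, f_prime_x, g_support):
        # f_prime_x: (batch, dim) query embedding; g_support: (|S|, dim)
        batch, dim = f_prime_x.shape
        h = torch.zeros(batch, dim)
        c = torch.zeros(batch, dim)
        r = torch.zeros(batch, dim)
        for _ in range(self.K):
            # eq. (3): the paper passes [h_{k-1}, r_{k-1}] as the hidden state;
            # here the read-out is added so the shapes match a plain LSTMCell.
            h_hat, c = self.cell(f_prime_x, (h + r, c))
            h = h_hat + f_prime_x                        # eq. (4): skip connection
            # eqs. (5)-(6): content-based attention over the support embeddings
            attn = F.softmax(h @ g_support.t(), dim=-1)  # (batch, |S|)
            r = attn @ g_support                         # read-out r_k
        return h                                         # f(x, S) = h_K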

A.2 The Fully Conditional Embedding g

In section 2.1.2 we described the encoding function for the elements in the support set S, g(x_i, S), as a bidirectional LSTM. More precisely, let g'(x_i) be a neural network (similar to f' above, e.g. a VGG or Inception model). Then we define g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i) with:

\overrightarrow{h}_i, \overrightarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1})
\overleftarrow{h}_i, \overleftarrow{c}_i = \mathrm{LSTM}(g'(x_i), \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1})

where, as in above, LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. Note that the recursion for \overleftarrow{h} starts from i = |S|. As in eq. 3, we add a skip connection between input and outputs.
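Again as a reading aid rather than the authors' implementation, a minimal PyTorch-style sketch of g: run a bidirectional LSTM over the g'(x_i) of the (arbitrarily ordered) support set and add the skip connection of A.2; names and sizes are assumptions.

import torch
from torch import nn


class FullyConditionalG(nn.Module):
    """Sketch of g(x_i, S) = h_fwd_i + h_bwd_i + g'(x_i)."""

    def __init__(self, dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=dim, hidden_size=dim, bidirectional=True)

    def forward(self, g_prime_support):
        # g_prime_support: (|S|, dim) -- one g'(x_i) row per support example
        seq = g_prime_support.unsqueeze(1)            # (|S|, batch=1, dim)
        out, _ = self.bilstm(seq)                     # (|S|, 1, 2 * dim)
        h_fwd, h_bwd = out[:, 0, :].chunk(2, dim=-1)  # forward / backward halves
        return h_fwd + h_bwd + g_prime_support        # skip connection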

B ImageNet Class Splits

Here we define the two class splits used in our full ImageNet experiments – these classes were excluded for training during our one-shot experiments described in section 4.1.2.

L_rand = n01498041, n01537544, n01580077, n01592084, n01632777, n01644373, n01665541, n01675722, n01688243, n01729977, n01775062, n01818515, n01843383, n01883070, n01950731, n02002724, n02013706, n02092339, n02093256, n02095314, n02097130, n02097298,

g’: neural network (e.g., VGG or Inception)


Embed each support example x_i into a vector using g' (g': a neural network such as VGG or Inception).
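For illustration only, a tiny stand-in for g' (in the paper g' is a VGG/Inception-style CNN; the layer sizes and the 84x84 input below are arbitrary assumptions):

import torch
from torch import nn

# minimal CNN playing the role of g': image -> fixed-size embedding vector
g_prime = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 64),
)
x_i = torch.randn(1, 3, 84, 84)   # one support image (hypothetical size)
embedding = g_prime(x_i)          # g'(x_i): a (1, 64) vector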

Page 22: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

- The Fully Conditional Embedding g
  - Embed each support example while taking the whole support set S into account


g’: neural network (e.g., VGG or Inception)


Feed the embeddings g'(x_i) into a Bi-LSTM (g': a neural network such as VGG or Inception).

Page 23: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

- The Fully Conditional Embedding g
  - Embed each support example while taking the whole support set S into account


g’: neural network (e.g., VGG or Inception)


Let g(x_i, S) be the sum of g'(x_i) and the forward and backward outputs of the Bi-LSTM: g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i).

Page 24: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

- The Fully Conditional Embedding f
  - Embed the query example while taking the whole support set S into account


[Figure: the fully conditional embedding f. The query x is embedded with f' and processed by an LSTM for K steps; at each step a content-based attention over the support embeddings g(x_i) yields the weighted-sum read-out r_{k-1} = \sum_i a(h_{k-1}, g(x_i)) g(x_i), and the final output is f(x, S) = h_K.]

so, we define the following recurrence over “processing” steps k, following work from [26]:

\hat{h}_k, c_k = \mathrm{LSTM}(f'(x), [h_{k-1}, r_{k-1}], c_{k-1})    (2)
h_k = \hat{h}_k + f'(x)    (3)
r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i)) \, g(x_i)    (4)
a(h_{k-1}, g(x_i)) = e^{h_{k-1}^{T} g(x_i)} / \sum_{j=1}^{|S|} e^{h_{k-1}^{T} g(x_j)}    (5)

noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as “content” based attention. We do K steps of “reads”, so f(x, S) = h_K where h_k is as described in eq. 3.

2.2 Training Strategy

In the previous subsection we described Matching Networks which map a support set to a classification function, S → c(x). We achieve this via a modification of the set-to-set paradigm augmented with attention, with the resulting mapping being of the form P_\theta(\cdot|x, S), noting that \theta are the parameters of the model (i.e. of the embedding functions f and g described previously).

The training procedure has to be chosen carefully so as to match inference at test time. Our model has to perform well with support sets S' which contain classes never seen during training.

More specifically, let us define a task T as a distribution over possible label sets L. Typically we consider T to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few examples per class (e.g., up to 5). In this case, a label set L sampled from a task T, L ~ T, will typically have 5 to 25 examples.

To form an “episode” to compute gradients and update our model, we first sample L from T (e.g., L could be the label set {cats, dogs}). We then use L to sample the support set S and a batch B (i.e., both S and B are labelled examples of cats and dogs). The Matching Net is then trained to minimise the error predicting the labels in the batch B conditioned on the support set S. This is a form of meta-learning since the training procedure explicitly learns to learn from a given support set to minimise a loss over a batch. More precisely, the Matching Nets training objective is as follows:

\theta = \arg\max_{\theta} \; E_{L \sim T} \Big[ E_{S \sim L, B \sim L} \Big[ \sum_{(x, y) \in B} \log P_\theta(y \,|\, x, S) \Big] \Big].    (6)

Training \theta with eq. 6 yields a model which works well when sampling S' ~ T' from a different distribution of novel labels. Crucially, our model does not need any fine tuning on the classes it has never seen due to its non-parametric nature. Obviously, as T' diverges far from the T from which we sampled to learn \theta, the model will not work – we belabor this point further in Section 4.1.2.
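To make the episodic objective of eq. 6 concrete, here is a hedged sketch of how an episode could be sampled and scored; dataset (a dict mapping class label to a list of example tensors), model and the n-way/k-shot sizes are hypothetical stand-ins, not the authors' pipeline.

import random
import torch
import torch.nn.functional as F


def sample_episode(dataset, n_way=5, k_shot=1, batch_per_class=5):
    # L ~ T: pick a small label set, then S ~ L and B ~ L from those classes
    label_set = random.sample(list(dataset.keys()), n_way)
    support, batch = [], []
    for cls_idx, cls in enumerate(label_set):
        examples = random.sample(dataset[cls], k_shot + batch_per_class)
        support += [(x, cls_idx) for x in examples[:k_shot]]
        batch += [(x, cls_idx) for x in examples[k_shot:]]
    return support, batch


def episode_loss(model, support, batch):
    # model(support, xs) is assumed to return P(y | x, S) as class probabilities;
    # minimising the negative log-likelihood over B maximises eq. 6's inner sum.
    xs = torch.stack([x for x, _ in batch])
    ys = torch.tensor([y for _, y in batch])
    probs = model(support, xs)                       # (|B|, n_way)
    return F.nll_loss(torch.log(probs + 1e-8), ys)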

3 Related Work

3.1 Memory Augmented Neural Networks

A recent surge of models which go beyond “static” classification of fixed vectors onto their classes has reshaped current research and industrial applications alike. This is most notable in the massive adoption of LSTMs [8] in a variety of tasks such as speech [7], translation [23, 2] or learning programs [4, 27]. A key component which allowed for more expressive models was the introduction of “content” based attention in [2], and “computer-like” architectures such as the Neural Turing Machine [4] or Memory Networks [29]. Our work takes the metalearning paradigm of [21], where an LSTM learnt to learn quickly from data presented sequentially, but we treat the data as a set. The one-shot learning task we defined on the Penn Treebank [15] relates to evaluation techniques and models presented in [6], and we discuss this in Section 4.


Page 25: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

- The Fully Conditional Embedding f
  - Embed the query example while taking the whole support set S into account


(Figure: Fully Conditional Embedding f. The query x is embedded by f', then fed to an LSTM that attends over the support embeddings g(x_i), producing h1.)

so, we define the following recurrence over “processing” steps k, following work from [26]:

ĥ_k, c_k = LSTM(f'(x), [h_{k−1}, r_{k−1}], c_{k−1})   (2)

h_k = ĥ_k + f'(x)   (3)

r_{k−1} = Σ_{i=1}^{|S|} a(h_{k−1}, g(x_i)) g(x_i)   (4)

a(h_{k−1}, g(x_i)) = e^{h_{k−1}^T g(x_i)} / Σ_{j=1}^{|S|} e^{h_{k−1}^T g(x_j)}   (5)

noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as “content” based attention. We do K steps of “reads”, so f(x, S) = h_K where h_k is as described in eq. 3.
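As a rough sketch of this recurrence, the NumPy code below implements eqs. (2)–(5) for a single query. The embedding dimension, random weights and helper names (lstm_step, attention_read, fully_conditional_embedding) are illustrative assumptions, and the precomputed f'(x) and g(x_i) vectors stand in for the actual embedding networks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 64                                        # embedding dimension (assumed)
rng = np.random.default_rng(0)

# Stand-ins for the embeddings: f'(x) for the query, g(x_i) for a support set of size 5.
f_prime_x = rng.standard_normal(d)
g_support = rng.standard_normal((5, d))

# LSTM parameters mapping [f'(x); h_{k-1}; r_{k-1}] (size 3d) to gates of size d.
W = {gate: rng.standard_normal((d, 3 * d)) * 0.05 for gate in "ifog"}
b = {gate: np.zeros(d) for gate in "ifog"}

def lstm_step(x, h, r, c):
    """One step of LSTM(f'(x), [h_{k-1}, r_{k-1}], c_{k-1}) as in eq. (2)."""
    z = np.concatenate([x, h, r])
    i = sigmoid(W["i"] @ z + b["i"])
    f = sigmoid(W["f"] @ z + b["f"])
    o = sigmoid(W["o"] @ z + b["o"])
    g = np.tanh(W["g"] @ z + b["g"])
    c_new = f * c + i * g
    h_hat = o * np.tanh(c_new)                # output after the output gate
    return h_hat, c_new

def attention_read(h, g_support):
    """Eqs. (4)-(5): softmax of h^T g(x_i) over S, then a weighted sum of the g(x_i)."""
    logits = g_support @ h
    a = np.exp(logits - logits.max())
    a = a / a.sum()
    return a @ g_support                      # r_k

def fully_conditional_embedding(f_prime_x, g_support, K=5):
    """f(x, S) = h_K, iterating eqs. (2)-(5) for K read steps."""
    h, r, c = np.zeros(d), np.zeros(d), np.zeros(d)
    for _ in range(K):
        h_hat, c = lstm_step(f_prime_x, h, r, c)
        h = h_hat + f_prime_x                 # eq. (3): skip connection from f'(x)
        r = attention_read(h, g_support)      # eqs. (4)-(5)
    return h

print(fully_conditional_embedding(f_prime_x, g_support).shape)   # (64,)
```

The skip connection in eq. (3) keeps f'(x) present in every h_k, so each attention read in eq. (5) compares the support embeddings against a state that stays grounded in the query.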




The first step h1 = LSTM(f'(x), [h0, r0], c0) + f'(x) is calculated without using S, since the initial states h0, r0 and c0 do not yet depend on the support set.

Page 26: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n  The Fully Conditional Embedding f ⁃  Embed the query x in consideration of the support set S





Calculate the relevance between h1 and each support embedding g(x_i): a(h1, g(x_i)) = softmax(h1^T g(x_i)).

Page 27: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n  The Fully Conditional Embedding f ⁃  Embed the query x in consideration of the support set S





r1 is a weighted sum of the support embeddings g(x_i), weighted according to their relevance a(h1, g(x_i)) to h1: r1 = Σ_{i=1}^{|S|} a(h1, g(x_i)) g(x_i).
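For concreteness, a tiny worked example of this read step; the support embeddings and h1 below are made-up numbers, purely for illustration.

```python
import numpy as np

g = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                    # g(x_1), g(x_2), g(x_3)
h1 = np.array([2.0, 0.0])

logits = g @ h1                               # h1^T g(x_i)  ->  [2.0, 0.0, 2.0]
a = np.exp(logits) / np.exp(logits).sum()     # softmax attention, eq. (5)
r1 = a @ g                                    # weighted sum of the g(x_i), eq. (4)
print(a.round(3), r1.round(3))                # a ~ [0.468, 0.063, 0.468], r1 ~ [0.937, 0.532]
```

The two support embeddings most similar to h1 dominate the weights, so r1 is pulled towards them.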

Page 28: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n  The Fully Conditional Embedding f ⁃  Embed the query x in consideration of the support set S

g


[Figure: the query x is embedded by f' and fed to an LSTM; its output is added ('+') to f'(x) to give h_1, which attends over the support embeddings g(x_i).]

so, we define the following recurrence over "processing" steps k, following work from [26]:

\hat{h}_k, c_k = \mathrm{LSTM}(f'(x), [h_{k-1}, r_{k-1}], c_{k-1})    (2)

h_k = \hat{h}_k + f'(x)    (3)

r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i)    (4)

a(h_{k-1}, g(x_i)) = e^{h_{k-1}^{\top} g(x_i)} \Big/ \sum_{j=1}^{|S|} e^{h_{k-1}^{\top} g(x_j)}    (5)

noting that LSTM(x, h, c) follows the same LSTM implementation defined in [23] with x the input, h the output (i.e., cell after the output gate), and c the cell. a is commonly referred to as "content" based attention. We do K steps of "reads", so f(x, S) = h_K where h_k is as described in eq. 3.
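A small NumPy sketch of the K attention "read" steps in eqs. (2)–(5). The SimpleLSTMCell, the zero initialisation of h_0, r_0 and c_0, and all dimensions are assumptions made for illustration; this is not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class SimpleLSTMCell:
    """Minimal LSTM cell; a stand-in for the LSTM in eq. (2), not the paper's implementation."""
    def __init__(self, input_dim, rec_dim, hidden_dim, rng):
        scale = 1.0 / np.sqrt(input_dim + rec_dim)
        self.W = rng.normal(0.0, scale, size=(4 * hidden_dim, input_dim + rec_dim))
        self.b = np.zeros(4 * hidden_dim)

    def __call__(self, x, rec, c):
        z = self.W @ np.concatenate([x, rec]) + self.b
        i, f, o, g = np.split(z, 4)
        sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # cell update
        h_new = sigmoid(o) * np.tanh(c_new)                # output after the output gate
        return h_new, c_new

def fully_conditional_embedding_f(fprime_x, g_support, K, cell):
    """Eqs. (2)-(5): K attention 'read' steps over the support embeddings g(x_i).

    fprime_x  : (d,)     the query embedding f'(x)
    g_support : (|S|, d) the support embeddings g(x_i)
    Returns f(x, S) = h_K.
    """
    d = fprime_x.shape[0]
    h, r, c = np.zeros(d), np.zeros(d), np.zeros(d)        # h_0, r_0, c_0 assumed zero
    for _ in range(K):
        h_hat, c = cell(fprime_x, np.concatenate([h, r]), c)   # eq. (2)
        h = h_hat + fprime_x                                   # eq. (3), skip connection
        attn = softmax(g_support @ h)                          # eq. (5), content-based attention
        r = attn @ g_support                                   # eq. (4), read vector
    return h

# Toy usage (shapes only; d, |S| and K are arbitrary):
rng = np.random.default_rng(0)
d, n_support, K = 64, 5, 3
cell = SimpleLSTMCell(input_dim=d, rec_dim=2 * d, hidden_dim=d, rng=rng)
f_xS = fully_conditional_embedding_f(rng.normal(size=d), rng.normal(size=(n_support, d)), K, cell)
```

Here the recurrent input to the cell is the concatenation [h_{k-1}, r_{k-1}], as in eq. (2), while the skip connection in eq. (3) keeps the query embedding f'(x) present in every h_k.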

2.2 Training Strategy

In the previous subsection we described Matching Networks which map a support set to a classification function, S → c(x). We achieve this via a modification of the set-to-set paradigm augmented with attention, with the resulting mapping being of the form P_θ(·|x, S), noting that θ are the parameters of the model (i.e. of the embedding functions f and g described previously).

The training procedure has to be chosen carefully so as to match inference at test time. Our model has to perform well with support sets S′ which contain classes never seen during training.

More specifically, let us define a task T as a distribution over possible label sets L. Typically we consider T to uniformly weight all data sets of up to a few unique classes (e.g., 5), with a few examples per class (e.g., up to 5). In this case, a label set L sampled from a task T, L ∼ T, will typically have 5 to 25 examples.





[Figure: the attention a(h_1, g(x_i)) produces the weighted sum r_1 over the support embeddings, which is combined ('+') with the LSTM output; h_1 is calculated using S.]

Page 29: Matching networks for one shot learning


Matching Networks [Vinyals+, NIPS2016]

n  The Fully Conditional Embedding f
⁃  Embed the query x in consideration of the support set S



[Figure: the LSTM f' is unrolled for K processing steps; at step k it reads r_{k-1} = \sum_i a(h_{k-1}, g(x_i))\, g(x_i) from the support set, combines it ('+') with h_{k-1} and f'(x), and outputs h_k; the final embedding is f(x, S) = h_K.]






Let f(x, S) = h_K be the output after K steps.

Page 30: Matching networks for one shot learning


Experimental Settings

n  Datasets
⁃  Image classification sets
•  Omniglot [Lake+, 2011]
•  ImageNet [Deng+, 2009]
⁃  Language modeling
•  Penn Treebank [Marcus+, 1993]

ref. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

4.1.3 One-Shot Language Modeling

We also introduce a new one-shot language task which is analogous to those examined for images. The task is as follows: given a query sentence with a missing word in it, and a support set of sentences which each have a missing word and a corresponding 1-hot label, choose the label from the support set that best matches the query sentence. Here we show a single example, though note that the words on the right are not provided and the labels for the set are given as 1-hot-of-5 vectors.

1. an experimental vaccine can alter the immune response of people infected with the aids virus a <blank_token> u.s. scientist said. → prominent
2. the show one of five new nbc <blank_token> is the second casualty of the three networks so far this fall. → series
3. however since eastern first filed for chapter N protection march N it has consistently promised to pay creditors N cents on the <blank_token>. → dollar
4. we had a lot of people who threw in the <blank_token> today said <unk> ellis a partner in benjamin jacobson & sons a specialist in trading ual stock on the big board. → towel
5. it's not easy to roll out something that <blank_token> and make it pay mr. jacob says. → comprehensive

Query: in late new york trading yesterday the <blank_token> was quoted at N marks down from N marks late friday and at N yen down from N yen late friday. → dollar

Sentences were taken from the Penn Treebank dataset [15]. On each trial, we make sure that the set and batch are populated with sentences that are non-overlapping. This means that we do not use words with very low frequency counts; e.g. if there is only a single sentence for a given word we do not use this data since the sentence would need to be in both the set and the batch. As with the image tasks, each trial consisted of a 5-way choice between the classes available in the set. We used a batch size of 20 throughout the sentence matching task and varied the set size across k=1,2,3. We ensured that the same number of sentences were available for each class in the set. We split the words into a randomly sampled 9000 for training and 1000 for testing, and we used the standard test set to report results. Thus, neither the words nor the sentences used during test time had been seen during training.
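A rough sketch of how one such 5-way sentence-matching trial could be assembled under these constraints. The `sentences_by_word` mapping, the toy data, and the blanking by simple string replacement are invented for illustration; this is not the paper's data pipeline.

```python
import random

def make_language_trial(sentences_by_word, k=1, n_way=5, rng=None):
    """Build one trial: a support set of blanked sentences with 1-hot-of-n_way label indices
    and a query sentence, keeping the support and query sentences non-overlapping."""
    rng = rng or random.Random()
    # Only keep words with enough distinct sentences for k support examples plus one query.
    candidates = [w for w, sents in sentences_by_word.items() if len(sents) >= k + 1]
    words = rng.sample(candidates, n_way)
    query_word = rng.choice(words)
    support, query = [], None
    for label, w in enumerate(words):
        n_needed = k + (1 if w == query_word else 0)
        sents = rng.sample(sentences_by_word[w], n_needed)
        for s in sents[:k]:
            support.append((s.replace(w, "<blank_token>"), label))
        if w == query_word:
            query = (sents[k].replace(w, "<blank_token>"), label)
    return support, query

# Toy usage with made-up sentences (two per word so that k=1 trials are possible):
toy = {
    "dollar": ["pay creditors N cents on the dollar .", "the dollar was quoted at N marks ."],
    "series": ["one of five new nbc series this fall .", "the series was cancelled ."],
    "towel": ["threw in the towel today .", "he reached for a towel ."],
    "vaccine": ["an experimental vaccine can alter the immune response .", "the vaccine failed ."],
    "network": ["the network aired the show .", "a network of partners ."],
}
support, query = make_language_trial(toy, k=1)
```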

We compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30] trained on all the words. In this setup, the LSTM has an unfair advantage as it is not doing one-shot learning but seeing all the data – thus, this should be taken as an upper bound. To do so, we examined a similar setup wherein a sentence was presented to the model with a single word filled in with 5 different possible words (including the correct answer). For each of these 5 sentences the model gave a log-likelihood and the max of these was taken to be the choice of the model.

As with the other 5-way choice tasks, chance performance on this task was 20%. The LSTM language model oracle achieved an upper bound of 72.8% accuracy on the test set. Matching Networks with a simple encoding model achieve 32.4%, 36.1%, 38.2% accuracy on the task with k = 1, 2, 3 examples in the set, respectively. Future work should explore combining parametric models such as an LSTM-LM with non-parametric components such as the Matching Networks explored here.

Two related tasks are the CNN QA test of entity prediction from news articles [5], and the Children's Book Test (CBT) [6]. In the CBT for example, a sequence of sentences from a book are provided as context. In the final sentence one of the words, which has appeared in a previous sentence, is missing. The task is to choose the correct word to fill in this blank from a small set of words given as possible answers, all of which occur in the preceding sentences. In our sentence matching task the sentences provided in the set are randomly drawn from the PTB corpus and are related to the sentences in the query batch only by the fact that they share a word. In contrast to CBT and the CNN dataset, they provide only a generic rather than specific sequential context.

5 Conclusion

In this paper we introduced Matching Networks, a new neural architecture that, by way of its corresponding training regime, is capable of state-of-the-art performance on a variety of one-shot classification tasks. There are a few key insights in this work. Firstly, one-shot learning is much easier if you train the network to do one-shot learning. Secondly, non-parametric structures in a neural network make it easier for networks to remember and adapt to new training sets in the same tasks. Combining these observations together yields Matching Networks. Further, we have defined new one-shot tasks on ImageNet, a reduced version of ImageNet (for rapid experimentation), and a language modeling task. An obvious drawback of our model is the fact that, as the support set S grows in size, the computation for each gradient update becomes more expensive. Although there are sparse and sampling-based methods to alleviate this, much of our future efforts will concentrate around this limitation. Further, as exemplified in the ImageNet dogs subtask, when the label distribution has obvious biases (such as being fine grained), our model suffers. We feel this is an area with exciting challenges which we hope to keep improving in future work.


Page 31: Matching networks for one shot learning


Experimental Settings (Omniglot)

n  Baseline
⁃  Matching on raw pixels
⁃  Matching on discriminative features from VGG (Baseline classifier)
⁃  MANN
⁃  Convolutional Siamese Network

n  Datasets
⁃  training: 1200 characters
⁃  testing: 423 characters


Page 32: Matching networks for one shot learning


Experimental Results (Omniglot)


n  Fully Conditional Embedding (FCE) did not seem to help much
n  Baseline and Siamese Net were improved with fine-tuning

3.2 Metric Learning

As discussed in Section 2, there are many links between content based attention, kernel based nearest neighbor and metric learning [1]. The most relevant work is Neighborhood Component Analysis (NCA) [18], and the follow up non-linear version [20]. The loss is very similar to ours, except we use the whole support set S instead of pair-wise comparisons which is more amenable to one-shot learning. Follow-up work in the form of deep convolutional siamese [11] networks included much more powerful non-linear mappings. Other losses which include the notion of a set (but use less powerful metrics) were proposed in [28].

Lastly, the work in one-shot learning in [14] was inspirational and also provided us with the invaluable Omniglot dataset – referred to as the "transpose" of MNIST. Other works used zero-shot learning on ImageNet, e.g. [17]. However, there is not much one-shot literature on ImageNet, which we hope to amend via our benchmark and task definitions in the following section.

4 Experiments

In this section we describe the results of many experiments, comparing our Matching Networks model against strong baselines. All of our experiments revolve around the same basic task: an N-way k-shot learning task. Each method is provided with a set of k labelled examples from each of N classes that have not previously been trained upon. The task is then to classify a disjoint batch of unlabelled examples into one of these N classes. Thus random performance on this task stands at 1/N. We compared a number of alternative models, as baselines, to Matching Networks.
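As a sketch of the evaluation protocol just described, accuracy over one-shot episodes could be computed as follows; the `classify` callable and the episode layout are hypothetical placeholders.

```python
def evaluate_n_way_k_shot(classify, episodes):
    """`episodes` is a list of (support, batch) pairs whose classes were never trained upon;
    `classify(support, x)` returns a predicted label for x given the support set."""
    correct = total = 0
    for support, batch in episodes:
        for x, y in batch:
            correct += int(classify(support, x) == y)
            total += 1
    return correct / total  # compare against chance performance of 1/N
```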

Let L′ denote the held-out subset of labels which we only use for one-shot. Unless otherwise specified, training is always on ≠L′ (the labels not in L′), and testing in one-shot mode is on L′.

We ran one-shot experiments on three data sets: two image classification sets (Omniglot [14] and ImageNet [19, ILSVRC-2012]) and one language modeling (Penn Treebank). The experiments on the three data sets comprise a diverse set of qualities in terms of complexity, sizes, and modalities.

4.1 Image Classification Results

For vision problems, we considered four kinds of baselines: matching on raw pixels, matching on discriminative features from a state-of-the-art classifier (Baseline Classifier), MANN [21], and our reimplementation of the Convolutional Siamese Net [11]. The baseline classifier was trained to classify an image into one of the original classes present in the training data set, but excluding the N classes so as not to give it an unfair advantage (i.e., trained to classify classes in ≠L′). We then took this network and used the features from the last layer (before the softmax) for nearest neighbour matching, a strategy commonly used in computer vision [3] which has achieved excellent results across many tasks. Following [11], the convolutional siamese nets were trained on a same-or-different task of the original training data set and then the last layer was used for nearest neighbour matching.
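A minimal sketch of this nearest-neighbour matching baseline, assuming a feature extractor `extract_features` that stands in for the trained classifier's last layer (before the softmax):

```python
import numpy as np

def cosine_nn_classify(extract_features, support, x_query):
    """Baseline matcher: nearest neighbour in feature space under cosine similarity."""
    feats = np.stack([extract_features(x) for x, _ in support])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    q = extract_features(x_query)
    q = q / np.linalg.norm(q)
    sims = feats @ q                            # cosine similarity to each support example
    return support[int(np.argmax(sims))][1]     # label of the closest support example
```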

Model                             Matching Fn   Fine Tune   5-way 1-shot   5-way 5-shot   20-way 1-shot   20-way 5-shot
PIXELS                            Cosine        N           41.7%          63.2%          26.7%           42.6%
BASELINE CLASSIFIER               Cosine        N           80.0%          95.0%          69.5%           89.1%
BASELINE CLASSIFIER               Cosine        Y           82.3%          98.4%          70.6%           92.0%
BASELINE CLASSIFIER               Softmax       Y           86.0%          97.6%          72.9%           92.3%
MANN (NO CONV) [21]               Cosine        N           82.8%          94.9%          –               –
CONVOLUTIONAL SIAMESE NET [11]    Cosine        N           96.7%          98.4%          88.0%           96.5%
CONVOLUTIONAL SIAMESE NET [11]    Cosine        Y           97.3%          98.4%          88.1%           97.0%
MATCHING NETS (OURS)              Cosine        N           98.1%          98.9%          93.8%           98.5%
MATCHING NETS (OURS)              Cosine        Y           97.9%          98.7%          93.5%           98.7%

Table 1: Results on the Omniglot dataset.


Page 33: Matching networks for one shot learning


Experimental Settings (ImageNet)

n  Baseline
⁃  Matching on raw pixels
⁃  Matching on discriminative features from InceptionV3 (Baseline classifier)

n  Datasets
⁃  miniImageNet (size: 84x84)
•  training: 80 classes
•  testing: 20 classes
⁃  randImageNet
•  training: randomly picked classes (882 classes)
•  testing: remaining classes (118 classes)
⁃  dogsImageNet
•  training: all non-dog classes (882 classes)
•  testing: dog classes (118 classes)


Page 34: Matching networks for one shot learning


Experimental Results (miniImageNet)


Figure 2: Example of two 5-way problem instances on ImageNet. The images in the set S′ contain classes never seen during training. Our model makes far fewer mistakes than the Inception baseline.

Table 2: Results on miniImageNet.

Model                   Matching Fn    Fine Tune   5-way 1-shot   5-way 5-shot
PIXELS                  Cosine         N           23.0%          26.6%
BASELINE CLASSIFIER     Cosine         N           36.6%          46.0%
BASELINE CLASSIFIER     Cosine         Y           36.2%          52.2%
BASELINE CLASSIFIER     Softmax        Y           38.4%          51.2%
MATCHING NETS (OURS)    Cosine         N           41.2%          56.2%
MATCHING NETS (OURS)    Cosine         Y           42.4%          58.0%
MATCHING NETS (OURS)    Cosine (FCE)   N           44.2%          57.0%
MATCHING NETS (OURS)    Cosine (FCE)   Y           46.6%          60.0%

1-shot tasks from the training data set, incorporating Full Context Embeddings and our Matching Networks and training strategy.

The results of the randImageNet and dogsImageNet experiments are shown in Table 3. The Inception Oracle (trained on all classes) performs almost perfectly when restricted to 5 classes only, which is not too surprising given its impressive top-1 accuracy. When trained solely on ≠L_rand, Matching Nets improve upon Inception by almost 6% when tested on L_rand, halving the errors. Figure 2 shows two instances of 5-way one-shot learning, where Inception fails. Looking at all the errors, Inception appears to sometimes prefer an image above all others (these images tend to be cluttered like the example in the second column, or more constant in color). Matching Nets, on the other hand, manage to recover from these outliers that sometimes appear in the support set S′.

Matching Nets manage to improve upon Inception on the complementary subset ≠L_dogs (although this setup is not one-shot, as the feature extraction has been trained on these labels). However, on the much more challenging L_dogs subset, our model degrades by 1%. We hypothesize this is due to the fact that the sampled set during training, S, comes from a random distribution of labels (from ≠L_dogs), whereas the testing support set S′ from L_dogs contains similar classes, more akin to fine grained classification. Thus, we believe that if we adapted our training strategy to sample S from fine grained sets of labels instead of sampling uniformly from the leaves of the ImageNet class tree, improvements could be attained. We leave this as future work.

Table 3: Results on full ImageNet on rand and dogs one-shot tasks. Note that ≠L_rand and ≠L_dogs are sets of classes which are seen during training, but are provided for completeness.

                                                     ImageNet 5-way 1-shot Acc
Model                   Matching Fn      Fine Tune   L_rand    ≠L_rand   L_dogs    ≠L_dogs
PIXELS                  Cosine           N           42.0%     42.8%     41.4%     43.0%
INCEPTION CLASSIFIER    Cosine           N           87.6%     92.6%     59.8%     90.0%
MATCHING NETS (OURS)    Cosine (FCE)     N           93.2%     97.0%     58.8%     96.4%
INCEPTION ORACLE        Softmax (Full)   Y (Full)    ≈ 99%     ≈ 99%     ≈ 99%     ≈ 99%


n  Matching Networks outperformed the baselines
n  Fully Conditional Embedding (FCE) was shown effective in improving the performance on this task

Page 35: Matching networks for one shot learning


Experimental Results (randImageNet, dogsImageNet)



n  Matching Networks outperformed the Inception Classifier on L_rand, but degraded on L_dogs
n  The decrease of the performance on L_dogs might be caused by the different distributions of labels between training and testing
⁃  The training support set comes from a random distribution of labels, whereas the testing one comes from similar (fine-grained) classes

Figure 2: Example of two 5-way problem instance on ImageNet. The images in the set S0 containclasses never seen during training. Our model makes far less mistakes than the Inception baseline.

Table 2: Results on miniImageNet.

Model Matching Fn Fine Tune 5-way Acc1-shot 5-shot

PIXELS Cosine N 23.0% 26.6%BASELINE CLASSIFIER Cosine N 36.6% 46.0%BASELINE CLASSIFIER Cosine Y 36.2% 52.2%BASELINE CLASSIFIER Softmax Y 38.4% 51.2%

MATCHING NETS (OURS) Cosine N 41.2% 56.2%MATCHING NETS (OURS) Cosine Y 42.4% 58.0%MATCHING NETS (OURS) Cosine (FCE) N 44.2% 57.0%MATCHING NETS (OURS) Cosine (FCE) Y 46.6% 60.0%

1-shot tasks from the training data set, incorporating Full Context Embeddings and our MatchingNetworks and training strategy.

The results of the randImageNet and dogsImageNet experiments are shown in Table 3. The InceptionOracle (trained on all classes) performs almost perfectly when restricted to 5 classes only, which isnot too surprising given its impressive top-1 accuracy. When trained solely on 6=L

rand

, MatchingNets improve upon Inception by almost 6% when tested on L

rand

, halving the errors. Figure 2 showstwo instances of 5-way one-shot learning, where Inception fails. Looking at all the errors, Inceptionappears to sometimes prefer an image above all others (these images tend to be cluttered like theexample in the second column, or more constant in color). Matching Nets, on the other hand, manageto recover from these outliers that sometimes appear in the support set S0.

Matching Nets manage to improve upon Inception on the complementary subset 6=Ldogs

(althoughthis setup is not one-shot, as the feature extraction has been trained on these labels). However, on themuch more challenging L

dogs

subset, our model degrades by 1%. We hypothesize this to the factthat the sampled set during training, S, comes from a random distribution of labels (from 6=L

dogs

),whereas the testing support set S0 from L

dogs

contains similar classes, more akin to fine grainedclassification. Thus, we believe that if we adapted our training strategy to samples S from fine grainedsets of labels instead of sampling uniformly from the leafs of the ImageNet class tree, improvementscould be attained. We leave this as future work.

Table 3: Results on full ImageNet on rand and dogs one-shot tasks. Note that 6=Lrand

and 6=Ldogs

are sets of classes which are seen during training, but are provided for completeness.

Model Matching Fn Fine Tune ImageNet 5-way 1-shot AccL

rand

6=Lrand

Ldogs

6=Ldogs

PIXELS Cosine N 42.0% 42.8% 41.4% 43.0%INCEPTION CLASSIFIER Cosine N 87.6% 92.6% 59.8% 90.0%

MATCHING NETS (OURS) Cosine (FCE) N 93.2% 97.0% 58.8% 96.4%INCEPTION ORACLE Softmax (Full) Y (Full) ⇡ 99% ⇡ 99% ⇡ 99% ⇡ 99%

7

Figure 2: Example of two 5-way problem instance on ImageNet. The images in the set S0 containclasses never seen during training. Our model makes far less mistakes than the Inception baseline.

Table 2: Results on miniImageNet.

Model Matching Fn Fine Tune 5-way Acc1-shot 5-shot

PIXELS Cosine N 23.0% 26.6%BASELINE CLASSIFIER Cosine N 36.6% 46.0%BASELINE CLASSIFIER Cosine Y 36.2% 52.2%BASELINE CLASSIFIER Softmax Y 38.4% 51.2%

MATCHING NETS (OURS) Cosine N 41.2% 56.2%MATCHING NETS (OURS) Cosine Y 42.4% 58.0%MATCHING NETS (OURS) Cosine (FCE) N 44.2% 57.0%MATCHING NETS (OURS) Cosine (FCE) Y 46.6% 60.0%


Page 36: Matching networks for one shot learning

Copyright(C)DeNACo.,Ltd.AllRightsReserved.

Experimental Settings (Penn Treebank)

36

[Figure: Matching Networks architecture for this task — each support-set example x_i is embedded by g(x_i), the query x is embedded by f(x, S), and they are combined through the attention a (softmax over cosine distance c) to produce the label y_i.]

Our model in its simplest form computes a probability over y as follows:

P(y \mid x, S) = \sum_{i=1}^{k} a(x, x_i) \, y_i \qquad (1)

where x_i, y_i are the inputs and corresponding label distributions from the support set S = {(x_i, y_i)}_{i=1}^{k}, and a is an attention mechanism which we discuss below. Note that eq. 1 essentially describes the output for a new class as a linear combination of the labels in the support set. Where the attention mechanism a is a kernel on X × X, then (1) is akin to a kernel density estimator. Where the attention mechanism is zero for the b furthest x_i from x according to some distance metric and an appropriate constant otherwise, then (1) is equivalent to 'k − b'-nearest neighbours (although this requires an extension to the attention mechanism that we describe in Section 2.1.2). Thus (1) subsumes both KDE and kNN methods. Another view of (1) is where a acts as an attention mechanism and the y_i act as values bound to the corresponding keys x_i, much like a hash table. In this case we can understand this as a particular kind of associative memory where, given an input, we "point" to the corresponding example in the support set, retrieving its label. Hence the functional form defined by the classifier c_S(x) is very flexible and can adapt easily to any new support set.
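As a concrete illustration of the kNN special case above, here is a minimal NumPy sketch (not code from the paper) of eq. (1) with the hard attention that is zero on the b furthest support points and uniform on the rest; the Euclidean metric and the toy episode are assumptions made purely for illustration.

```python
import numpy as np

def hard_attention(x, support_x, b):
    """Attention that is 0 on the b furthest support points and uniform otherwise,
    so eq. (1) reduces to a (k - b)-nearest-neighbour vote."""
    dists = np.linalg.norm(support_x - x, axis=1)      # Euclidean distance, as an example metric
    keep = np.argsort(dists)[:len(support_x) - b]      # indices of the k - b closest points
    a = np.zeros(len(support_x))
    a[keep] = 1.0 / len(keep)
    return a

def predict(x, support_x, support_y, b=0):
    """P(y | x, S) = sum_i a(x, x_i) y_i, with one-hot labels support_y of shape (k, n_classes)."""
    return hard_attention(x, support_x, b) @ support_y

# toy 5-way, 1-shot episode: 5 support points, one per class
support_x = np.random.randn(5, 4)
support_y = np.eye(5)                                  # one-hot labels
query = support_x[2] + 0.05 * np.random.randn(4)
print(predict(query, support_x, support_y, b=4))       # b = 4 keeps only the nearest neighbour -> class 2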

2.1.1 The Attention Kernel

Equation 1 relies on choosing a(., .), the attention mechanism, which fully specifies the classifier. The simplest form that this takes (and which has very tight relationships with common attention models and kernel functions) is to use the softmax over the cosine distance c, i.e., a(x, x_i) = e^{c(f(x), g(x_i))} / \sum_{j=1}^{k} e^{c(f(x), g(x_j))}, with embedding functions f and g being appropriate neural networks (potentially with f = g) to embed x and x_i. In our experiments we shall see examples where f and g are parameterised variously as deep convolutional networks for image tasks (as in VGG [22] or Inception [24]) or a simple form word embedding for language tasks (see Section 4).
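A minimal NumPy sketch of this attention kernel plugged into eq. (1); for brevity f and g are left as identity functions here, whereas in the paper they are deep networks (e.g. VGG/Inception for images) or word embeddings.

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two vectors
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def attention(x, support_x, f=lambda z: z, g=lambda z: z):
    """a(x, x_i) = softmax over the cosine similarities c(f(x), g(x_i))."""
    scores = np.array([cosine(f(x), g(xi)) for xi in support_x])
    e = np.exp(scores - scores.max())                  # numerically stable softmax
    return e / e.sum()

def matching_predict(x, support_x, support_y):
    """Eq. (1): P(y | x, S) = sum_i a(x, x_i) y_i."""
    return attention(x, support_x) @ support_y

support_x = np.random.randn(5, 4)                      # 5-way, 1-shot support set
support_y = np.eye(5)                                  # one-hot labels
print(matching_predict(support_x[0], support_x, support_y))   # largest weight falls on class 0
```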

We note that, though related to metric learning, the classifier defined by Equation 1 is discriminative. For a given support set S and sample to classify x, it is enough for x to be sufficiently aligned with pairs (x', y') ∈ S such that y' = y, and misaligned with the rest. This kind of loss is also related to methods such as Neighborhood Component Analysis (NCA) [18], triplet loss [9] or large margin nearest neighbor [28].

However, the objective that we are trying to optimize is precisely aligned with multi-way, one-shot classification, and thus we expect it to perform better than its counterparts. Additionally, the loss is simple and differentiable so that one can find the optimal parameters in an "end-to-end" fashion.

2.1.2 Full Context Embeddings

The main novelty of our model lies in reinterpreting a well studied framework (neural networks with external memories) to do one-shot learning. Closely related to metric learning, the embedding functions f and g act as a lift to feature space X to achieve maximum accuracy through the classification function described in eq. 1.

Despite the fact that the classification strategy is fully conditioned on the whole support set through P(.|x, S), the embeddings on which we apply the cosine similarity to "attend", "point" or simply compute the nearest neighbor are myopic in the sense that each element x_i gets embedded by g(x_i) independently of other elements in the support set S. Furthermore, S should be able to modify how we embed the test image x through f.

We propose embedding the elements of the set through a function which takes as input the full set S in addition to x_i, i.e. g becomes g(x_i, S). Thus, as a function of the whole support set S, g can modify how to embed x_i. This could be useful when some element x_j is very close to x_i, in which case it may be beneficial to change the function with which we embed x_i – some evidence of this is discussed in Section 4. We use a bidirectional Long-Short Term Memory (LSTM) [8] to encode x_i in the context of the support set S, considered as a sequence (see Appendix).

The second issue, making f depend on x and S, can be fixed via an LSTM with read-attention over the whole set S, whose inputs are equal to f'(x) (f' is an embedding function, e.g. a CNN).
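A sketch of the first of these two ideas, the full context embedding g(x_i, S): a hypothetical PyTorch module (not the paper's code) in which a bidirectional LSTM reads the per-example embeddings g'(x_i) as a sequence; the skip-connected combination g(x_i, S) = h_i_fwd + h_i_bwd + g'(x_i) is an assumption based on the paper's appendix formulation, which is not included in this excerpt.

```python
import torch
import torch.nn as nn

class FullContextEmbedding(nn.Module):
    """g(x_i, S): re-embed each support example in the context of the whole support set
    by running a bidirectional LSTM over the set, treated as a sequence.
    Assumes the skip-connected form g(x_i, S) = h_i_fwd + h_i_bwd + g'(x_i)."""
    def __init__(self, dim):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)

    def forward(self, g_prime):                         # g_prime: (k, dim), per-example embeddings g'(x_i)
        h, _ = self.bilstm(g_prime.unsqueeze(0))        # (1, k, 2 * dim)
        dim = g_prime.size(1)
        h_fwd, h_bwd = h[0, :, :dim], h[0, :, dim:]     # split forward / backward hidden states
        return h_fwd + h_bwd + g_prime                  # contextual embedding of each x_i

# usage: 5 support embeddings of size 64 coming from some CNN g'
fce = FullContextEmbedding(64)
contextual = fce(torch.randn(5, 64))                    # shape (5, 64)
```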



[Figure: each support-set sentence and the query sentence (e.g. "... virus a", "... new nbc", "... on the", "... the yesterday") is read by an LSTM encoder before matching.]

4.1.3 One-Shot Language Modeling

We also introduce a new one-shot language task which is analogous to those examined for images. The task is as follows: given a query sentence with a missing word in it, and a support set of sentences which each have a missing word and a corresponding 1-hot label, choose the label from the support set that best matches the query sentence. Here we show a single example, though note that the words on the right are not provided and the labels for the set are given as 1-hot-of-5 vectors.

1. an experimental vaccine can alter the immune response of people infected with the aids virus a <blank_token> u.s. scientist said.  →  prominent
2. the show one of five new nbc <blank_token> is the second casualty of the three networks so far this fall.  →  series
3. however since eastern first filed for chapter N protection march N it has consistently promised to pay creditors N cents on the <blank_token>.  →  dollar
4. we had a lot of people who threw in the <blank_token> today said <unk> ellis a partner in benjamin jacobson & sons a specialist in trading ual stock on the big board.  →  towel
5. it's not easy to roll out something that <blank_token> and make it pay mr. jacob says.  →  comprehensive

Query: in late new york trading yesterday the <blank_token> was quoted at N marks down from N marks late friday and at N yen down from N yen late friday.  →  dollar

Sentences were taken from the Penn Treebank dataset [15]. On each trial, we make sure that the set and batch are populated with sentences that are non-overlapping. This means that we do not use words with very low frequency counts; e.g. if there is only a single sentence for a given word we do not use this data since the sentence would need to be in both the set and the batch. As with the image tasks, each trial consisted of a 5-way choice between the classes available in the set. We used a batch size of 20 throughout the sentence matching task and varied the set size across k = 1, 2, 3. We ensured that the same number of sentences were available for each class in the set. We split the words into a randomly sampled 9000 for training and 1000 for testing, and we used the standard test set to report results. Thus, neither the words nor the sentences used during test time had been seen during training.
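A hypothetical sketch of how one such trial could be assembled; `sentences_by_word` (mapping a candidate word to the sentences in which it has been blanked out), the helper name, and the toy corpus are illustrative assumptions, not data structures from the paper.

```python
import random

def sample_episode(sentences_by_word, n_way=5, k_shot=1, queries_per_class=4):
    """Build one trial: a support set of blanked sentences with labels 0..n_way-1
    and a batch of query sentences, with no sentence shared between set and batch."""
    words = random.sample(list(sentences_by_word), n_way)           # the n_way candidate labels
    support, queries = [], []
    for label, w in enumerate(words):
        sents = random.sample(sentences_by_word[w], k_shot + queries_per_class)
        support += [(s, label) for s in sents[:k_shot]]             # k_shot sentences per class in the set
        queries += [(s, label) for s in sents[k_shot:]]             # the rest go to the query batch
    random.shuffle(queries)
    return support, queries

# toy corpus: each candidate word has a handful of sentences with <blank_token> in place of the word
corpus = {w: [f"toy sentence {i} with <blank_token> standing in for {w}" for i in range(6)]
          for w in ["dollar", "series", "towel", "prominent", "comprehensive", "market"]}
support, batch = sample_episode(corpus, n_way=5, k_shot=1, queries_per_class=4)   # 5 support, 20 queries
```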

We compared our one-shot matching model to an oracle LSTM language model (LSTM-LM) [30] trained on all the words. In this setup, the LSTM has an unfair advantage as it is not doing one-shot learning but seeing all the data – thus, this should be taken as an upper bound. To do so, we examined a similar setup wherein a sentence was presented to the model with a single word filled in with 5 different possible words (including the correct answer). For each of these 5 sentences the model gave a log-likelihood and the max of these was taken to be the choice of the model.
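A small sketch of that oracle protocol; `lm_log_likelihood` is a hypothetical callable standing in for the trained LSTM-LM's sentence log-likelihood, not an API from the paper.

```python
def oracle_choice(query_with_blank, candidate_words, lm_log_likelihood):
    """Fill the blank with each of the 5 candidate words, score every filled-in
    sentence with the language model, and return the highest-scoring word."""
    scored = [(lm_log_likelihood(query_with_blank.replace("<blank_token>", w)), w)
              for w in candidate_words]
    return max(scored)[1]

# usage with a dummy scorer that prefers shorter sentences (purely illustrative)
pick = oracle_choice("the <blank_token> was quoted at N marks",
                     ["dollar", "series", "towel", "prominent", "comprehensive"],
                     lambda s: -len(s))
```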

As with the other 5-way choice tasks, chance performance on this task was 20%. The LSTM language model oracle achieved an upper bound of 72.8% accuracy on the test set. Matching Networks with a simple encoding model achieve 32.4%, 36.1%, 38.2% accuracy on the task with k = 1, 2, 3 examples in the set, respectively. Future work should explore combining parametric models such as an LSTM-LM with non-parametric components such as the Matching Networks explored here.

Two related tasks are the CNN QA test of entity prediction from news articles [5], and the Children's Book Test (CBT) [6]. In the CBT, for example, a sequence of sentences from a book is provided as context. In the final sentence one of the words, which has appeared in a previous sentence, is missing. The task is to choose the correct word to fill in this blank from a small set of words given as possible answers, all of which occur in the preceding sentences. In our sentence matching task the sentences provided in the set are randomly drawn from the PTB corpus and are related to the sentences in the query batch only by the fact that they share a word. In contrast to the CBT and CNN datasets, they provide only a generic rather than specific sequential context.

5 Conclusion

In this paper we introduced Matching Networks, a new neural architecture that, by way of its corresponding training regime, is capable of state-of-the-art performance on a variety of one-shot classification tasks. There are a few key insights in this work. Firstly, one-shot learning is much easier if you train the network to do one-shot learning. Secondly, non-parametric structures in a neural network make it easier for networks to remember and adapt to new training sets in the same tasks. Combining these observations together yields Matching Networks. Further, we have defined new one-shot tasks on ImageNet, a reduced version of ImageNet (for rapid experimentation), and a language modeling task. An obvious drawback of our model is the fact that, as the support set S grows in size, the computation for each gradient update becomes more expensive. Although there are sparse and sampling-based methods to alleviate this, much of our future efforts will concentrate around this limitation. Further, as exemplified in the ImageNet dogs subtask, when the label distribution has obvious biases (such as being fine grained), our model suffers. We feel this is an area with exciting challenges which we hope to keep improving in future work.



n  Fill in a blank in a query sentence with a label from the support set

Page 37: Matching networks for one shot learning

Copyright(C)DeNACo.,Ltd.AllRightsReserved.

Experimental Settings and Results (Penn Treebank)

37

n  Baseline
⁃  Oracle LSTM-LM
•  Trained on all the words (not one-shot)
•  Consider this model as an upper bound

n  Datasets
⁃  training: 9000 words
⁃  testing: 1000 words

n  Results

Model            5-way accuracy
                 1-shot    2-shot    3-shot
Matching Nets    32.4%     36.1%     38.2%
Oracle LSTM-LM   (72.8%)   -         -

Page 38: Matching networks for one shot learning

Copyright(C)DeNACo.,Ltd.AllRightsReserved.

Conclusion

n  They proposed Matching Networks: a nearest-neighbor-based approach trained fully end-to-end

n  Key points
⁃  "One-shot learning is much easier if you train the network to do one-shot learning" [Vinyals+, 2016]
⁃  Matching Networks have a non-parametric structure, and thus can rapidly incorporate new examples

n  Findings
⁃  Matching Networks effectively improved performance on Omniglot, miniImageNet and randImageNet, but degraded on dogsImageNet
⁃  One-shot learning with fine-grained sets of labels remains difficult to solve, and could be an exciting challenge in this area

38

Page 39: Matching networks for one shot learning

Copyright(C)DeNACo.,Ltd.AllRightsReserved.

References

n  Matching Networks
⁃  Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.

n  One-shot Learning
⁃  Koch, Gregory. Siamese neural networks for one-shot image recognition. Diss. University of Toronto, 2015.
⁃  Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." Proceedings of The 33rd International Conference on Machine Learning. 2016.
⁃  Bertinetto, Luca, et al. "Learning feed-forward one-shot learners." Advances in Neural Information Processing Systems. 2016.

n  Attention Mechanisms
⁃  Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
⁃  Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in Neural Information Processing Systems. 2015.
⁃  Vinyals, Oriol, Samy Bengio, and Manjunath Kudlur. "Order matters: Sequence to sequence for sets." In ICLR 2016.

39

Page 40: Matching networks for one shot learning

Copyright(C)DeNACo.,Ltd.AllRightsReserved.

References

n  Datasets
⁃  Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features from tiny images." (2009).
⁃  Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
⁃  Lake, Brenden M., et al. "One shot learning of simple visual concepts." Proceedings of the 33rd Annual Conference of the Cognitive Science Society. Vol. 172. 2011.
⁃  Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. "Building a large annotated corpus of English: The Penn Treebank." Computational Linguistics 19.2 (1993): 313-330.

40