
TASK2VEC: Task Embedding for Meta-Learning

Alessandro Achille, UCLA and AWS, [email protected]
Michael Lam, AWS, [email protected]
Rahul Tewari, AWS, [email protected]
Avinash Ravichandran, AWS, [email protected]
Subhransu Maji, UMass and AWS, [email protected]
Charless Fowlkes, UCI and AWS, [email protected]
Stefano Soatto, UCLA and AWS, [email protected]
Pietro Perona, Caltech and AWS, [email protected]

Abstract

We introduce a method to provide vectorial representations of visual classification tasks which can be used to reason about the nature of those tasks and their relations. Given a dataset with ground-truth labels and a loss function defined over those labels, we process images through a "probe network" and compute an embedding based on estimates of the Fisher information matrix associated with the probe network parameters. This provides a fixed-dimensional embedding of the task that is independent of details such as the number of classes and does not require any understanding of the class label semantics. We demonstrate that this embedding is capable of predicting task similarities that match our intuition about semantic and taxonomic relations between different visual tasks (e.g., tasks based on classifying different types of plants are similar). We also demonstrate the practical value of this framework for the meta-task of selecting a pre-trained feature extractor for a new task. We present a simple meta-learning framework for learning a metric on embeddings that is capable of predicting which feature extractors will perform well. Selecting a feature extractor with task embedding obtains performance close to the best available feature extractor, while costing substantially less than exhaustively training and evaluating on all available feature extractors.

1. Introduction

The success of Deep Learning hinges in part on the fact that models learned for one task can be used on other related tasks. Yet, no general framework exists to describe and learn relations between tasks. We introduce the TASK2VEC embedding, a technique to represent tasks as elements of a vector space based on the Fisher Information Matrix. The norm of the embedding correlates with the complexity of the task, while the distance between embeddings captures semantic similarities between tasks (Fig. 1). When other natural distances are available, such as the taxonomical distance in biological classification, we find that the embedding distance correlates positively with it (Fig. 2). Moreover, we introduce an asymmetric distance on tasks which correlates with the transferability between tasks.

Computation of the embedding leverages a duality between network parameters (weights) and outputs (activations) in a deep neural network (DNN): just as the activations of a DNN trained on a complex visual recognition task are a rich representation of the input images, we show that the gradients of the weights relative to a task-specific loss are a rich representation of the task itself. Specifically, given a task defined by a dataset $D = \{(x_i, y_i)\}_{i=1}^N$ of labeled samples, we feed the data through a pre-trained reference convolutional neural network, which we call the "probe network", and compute the diagonal Fisher Information Matrix (FIM) of the network filter parameters to capture the structure of the task (Sect. 2). Since the architecture and weights of the probe network are fixed, the FIM provides a fixed-dimensional representation of the task. We show this embedding encodes the "difficulty" of the task, characteristics of the input domain, and which features of the probe network are useful to solve it (Sect. 2.1).

Our task embedding can be used to reason about the space of tasks and solve meta-tasks. As a motivating example, we study the problem of selecting the best pre-trained feature extractor to solve a new task. This can be particularly valuable when there is insufficient data to train or fine-tune a generic model, and transfer of knowledge is essential. TASK2VEC depends solely on the task, and ignores interactions with the model, which may however play an important role. To address this, we learn a joint task and model embedding, called MODEL2VEC, in such a way that models whose embeddings are close to a task exhibit good performance on the task. We use this to select an expert from a given collection, improving performance relative to fine-tuning a generic model trained on ImageNet and obtaining close to ground-truth optimal selection. We discuss our contribution in relation to prior literature in Sect. 6, after presenting our empirical results in Sect. 5.


[Figure 1: two t-SNE panels, "Task Embeddings" (left) and "Domain Embeddings" (right). The legend lists iNaturalist taxa (Actinopterygii, Amphibia, Arachnida, Aves, Fungi, Insecta, Mammalia, Mollusca, Plantae, Protozoa, Reptilia) and iMaterialist attribute groups (Category, Color, Gender, Material, Neckline, Pants, Pattern, Shoes).]
Figure 1: Task embedding across a large library of tasks (best seen magnified). (Left) t-SNE visualization of the embedding of tasks extracted from the iNaturalist, CUB-200, and iMaterialist datasets. Colors indicate ground-truth grouping of tasks based on taxonomic or semantic types. Notice that the bird classification tasks extracted from CUB-200 embed near the bird classification tasks from iNaturalist, even though the original datasets are different. iMaterialist is well separated from iNaturalist, as it entails very different tasks (clothing attributes). Notice that some tasks of similar type (such as color attributes) cluster together, but attributes of different task types may also mix when the underlying visual semantics are correlated. For example, the tasks of jeans (clothing type), denim (material) and ripped (style) recognition are close in the task embedding. (Right) t-SNE visualization of the domain embeddings (using mean feature activations) for the same tasks. The domain embedding can distinguish iNaturalist tasks from iMaterialist tasks due to differences in the two problem domains. However, the fashion attribute tasks on iMaterialist all share the same domain and only differ in their labels. In this case, the domain embeddings collapse to a region without recovering any sensible structure.


2. Task Embeddings via Fisher Information

Given an observed input $x$ (e.g., an image) and a hidden task variable $y$ (e.g., a label), a deep network is a family of functions $p_w(y|x)$ parametrized by weights $w$, trained to approximate the posterior $p(y|x)$ by minimizing the (possibly regularized) cross-entropy loss $H_{p_w,\hat p}(y|x) = \mathbb{E}_{x,y\sim \hat p}[-\log p_w(y|x)]$, where $\hat p$ is the empirical distribution defined by the training set $D = \{(x_i, y_i)\}_{i=1}^N$. It is useful, especially in transfer learning, to think of the network as composed of two parts: a feature extractor, which computes some representation $z = \phi_w(x)$ of the input data, and a "head," or classifier, which encodes the distribution $p(y|z)$ given the representation $z$.

Not all network weights are equally useful in predicting the task variable: the importance, or "informative content," of a weight for the task can be quantified by considering a perturbation $w' = w + \delta w$ of the weights, and measuring the average Kullback-Leibler (KL) divergence between the original output distribution $p_w(y|x)$ and the perturbed one $p_{w'}(y|x)$. To second-order approximation, this is

$$\mathbb{E}_{x\sim \hat p}\, KL\big(p_{w'}(y|x)\,\|\,p_w(y|x)\big) = \delta w \cdot F\, \delta w + o(\delta w^2),$$

where $F$ is the Fisher information matrix (FIM):

$$F = \mathbb{E}_{x,y\sim \hat p(x)\, p_w(y|x)}\big[\nabla_w \log p_w(y|x)\, \nabla_w \log p_w(y|x)^T\big],$$

that is, the expected covariance of the scores (gradients of the log-likelihood) with respect to the model parameters.

The FIM is a Riemannian metric on the space of probability distributions [7], and provides a measure of the information a particular parameter (weight or feature) contains about the joint distribution $p_w(x, y) = \hat p(x)\, p_w(y|x)$: if the classification performance for a given task does not depend strongly on a parameter, the corresponding entry in the FIM will be small. The FIM is also related to the (Kolmogorov) complexity of a task, a property that can be used to define a computable metric of the learning distance between tasks [3]. Finally, the FIM can be interpreted as an easy-to-compute positive semidefinite upper bound to the Hessian of the cross-entropy loss, and coincides with it at local minima [24]. In particular, "flat minima" correspond to weights that have, on average, low (Fisher) information [5, 13].
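To make the definition concrete, the following PyTorch sketch computes the plain Monte Carlo estimate of the diagonal FIM by sampling labels from the model's own predictive distribution (Sect. 2.1 replaces this direct estimator with a more robust variational one); the function and its loop structure are our illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F_nn

def diagonal_fisher(model, loader, device="cpu"):
    """Monte Carlo estimate of the diagonal of the Fisher information.

    For each input, a label is sampled from the model's own predictive
    distribution p_w(y|x), matching the expectation in the definition
    F = E_{x, y ~ p(x) p_w(y|x)}[s s^T] with s = grad_w log p_w(y|x);
    only the diagonal (the squared score) is accumulated.
    """
    model.eval()
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    count = 0
    for x, _ in loader:  # ground-truth labels are deliberately unused
        for xi in x.to(device):
            logits = model(xi.unsqueeze(0))
            y = torch.distributions.Categorical(logits=logits).sample()
            nll = F_nn.cross_entropy(logits, y)  # -log p_w(y|x)
            model.zero_grad()
            nll.backward()
            for n, p in model.named_parameters():
                if p.grad is not None:
                    fisher[n] += p.grad.detach() ** 2
            count += 1
    return {n: f / count for n, f in fisher.items()}
```

Sampling $y \sim p_w(y|x)$ rather than plugging in the dataset labels is what distinguishes the FIM from the so-called empirical Fisher, which is a different matrix.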


2.1. TASK2VEC embedding using a probe network

While the network activations capture the information in the input image that is needed to infer the image label, the FIM indicates the set of feature maps that are more informative for solving the current task. Following this intuition, we use the FIM to represent the task itself. However, FIMs computed on different networks are not directly comparable. To address this, we use a single "probe" network pre-trained on ImageNet as a feature extractor and re-train only the classifier layer on any given task, which usually can be done efficiently. After training is complete, we compute the FIM for the feature extractor parameters.

Since the full FIM is unmanageably large for rich probe networks based on CNNs, we make two additional approximations. First, we only consider the diagonal entries, which implicitly assumes that correlations between different filters in the probe network are not important. Second, since the weights in each filter are usually not independent, we average the Fisher Information for all weights in the same filter. The resulting representation thus has fixed size, equal to the number of filters in the probe network. We call this embedding method TASK2VEC.
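Continuing the sketch above, the per-filter averaging that yields the fixed-size TASK2VEC vector might look like this (again an illustration under the same assumptions, not the released implementation):

```python
import torch

def task2vec_embedding(fisher, model):
    """Collapse a diagonal Fisher estimate to one value per filter.

    fisher: dict name -> tensor, e.g. the output of diagonal_fisher()
    above. Only convolutional weights (out, in, h, w) contribute, so the
    embedding length is fixed by the probe architecture, not the task.
    """
    components = []
    for name, param in model.named_parameters():
        if param.dim() == 4 and name in fisher:
            # Average the Fisher values of all weights in the same filter.
            components.append(fisher[name].mean(dim=(1, 2, 3)))
    return torch.cat(components)
```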

Robust Fisher computation. Since the FIM is a local quantity, it is affected by the local geometry of the training loss landscape, which is highly irregular in many deep network architectures [21], and may be too noisy when trained with few samples. To avoid this problem, instead of a direct computation, we use a more robust estimator that leverages connections to variational inference. Assume we perturb the weights $\hat w$ of the network with Gaussian noise $N(0, \Lambda)$ with precision matrix $\Lambda$, and we want to find the optimal $\Lambda$ which yields a good expected error while remaining close to an isotropic prior $N(\hat w, \lambda^2 I)$. That is, we want to find the $\Lambda$ that minimizes

$$L(\hat w; \Lambda) = \mathbb{E}_{w\sim N(\hat w,\Lambda)}\big[H_{p_w,\hat p}(y|x)\big] + \beta\, KL\big(N(0,\Lambda)\,\|\,N(0,\lambda^2 I)\big),$$

where $H$ is the cross-entropy loss and $\beta$ controls the weight of the prior. Notice that for $\beta = 1$ this reduces to the Evidence Lower Bound (ELBO) commonly used in variational inference. Approximating to the second order, the optimal value of $\Lambda$ satisfies (see Supplementary Material):

$$\frac{\beta}{2N}\Lambda = F + \frac{\beta\lambda^2}{2N} I.$$

Therefore, $\frac{\beta}{2N}\Lambda \sim F + o(1)$ can be considered as an estimator of the FIM $F$, biased towards the prior $\lambda^2 I$ in the low-data regime instead of being degenerate. In case the task is trivial (the loss is constant or there are too few samples), the embedding will coincide with the prior $\lambda^2 I$, which we will refer to as the trivial embedding. This estimator has the advantage of being easy to compute by directly minimizing the loss $L(\hat w; \Lambda)$ through Stochastic Gradient Variational Bayes [18], while being less sensitive to irregularities of the loss landscape than the direct computation, since the value of the loss depends on the cross-entropy in a neighborhood of $\hat w$ of size $\Lambda^{-1}$. As in the standard Fisher computation, we estimate one parameter per filter, rather than per weight, which in practice means that we constrain $\Lambda_{ii} = \Lambda_{jj}$ whenever $w_i$ and $w_j$ belong to the same filter. In this case, optimization of $L(\hat w; \Lambda)$ can be done efficiently using the local reparametrization trick of [18].
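To illustrate how such an objective can be optimized, here is a minimal single-layer sketch using the reparametrization trick; it treats each row of a linear classifier as one "filter" with a shared precision, which simplifies the per-filter scheme described above. All names are ours, and this is a sketch rather than the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F_nn

def noisy_linear_loss(x, y, w_hat, log_lam, beta, lam0):
    """One stochastic evaluation of L(w; Lambda) for a linear classifier.

    w_hat:   fixed (num_out, num_in) weights; each row plays the role of
             one "filter". log_lam: learnable per-row log-precision.
    Weights are sampled with the reparametrization trick so gradients
    flow to log_lam; the KL term pulls Lambda toward the prior lam0 * I.
    """
    std = torch.exp(-0.5 * log_lam).unsqueeze(1)   # per-row stddev, (out, 1)
    w = w_hat + std * torch.randn_like(w_hat)      # w ~ N(w_hat, Lambda^-1)
    nll = F_nn.cross_entropy(x @ w.t(), y)
    d = w_hat.shape[1]                             # weights sharing a precision
    # KL( N(0, Lambda^-1) || N(0, (1/lam0) I) ), summed over rows
    kl = 0.5 * d * (lam0 * torch.exp(-log_lam) - 1.0
                    + log_lam - math.log(lam0)).sum()
    return nll + beta * kl
```

Optimizing `log_lam` over minibatches with SGD or Adam then yields, through the relation above, the robust per-row estimate $\frac{\beta}{2N}\Lambda$ of the Fisher information.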

2.2. Properties of the TASK2VEC embedding

The task embedding we just defined has a number of useful properties. For illustrative purposes, consider a two-layer sigmoidal network for which an analytic expression can be derived (see Supplementary Materials). The FIM of the feature extractor parameters can be written using the Kronecker product as

$$F = \mathbb{E}_{x,y\sim \hat p(x)\, p_w(y|x)}\big[(y - p)^2 \cdot S \otimes xx^T\big],$$

where $p = p_w(y = 1|x)$ and the matrix $S = ww^T \odot zz^T \odot (1-z)(1-z)^T$ is an element-wise product of classifier weights $w$ and first-layer feature activations $z$. It is informative to compare this expression to an embedding based only on the dataset domain statistics, such as the (non-centered) covariance $C_0 = \mathbb{E}[xx^T]$ of the input data or the covariance $C_1 = \mathbb{E}[zz^T]$ of the feature activations. One could take such statistics as a representative domain embedding, since they only depend on the marginal distribution $p(x)$, in contrast to the FIM task embedding, which depends on the joint distribution $p(x, y)$. These simple expressions highlight some important (and more general) properties of the Fisher embedding we now describe.

Invariance to the label space: The task embedding does not directly depend on the task labels, but only on the predicted distribution $p_w(y|x)$ of the trained model. Information about the ground-truth labels $y$ is encoded in the weights $w$, which are a sufficient statistic of the task [5]. In particular, the task embedding is invariant to permutations of the labels $y$, and has fixed dimension (the number of filters of the feature extractor) regardless of the output space (e.g., $k$-way classification with varying $k$).

Encoding task difficulty: As we can see from the expressions above, if the fit model is very confident in its predictions, $\mathbb{E}[(y - p)^2]$ goes to zero. Hence, the norm of the task embedding $\|F\|$ scales with the difficulty of the task for a given feature extractor $\phi$. Figure 2 (Right) shows that even for more complex models trained on real data, the FIM norm correlates with test performance.

[Figure 2: three panels. (Left) task similarity matrix with a taxonomic color bar; (Center) average top-k taxonomical distance vs. the size k of the neighborhood (0 to 125), comparing the Task2Vec distance to the taxonomical distance; (Right) test error on task (%) vs. the L1 norm of the task embedding.]

Figure 2: Distance between species classification tasks. (Left) Task similarity matrix ordered by hierarchical clustering. Note that the dendrogram produced by the task similarity matches the taxonomic clusters (indicated by the color bar). (Center) For tasks extracted from iNaturalist and CUB, we compare the cosine distance between tasks to their taxonomical distance. As the size of the task embedding neighborhood increases (measured by the number of tasks in the neighborhood), we plot the average taxonomical distance of tasks from the neighborhood center. While the task distance does not perfectly match the taxonomical distance (whose curve is shown in orange), it shows a good correlation. The differences are due both to the fact that taxonomically close species may need very different features to be classified, creating a mismatch between the two notions of distance, and to the fact that for some tasks in iNaturalist too few samples are provided to compute a good embedding. (Right) Correlation between the L1 norm of the task embedding (distance from the origin) and the test error obtained on the task.

Encoding task domain: Data points $x$ that are classified with high confidence, i.e., $p$ close to 0 or 1, will have a lower contribution to the task embedding than points near the decision boundary, since $p(1 - p)$ is maximized at $p = 1/2$. Compare this to the covariance matrix of the data, $C_0$, to which all data points contribute equally. Instead, in TASK2VEC information on the domain is based on data near the decision boundary (task-weighted domain embedding).

Encoding useful features for the task: The FIM depends on the curvature of the loss function, with the diagonal entries capturing the sensitivity of the loss to model parameters. Specifically, in the two-layer model one can see that, if a given feature is uncorrelated with $y$, the corresponding blocks of $F$ are zero. In contrast, a domain embedding based on feature activations of the probe network (e.g., $C_1$) only reflects which features vary over the dataset, without indicating whether they are relevant to the task.

3. Similarity Measures on the Space of Tasks

What metric should be used on the space of tasks? This depends critically on the meta-task we are considering. As a motivation, we concentrate on the meta-task of selecting the pre-trained feature extractor from a set in order to obtain the best performance on a new training task. There are several natural metrics that may be considered for this meta-task. In this work, we mainly consider:

Taxonomic distance. For some tasks there is a natural notion of semantic similarity, for instance defined by sets of categories organized in a taxonomic hierarchy, where each task is classification inside a subtree of the hierarchy (e.g., we may say that classifying breeds of dogs is closer to classification of cats than it is to classification of species of plants). In this setting, we can define

$$D_{tax}(t_a, t_b) = \min_{i\in S_a,\, j\in S_b} d(i, j),$$

where $S_a$, $S_b$ are the sets of categories in tasks $t_a$, $t_b$ and $d(i, j)$ is an ultrametric or graph distance in the taxonomy tree. Notice that this is a proper distance, and in particular it is symmetric.
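To make the definition concrete, here is a small Python sketch that computes $D_{tax}$ on a toy taxonomy stored as child-to-parent pointers; the representation and helper names are ours, chosen for illustration rather than taken from any dataset's actual taxonomy format.

```python
def taxonomic_distance(task_a, task_b, parent):
    """D_tax: minimum tree (graph) distance over pairs of categories.

    task_a, task_b: sets of categories; parent: child -> parent dict
    encoding the taxonomy, with the root mapping to None."""
    def ancestors(node):
        chain, d = {}, 0
        while node is not None:          # map each ancestor to its depth from node
            chain[node] = d
            node, d = parent[node], d + 1
        return chain

    def graph_distance(i, j):
        anc_i = ancestors(i)
        node, d = j, 0
        while node not in anc_i:         # climb from j to the lowest common ancestor
            node, d = parent[node], d + 1
        return d + anc_i[node]           # hops up from j plus hops up from i

    return min(graph_distance(i, j) for i in task_a for j in task_b)
```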

Transfer distance. We define the transfer (or fine-tuning) gain from a task $t_a$ to a task $t_b$ (which we improperly call a distance, though it is not necessarily symmetric or positive) as the relative difference in expected performance between a model trained for task $t_b$ from a fixed initialization (random or pre-trained), and a model fine-tuned for task $t_b$ starting from a solution of task $t_a$:

$$D_{ft}(t_a \to t_b) = \frac{\mathbb{E}[\ell_{a\to b}] - \mathbb{E}[\ell_b]}{\mathbb{E}[\ell_b]},$$

where the expectations are taken over all trainings with the selected architecture, training procedure and network initialization, $\ell_b$ is the final test error obtained by training on task $t_b$ from the chosen initialization, and $\ell_{a\to b}$ is the error obtained instead when starting from a solution to task $t_a$ and then fine-tuning (with the selected procedure) on task $t_b$.

3.1. Symmetric and asymmetric TASK2VEC metrics

By construction, the Fisher embedding on which TASK2VEC is based captures fundamental information about the structure of the task. We may therefore expect the distance between two embeddings to correlate positively with natural metrics on the space of tasks. However, there are two problems with using the Euclidean distance between embeddings: the parameters of the network have different scales, and the norm of the embedding is affected by the complexity of the task and the number of samples used to compute the embedding.

Symmetric TASK2VEC distance. To make the distance computation robust, we propose to use the cosine distance between normalized embeddings:

$$d_{sym}(F_a, F_b) = d_{cos}\!\left(\frac{F_a}{F_a + F_b},\, \frac{F_b}{F_a + F_b}\right),$$

where $d_{cos}$ is the cosine distance, $F_a$ and $F_b$ are the two task embeddings (i.e., the diagonal of the Fisher Information computed on the same probe network), and the division is element-wise. This is a symmetric distance which we expect to capture semantic similarity between two tasks. For example, we show in Fig. 2 that it correlates well with the taxonomical distance between species on iNaturalist.
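In code, this normalized cosine distance is a few lines of numpy (a sketch; the `eps` guard against division by zero is our addition):

```python
import numpy as np

def d_sym(fa, fb, eps=1e-12):
    """Symmetric TASK2VEC distance between two diagonal Fisher embeddings
    fa, fb computed on the same probe network."""
    na = fa / (fa + fb + eps)            # element-wise normalization
    nb = fb / (fa + fb + eps)
    cos = np.dot(na, nb) / (np.linalg.norm(na) * np.linalg.norm(nb) + eps)
    return 1.0 - cos                     # cosine distance
```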

On the other hand, precisely for this reason, this distance is ill-suited for tasks such as model selection, where the (intrinsically asymmetric) transfer distance is more relevant.

Asymmetric TASK2VEC distance. To a first approximation, which considers neither the model nor the training procedure used, positive transfer between two tasks depends both on the similarity between the two tasks and on the complexity of the first. Indeed, pre-training on a general but complex task such as ImageNet often yields a better result than fine-tuning from a close dataset of comparable complexity. In our case, complexity can be measured as the distance from the trivial embedding. This suggests the following asymmetric score, again improperly called a "distance" despite being asymmetric and possibly negative:

$$d_{asym}(t_a \to t_b) = d_{sym}(t_a, t_b) - \alpha\, d_{sym}(t_a, t_0),$$

where $t_0$ is the trivial embedding and $\alpha$ is a hyperparameter. This has the effect of bringing more complex models closer. The hyperparameter $\alpha$ can be selected based on the meta-task. In our experiments, we found that the best value of $\alpha$ ($\alpha = 0.15$ when using a ResNet-34 pretrained on ImageNet as the probe network) is robust to the choice of meta-task.
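The asymmetric score, reusing `d_sym` from the sketch above ($\alpha = 0.15$ is the value reported for a ResNet-34 probe; `f0` is the trivial embedding):

```python
def d_asym(fa, fb, f0, alpha=0.15):
    """Asymmetric TASK2VEC score d(a -> b): similarity to the target minus
    a bonus proportional to the complexity of the source task a, measured
    as its distance from the trivial embedding f0."""
    return d_sym(fa, fb) - alpha * d_sym(fa, f0)
```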

4. MODEL2VEC: task/model co-embedding

By construction, the TASK2VEC distance ignores details of the model and only relies on the task. If we know what task a model was trained on, we can represent the model by the embedding of that task. However, in general we may not have such information (e.g., for black-box models or hand-constructed feature extractors). We may also have multiple models trained on the same task with different performance characteristics. To model the joint interaction between task and model (i.e., architecture and training algorithm), we aim to learn a joint embedding of the two.

We consider for concreteness the problem of learning a joint embedding for model selection. In order to embed models in the task space so that those near a task are likely to perform well on that task, we formulate the following meta-learning problem: Given $k$ models, their MODEL2VEC embeddings are the vectors $m_i = F_i + b_i$, where $F_i$ is the task embedding of the task used to train model $m_i$ (if available, else we set it to zero), and $b_i$ is a learned "model bias" that perturbs the task embedding to account for particularities of the model. We learn $b_i$ by optimizing a $k$-way cross-entropy loss to predict the best model given the task distance (see Supplementary Material):

$$L = \mathbb{E}\big[-\log p(m \mid d_{asym}(t, m_0), \ldots, d_{asym}(t, m_k))\big].$$

After training, given a novel query task $t$, we can then predict the best model for it as $\arg\min_i d_{asym}(t \to m_i)$, that is, the model $m_i$ embedded closest to the query task.
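A sketch of expert selection with these ingredients, reusing `d_sym` and `d_asym` from above; we score each expert with its embedding as the source argument, which is our reading of the source-complexity bonus of Sect. 3.1, and all names are ours:

```python
import numpy as np

def select_expert(task_emb, expert_embs, f0, alpha=0.15):
    """Return the index of the expert embedded closest to the query task.

    Scoring d_asym(expert -> task) makes the alpha term reward experts
    whose embedding m_i = F_i + b_i is far from the trivial embedding f0,
    i.e., experts trained on more complex tasks."""
    scores = np.array([d_asym(m, task_emb, f0, alpha) for m in expert_embs])
    return int(np.argmin(scores))
```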

5. Experiments

We test TASK2VEC on a large collection of tasks and models, related to different degrees. Our experiments aim to test both qualitative properties of the embedding and its performance on meta-learning tasks. We use an off-the-shelf ResNet-34 pretrained on ImageNet as our probe network, which we found to give the best overall performance (see Sect. 5.2). The collection of tasks is generated starting from the following four main datasets. iNaturalist [36]: Each extracted task corresponds to species classification within a given taxonomical order. For instance, the "Rodentia task" is to classify species of rodents. Notice that each task is defined on a separate subset of the images in the original dataset; that is, the domains of the tasks are disjoint. CUB-200 [37]: We use the same procedure as for iNaturalist to create tasks. In this case, all tasks are classifications inside orders of birds (the Aves taxonomical class), and generally have far fewer training samples than the corresponding tasks in iNaturalist. iMaterialist [1] and DeepFashion [23]: Each image in both datasets is associated with several binary attributes (e.g., style attributes) and categorical attributes (e.g., color, type of dress, material). We binarize the categorical attributes and consider each attribute a separate task. Notice that, in this case, all tasks share the same domain and are naturally correlated.

In total, our collection has 1,460 tasks (207 iNaturalist, 25 CUB, 228 iMaterialist, 1,000 DeepFashion). While a few tasks have many training examples (e.g., hundreds of thousands), most have just hundreds or thousands of samples. This simulates the heavy-tail distribution of data in real-world applications.

[Figure 3: violin plots titled "iNat+CUB error distribution and expert selection". Columns are tasks from CUB (Bombycillidae, Thraupidae, Laniidae, ..., Emberizidae) and iNaturalist (Pelecaniformes, Rodentia, ..., Lepidoptera); the y-axis is test error (0% to 80%); markers distinguish the selected expert from the ImageNet expert.]

Figure 3: TASK2VEC often selects the best available experts. Violin plot of the distribution of the final test error (shaded plot) on tasks from the CUB-200 dataset (columns), obtained by training a linear classifier over several expert feature extractors (points). Most specialized feature extractors perform similarly on a given task, and are generally similar to or worse than a generic feature extractor pre-trained on ImageNet (blue triangles). However, in some cases a carefully chosen expert, trained on a relevant task, can greatly outperform all other experts (long whisker of the violin plot). The model selection algorithm based on TASK2VEC can, without training, suggest an expert to use for the task (red cross, lower is better). TASK2VEC mostly recovers the optimal, or close to optimal, feature extractor without having to perform an expensive brute-force search over all possibilities. Columns are ordered by norm of the task embedding: notice that tasks with lower embedding norm have lower error, and more "complex" tasks (tasks with higher embedding norm) tend to benefit more from a specialized expert.

Together with the collection of tasks, we collect several "expert" feature extractors. These are ResNet-34 models pre-trained on ImageNet and then fine-tuned on a specific task or collection of related tasks (see Supplementary Materials for details). We also consider a "generic" expert pre-trained on ImageNet without any fine-tuning. Finally, for each combination of expert feature extractor and task, we trained a linear classifier on top of the expert in order to solve the selected task using the expert.

In total, we trained 4,100 classifiers, 156 feature extractors and 1,460 embeddings. The total effort to generate the final results was about 1,300 GPU hours.

Meta-tasks. In Sect. 5.2, for a given task we aim to predict, using TASK2VEC, which expert feature extractor will yield the best classification performance. In particular, we formulate two model selection meta-tasks: iNat + CUB and Mixed. The first consists of 50 tasks and experts from iNaturalist and CUB, and aims to test fine-grained expert selection in a restricted domain. The second contains a mix of 26 curated experts and 50 random tasks extracted from all datasets, and aims to test model selection between different domains and tasks (see Supplementary Material for details).

5.1. Task Embedding Results

Task Embedding qualitatively reflects taxonomic distance for iNaturalist. For tasks extracted from the iNaturalist dataset (classification of species), the taxonomical distance between orders provides a natural metric of the semantic similarity between tasks. In Figure 2 we compare the symmetric TASK2VEC distance with the taxonomical distance, showing strong agreement.

Task embedding for iMaterialist. In Fig. 1 we show a t-SNE visualization of the embedding for iMaterialist and iNaturalist tasks. Task embedding yields interpretable results: tasks that are correlated in the dataset, such as binary classes corresponding to the same categorical attribute, may end up far away from each other and close to other tasks that are semantically more similar (e.g., the jeans category task is close to the ripped attribute and the denim material). This is reflected in the mixture of colors of semantically related nearby tasks, showing non-trivial grouping.

We also compare the TASK2VEC embedding with a domain embedding baseline, which only exploits the input distribution $p(x)$ rather than the task distribution $p(x, y)$. While some tasks are highly correlated with their domain (e.g., tasks from iNaturalist), other tasks differ only in their labels (e.g., all the attribute tasks of iMaterialist, which share the same clothes domain). Accordingly, the domain embedding recovers similar clusters on iNaturalist. However, on iMaterialist the domain embedding collapses all tasks to a single uninformative cluster (not a single point due to slight noise in the embedding computation).

[Figure 4: error relative to brute force (lower is better) vs. number of samples (10^2 to 10^4, log scale); curves: Brute force fixed, ImageNet fixed, Task2Vec fixed, ImageNet finetune, Task2Vec finetune.]

Figure 4: TASK2VEC improves results at different dataset sizes and training conditions. Performance of model selection on a subset of 4 tasks as a function of the number of samples available for training, relative to optimal model selection (dashed orange). Training a classifier on the feature extractor selected by TASK2VEC (solid red) is always better than using a generic ImageNet feature extractor (dashed red). The same holds when allowed to fine-tune the feature extractor (blue curves). Also notice that in the low-data regime, fine-tuning the ImageNet feature extractor is more expensive and performs worse than accurately selecting a good fixed feature extractor.

Probe network    Top-10    All
Chance           +13.95%   +59.52%
VGG-13           +4.82%    +38.03%
DenseNet-121     +0.30%    +10.63%
ResNet-34        +0.00%    +9.97%

Table 1: Choice of probe network. Mean relative error increase over the ground-truth optimum on the iNat+CUB meta-task for different choices of the probe network. We also report the performance on the 10 tasks with the most samples to show how data size affects different architectures.

Task Embedding encodes task difficulty. The scatter plot in Fig. 2 (Right) compares the norm of the embedding vectors vs. the performance of the best expert (or task-specific model, for the cases where we have it computed). As shown analytically for the two-layer model, the norm of the task embedding correlates with the complexity of the task also on real tasks and architectures.

5.2. Model Selection

Given a task, our aim is to select an expert feature extractor that maximizes the classification performance on that task. We propose two strategies: (1) embed the task and select the feature extractor trained on the most similar task, and (2) jointly embed the models and tasks, and select a model using the learned metric (see Section 4). Notice that (1) does not use knowledge of the model performance on various tasks, which makes it more widely applicable but requires that we know what task a model was trained on, and may ignore the fact that models trained on slightly different tasks may still provide an overall better feature extractor (for example, by over-fitting less to the task they were trained on).

In Table 2 we compare the overall results of the various proposed metrics on the model selection meta-tasks. On both the iNat+CUB and Mixed meta-tasks, the Asymmetric TASK2VEC model selection is close to the ground-truth optimum, and significantly improves over both chance and using a generic ImageNet expert. Notice that our method has O(1) complexity, while searching over a collection of N experts is O(N).

Error distribution. In Fig. 3 we show in detail the error distribution of the experts on multiple tasks. It is interesting to notice that the classification error obtained using most experts clusters around some mean value, with little improvement observed over using a generic expert. On the other hand, a few optimal experts can obtain far better performance on the task than a generic expert. This confirms the importance of having access to a large collection of experts when solving a new task, especially if little training data is available. But this collection can only be exploited efficiently if, as we propose, an algorithm is available to find one of the few good experts for the task.

Dependence on task dataset size. Finding experts is especially important when the task we are interested in has relatively few samples. In Fig. 4 we show how the performance of TASK2VEC varies on a model selection task as the number of samples varies. At all sample sizes TASK2VEC is close to the optimum, and improves over selecting a generic expert (ImageNet), both when fine-tuning and when training only a classifier. We observe that the best choice of experts is not affected by the dataset size, and that even with few examples TASK2VEC is able to find the optimal experts.

Choice of probe network. In Table 1 we show that DenseNet [15] and ResNet [11] architectures perform significantly better when used as probe networks to compute the TASK2VEC embedding than a VGG [32] architecture.

Meta-task      Optimal   Chance      ImageNet   TASK2VEC   Asymmetric TASK2VEC   MODEL2VEC
iNat + CUB     31.24     +59.52%     +30.18%    +42.54%    +9.97%                +6.81%
Mixed          22.90     +112.49%    +75.73%    +40.30%    +29.23%               +27.81%

Table 2: Model selection performance of different metrics. Average optimal error obtained on two meta-learning tasks by exhaustive search over the best expert, and relative error increase when using cheaper model selection methods. Always picking a fixed good general model (e.g., a model pretrained on ImageNet) performs better than picking an expert at random (chance). However, picking an expert using the Asymmetric TASK2VEC distance can achieve an overall better performance than using a general model. Notice also the improvement over the Symmetric version, especially on iNat + CUB, where experts trained on very similar tasks may be too simple to yield good transfer and should be avoided.

6. Related Work

Task and domain embedding. Tasks distinguished by their domain can be understood simply in terms of image statistics. Due to the bias of different datasets, sometimes a benchmark task may be identified just by looking at a few images [34]. The question of determining what summary statistics are useful (analogous to our choice of probe network) has also been considered; for example, [9] train an autoencoder that learns to extract fixed-dimensional summary statistics that can reproduce many different datasets accurately. However, for general vision tasks which apply to all natural images, the domain is the same across tasks.

Taskonomy [39] explores the structure of the space of tasks, focusing on the question of effective knowledge transfer in a curated collection of 26 visual tasks, ranging from classification to 3D reconstruction, defined on a common domain. They compute pairwise transfer distances between pairs of tasks and use the results to compute a directed hierarchy. Introducing a novel task requires computing the pairwise distance with the tasks in the library. In contrast, we focus on a larger library of 1,460 fine-grained classification tasks, both on the same and on different domains, and show that it is possible to represent tasks in a topological space with a constant-time embedding. The large task collection and cheap embedding cost allow us to tackle new meta-learning problems.

Fisher kernels. Our work takes inspiration from Jaakkola and Haussler [16]. They propose the "Fisher kernel," which uses the gradients of a generative model score function as a representation of similarity between data items:

$$K(x^{(1)}, x^{(2)}) = \nabla_\theta \log P(x^{(1)}|\theta)^T\, F^{-1}\, \nabla_\theta \log P(x^{(2)}|\theta).$$

Here $P(x|\theta)$ is a parameterized generative model and $F$ is the Fisher information matrix. This provides a way to utilize generative models in the context of discriminative learning. Variants of the Fisher kernel have found wide use as a representation of images [28, 29] and other structured data such as protein molecules [17] and text [30]. Since the generative model can be learned on unlabelled data, several works have investigated the use of the Fisher kernel for unsupervised learning [14, 31]. [35] learns a metric on the Fisher kernel representation similar to our metric learning approach. Our approach differs in that we use the FIM as a representation of a whole dataset (task) rather than using model gradients as representations of individual data items.

Fisher Information for CNNs. Our approach to task embedding makes use of the Fisher Information matrix of a neural network as a characterization of the task. The use of Fisher information for neural networks was popularized by Amari [6], who advocated optimization using natural gradient descent, which leverages the fact that the FIM is an appropriate parameterization-independent metric on statistical models. Recent work has focused on approximations of the FIM appropriate in this setting (see e.g., [12, 10, 25]). The FIM has also been used in various regularization schemes [5, 8, 22, 27], to analyze the learning dynamics of deep networks [4], and to overcome catastrophic forgetting [19].

Meta-learning and model selection. The general problem of meta-learning has a long history, with much recent work dedicated to problems such as neural architecture search and hyper-parameter estimation. Closely related to our problem is work on selecting from a library of classifiers to solve a new task [33, 2, 20]. Unlike our approach, these usually address the question via landmarking or active testing, in which a few different models are evaluated and the performance of the remainder is estimated by extension. This can be viewed as a problem of completing a matrix defined by the performance of each model on each task.

A similar approach has been taken in computer vision for selecting a detector for a new category out of a large library of detectors [26, 40, 38].

7. Discussion

TASK2VEC is an efficient way to represent a task, or the corresponding dataset, as a fixed-dimensional vector. It has several appealing properties; in particular, its norm correlates with the test error obtained on the task, and the cosine distance between embeddings correlates with natural distances between tasks, when available, such as the taxonomic distance for species classification and the fine-tuning distance for transfer learning. Having a representation of tasks paves the way for a wide variety of meta-learning tasks. In this work, we focused on the selection of an expert feature extractor to solve a new task, especially when little training data is present, and showed that using TASK2VEC to select an expert from a collection can sensibly improve test performance while adding only a small overhead to the training process.

Meta-learning on the space of tasks is an important step toward general artificial intelligence. In this work, we introduce a way of dealing with thousands of tasks, enough to reconstruct a topology on the task space and to test meta-learning solutions. The current experiments highlight the usefulness of our methods. Even so, our collection does not capture the full complexity and variety of tasks that one may encounter in real-world situations. Future work should further test the effectiveness, robustness, and limitations of the embedding on larger and more diverse collections.

References

[1] iMaterialist Challenge (Fashion) at FGVC5 workshop, CVPR 2018. https://www.kaggle.com/c/imaterialist-challenge-fashion-2018.
[2] S. M. Abdulrahman, P. Brazdil, J. N. van Rijn, and J. Vanschoren. Speeding up algorithm selection using average ranking and active testing by introducing runtime. Machine Learning, 107(1):79-108, 2018.
[3] A. Achille, G. Mbeng, G. Paolini, and S. Soatto. The dynamic distance between learning tasks: From Kolmogorov complexity to transfer learning via quantum physics and the information bottleneck of the weights of deep networks. Proc. of the NIPS Workshop on Integration of Deep Learning Theories (arXiv:1810.02440), October 2018.
[4] A. Achille, M. Rovere, and S. Soatto. Critical learning periods in deep neural networks. Proc. of the Intl. Conf. on Learning Representations (ICLR), arXiv:1711.08856, 2019.
[5] A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research (arXiv:1706.01350), 19(50):1-34, 2018.
[6] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
[7] S.-I. Amari and H. Nagaoka. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society, 2000.
[8] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
[9] H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
[10] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[12] T. Heskes. On natural learning and pruning in multilayered perceptrons. Neural Computation, 12(4):881-901, 2000.
[13] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997.
[14] A. D. Holub, M. Welling, and P. Perona. Combining generative models and Fisher kernels for object recognition. In IEEE International Conference on Computer Vision, volume 1, pages 136-143. IEEE, 2005.
[15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[16] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems, pages 487-493, 1999.
[17] T. S. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. In ISMB, volume 99, pages 149-158, 1999.
[18] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575-2583, 2015.
[19] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, page 201611835, 2017.
[20] R. Leite, P. Brazdil, and J. Vanschoren. Selecting classification algorithms with active testing. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pages 117-131. Springer, 2012.
[21] H. Li, Z. Xu, G. Taylor, and T. Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.
[22] T. Liang, T. Poggio, A. Rakhlin, and J. Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530, 2017.
[23] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1096-1104, 2016.
[24] J. Martens. New perspectives on the natural gradient method. CoRR, abs/1412.1193, 2014.
[25] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408-2417, 2015.
[26] P. Matikainen, R. Sukthankar, and M. Hebert. Model recommendation for action recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2256-2263. IEEE, 2012.
[27] Y. Mroueh and T. Sercu. Fisher GAN. In Advances in Neural Information Processing Systems, pages 2513-2523, 2017.
[28] F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision, pages 143-156. Springer, 2010.
[29] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, 2013.
[30] C. Saunders, A. Vinokourov, and J. S. Shawe-Taylor. String kernels, Fisher kernels and finite state automata. In Advances in Neural Information Processing Systems, pages 649-656, 2003.
[31] M. Seeger. Learning with labeled and unlabeled data. Technical Report EPFL-REPORT-161327, Institute for Adaptive and Neural Computation, University of Edinburgh, 2000.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] M. R. Smith, L. Mitchell, C. Giraud-Carrier, and T. Martinez. Recommending learning algorithms and their associated hyperparameters. arXiv preprint arXiv:1407.1890, 2014.
[34] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521-1528. IEEE, 2011.
[35] L. Van Der Maaten. Learning discriminative Fisher kernels. In ICML, volume 11, pages 217-224, 2011.
[36] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[37] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[38] Y.-X. Wang and M. Hebert. Model recommendation: Generating object detectors from few samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1619-1628, 2015.
[39] A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712-3722, 2018.
[40] P. Zhang, J. Wang, A. Farhadi, M. Hebert, and D. Parikh. Predicting failures of vision systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3566-3573, 2014.

A. Analytic FIM for two-layer model

Assume we have data points $(x_i, y_i)$, $i = 1, \ldots, N$, with $y_i \in \{0, 1\}$. Assume that a fixed feature extractor applied to data points $x$ yields features $z = \phi(x) \in \mathbb{R}^d$ and a linear model with parameters $w$ is trained to model the conditional distribution $p_i = P(y_i = 1 | x_i) = \sigma(w^T \phi(x_i))$, where $\sigma$ is the sigmoid function. The gradient of the cross-entropy loss with respect to the linear model parameters is:

$$\frac{\partial \ell}{\partial w} = \frac{1}{N} \sum_i (y_i - p_i)\, \phi(x_i),$$

and the empirical estimate of the Fisher information matrix is:

$$F = \mathbb{E}\left[\frac{\partial \ell}{\partial w} \left(\frac{\partial \ell}{\partial w}\right)^T\right] = \mathbb{E}_{y\sim p_w(y|x)}\, \frac{1}{N} \sum_i \phi(x_i)\, (y_i - p_i)^2\, \phi(x_i)^T = \frac{1}{N} \sum_i \phi(x_i)\, p_i (1 - p_i)\, \phi(x_i)^T.$$

In general, we are also interested in the Fisher information of the parameters of the feature extractor $\phi(x)$, since this is independent of the specifics of the output space $y$ (e.g., for $k$-way classification). Consider a 2-layer network where the feature extractor uses a sigmoid non-linearity:

$$p = \sigma(w^T z), \qquad z_k = \sigma(U_k^T x),$$

where the matrix $U$ specifies the feature extractor parameters and $w$ are the parameters of the task-specific classifier. Taking the gradient w.r.t. the parameters, we have:

$$\frac{\partial \ell}{\partial w_j} = (y - p)\, z_j, \qquad \frac{\partial \ell}{\partial U_{kj}} = (y - p)\, w_k z_k (1 - z_k)\, x_j.$$

The Fisher Information Matrix (FIM) consists of blocks:

$$\frac{\partial \ell}{\partial w_i}\left(\frac{\partial \ell}{\partial w_j}\right)^T = (y - p)^2\, z_i z_j,$$
$$\frac{\partial \ell}{\partial U_{ki}}\left(\frac{\partial \ell}{\partial w_j}\right)^T = (y - p)^2\, z_j z_k (1 - z_k)\, x_i,$$
$$\frac{\partial \ell}{\partial U_{li}}\left(\frac{\partial \ell}{\partial U_{kj}}\right)^T = (y - p)^2\, w_k z_k (1 - z_k)\, w_l z_l (1 - z_l)\, x_i x_j.$$

We focus on the FIM of the probe network parameters, which is independent of the dimensionality of the output layer, and write it in matrix form as:

$$\frac{\partial \ell}{\partial U_l}\left(\frac{\partial \ell}{\partial U_k}\right)^T = (y - p)^2\, (1 - z_k) z_k (1 - z_l) z_l\, w_k w_l\, xx^T.$$

Note that each block $\{l, k\}$ consists of the same matrix $(y - p)^2 \cdot xx^T$ multiplied by a scalar $S_{kl}$ given as:

$$S_{kl} = (1 - z_k) z_k (1 - z_l) z_l\, w_k w_l.$$

We can thus write the whole FIM as the expectation of a Kronecker product:

$$F = \mathbb{E}\big[(y - p)^2 \cdot S \otimes xx^T\big],$$

where the matrix $S$ can be written as

$$S = ww^T \odot zz^T \odot (1 - z)(1 - z)^T.$$

[Figure 5: t-SNE of toy task embeddings for two probe networks; panels (a) "Random linear + ReLU" and (b) "Polynomial of degree three".]

Figure 5: Task embeddings computed for a probe network consisting of (a) 10 random linear + ReLU features and (b) degree-three polynomial features, projected to 2D using t-SNE. The tasks are random binary partitions of the unit square, visualized in each icon (three tasks are visualized on the left), and cannot be distinguished based purely on the input domain without considering target labels. Note that qualitatively similar tasks group together, with more complex tasks (requiring complicated decision boundaries) separated from simpler tasks.

Given a task described by $N$ training samples $\{(x_e, y_e)\}$, the FIM can be estimated empirically as

$$F = \frac{1}{N} \sum_e p_e (1 - p_e) \cdot S_e \otimes x_e x_e^T, \qquad S_e = ww^T \odot z_e z_e^T \odot (1 - z_e)(1 - z_e)^T,$$

where we take the expectation over $y$ w.r.t. the predictive distribution $y \sim p_w(y|x)$.
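A direct numpy transcription of this estimator (a sketch for checking the formulas; names are ours):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def two_layer_fim(X, U, w):
    """Empirical FIM of the first-layer weights U of the two-layer model
    p = sigma(w^T z), z = sigma(U^T x), following the estimator above.

    X: (N, d) inputs; U: (d, k) first-layer weights; w: (k,) classifier.
    """
    d, k = U.shape
    F = np.zeros((k * d, k * d))
    for x in X:
        z = sigmoid(U.T @ x)              # (k,) hidden activations
        p = sigmoid(w @ z)                # predicted probability
        s = w * z * (1.0 - z)             # chosen so that S_e = outer(s, s)
        F += p * (1.0 - p) * np.kron(np.outer(s, s), np.outer(x, x))
    return F / len(X)
```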

Example toy task embedding. As noted in the main text, the FIM depends on the domain embedding, the particular task and its complexity. We illustrate these properties of the task embedding using a "toy" task space, illustrated in Figure 5. We generate 64 binary classification tasks by clustering a uniform grid of points in the XY plane into $k \in [3, 16]$ clusters using k-means and assigning half of them to one category. We consider two different feature extractors, which play the role of the "probe network". One is a collection of polynomial functions of degree $d = 3$; the second is 10 random linear features of the form $\max(0, ax + by + c)$, where $a$ and $b$ are sampled uniformly in $[-1/2, 1/2]$ and $c$ in $[-1, 1]$.

B. Robust Fisher Computation

Consider again the loss function (parametrized with the covariance matrix $\Sigma$ instead of the precision matrix $\Lambda$, for convenience of notation):

$$L(\hat w; \Sigma) = \mathbb{E}_{w\sim N(\hat w,\Sigma)}\big[H_{p_w,\hat p}(y|x)\big] + \beta\, KL\big(N(\hat w, \Sigma)\,\|\,N(0, \sigma^2 I)\big).$$

We will make use of the fact that the Fisher information matrix is a positive semidefinite approximation of the Hessian $H$ of the cross-entropy loss, and coincides with it at local minima [24]. Expanding to the second order around $\hat w$, we have:

$$L(\hat w; \Sigma) = \mathbb{E}_{w\sim N(\hat w,\Sigma)}\Big[H_{p_{\hat w},\hat p}(y|x) + \nabla_w H_{p_{\hat w},\hat p}(y|x)\,(w - \hat w) + \tfrac{1}{2}(w - \hat w)^T H (w - \hat w)\Big] + \beta\, KL\big(N(\hat w,\Sigma)\,\|\,N(0,\sigma^2 I)\big)$$
$$= H_{p_{\hat w},\hat p}(y|x) + \tfrac{1}{2}\,\mathrm{tr}(\Sigma H) + \beta\, KL\big(N(\hat w,\Sigma)\,\|\,N(0,\sigma^2 I)\big)$$
$$= H_{p_{\hat w},\hat p}(y|x) + \tfrac{1}{2}\,\mathrm{tr}(\Sigma H) + \frac{\beta}{2}\Big[\frac{\|\hat w\|^2}{\sigma^2} + \frac{1}{\sigma^2}\,\mathrm{tr}\,\Sigma + k \log \sigma^2 - \log |\Sigma| - k\Big],$$

where in the last line we used the known expression for the KL divergence of two Gaussians. Taking the derivative with respect to $\Sigma$ and setting it to zero, we obtain that the loss is minimized when $\Sigma^{-1} = \frac{2}{\beta}\big(H + \frac{\beta}{2\sigma^2} I\big)$, or, rewritten in terms of the precision matrices, when

$$\Lambda = \frac{2}{\beta}\Big(H + \frac{\beta\lambda^2}{2}\, I\Big),$$

where we have introduced the precision matrices $\Lambda = \Sigma^{-1}$ and $\lambda^2 I = \frac{1}{\sigma^2} I$.

We can then obtain an estimate of the Hessian $H$ of the cross-entropy loss at the point $\hat w$, and hence of the FIM, by minimizing the loss $L(\hat w; \Lambda)$ with respect to $\Lambda$. This is a more robust approximation than the standard definition, as it depends on the loss in a whole neighborhood of $\hat w$ of size $\propto \Lambda^{-1}$, rather than on the derivatives of the loss at a single point. To make the estimation even more robust, and to reduce the number of parameters, we constrain $\Lambda$ to be diagonal, and constrain weights belonging to the same filter to have the same precision. Optimization of this loss can be performed easily using Stochastic Gradient Variational Bayes, and in particular using the local reparametrization trick of [18].

The prior precision $\lambda^2$ should be picked according to the scale of the weights of each layer. In practice, since the weights of each layer have a different scale, we found it useful to select a different $\lambda^2$ for each layer and train it together with $\Lambda$.

C. Details of the experiments

C.1. Training of experts and classifiers

Given a task, we train an expert on it by fine-tuning an off-the-shelf ResNet-34 pretrained on ImageNet¹. Fine-tuning is performed by first fixing the weights of the network and retraining from scratch only the final classifier for 10 epochs using Adam, and then fine-tuning the whole network with SGD for 60 epochs, with weight decay 5e-4, starting from learning rate 0.001 and decreasing it by a factor of 0.1 at epoch 40. A sketch of this recipe is given below.
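The following sketch illustrates the two-stage recipe (our code; hyperparameters not stated in the text, such as momentum and batch size, are assumptions):

```python
import torch
from torchvision import models

def run_epoch(net, loader, opt, device):
    loss_fn = torch.nn.CrossEntropyLoss()
    for x, y in loader:
        opt.zero_grad()
        loss_fn(net(x.to(device)), y.to(device)).backward()
        opt.step()

def train_expert(loader, n_classes, device="cuda"):
    net = models.resnet34(pretrained=True)
    net.fc = torch.nn.Linear(net.fc.in_features, n_classes)  # new task head
    net.to(device)

    # Stage 1: freeze the backbone, retrain only the final classifier (10 epochs, Adam).
    for p in net.parameters():
        p.requires_grad = False
    for p in net.fc.parameters():
        p.requires_grad = True
    head_opt = torch.optim.Adam(net.fc.parameters())
    for _ in range(10):
        run_epoch(net, loader, head_opt, device)

    # Stage 2: fine-tune everything with SGD, decaying the learning rate at epoch 40.
    for p in net.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[40], gamma=0.1)
    for _ in range(60):
        run_epoch(net, loader, opt, device)
        sched.step()
    return net
```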

Given an expert, we train a classifier on top of it by replacing the final classification layer and training it with Adam for 16 epochs. We use weight decay 5e-4 and learning rate 1e-4.

The tasks we train on generally have different numbers of samples and unbalanced classes. To limit the impact of this imbalance on the training procedure, regardless of the total size of the dataset, in each epoch we always sample 10,000 images with replacement, uniformly between classes. In this way, all epochs have the same length and see approximately the same number of examples of each class. We use this balanced sampling in all experiments, unless noted otherwise.
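This balanced sampling is straightforward to implement with a weighted sampler; a minimal sketch (our code):

```python
import numpy as np
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, labels, epoch_size=10_000, batch_size=64):
    """Every epoch draws `epoch_size` examples with replacement, giving each
    class the same total probability regardless of its frequency."""
    labels = np.asarray(labels)
    weights = 1.0 / np.bincount(labels)[labels]
    sampler = WeightedRandomSampler(weights, num_samples=epoch_size, replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```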

C.2. Computation of the TASK2VEC embedding

As described in the main text, the TASK2VEC embedding is obtained by choosing a probe network, retraining the final classifier on the given task, and then computing the Fisher Information Matrix for the weights of the probe network.

Unless specified otherwise, we use an off-the-shelf ResNet-34 pretrained on ImageNet as the probe network. The Fisher Information Matrix is computed in a robust way by minimizing the loss function L(ŵ; Λ) with respect to the precision matrix Λ, as described before. To make computation of the embedding faster, instead of waiting for the convergence of the classifier, we train the final classifier for 2 epochs using Adam and then continue training it jointly with the precision matrix Λ using the loss L(ŵ; Λ). We constrain Λ to be positive by parametrizing it as Λ = exp(L), for some unconstrained variable L. While for the classifier we use a low learning rate (1e-4), we found it useful to use a higher learning rate (1e-2) to train L.
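Concretely, the positivity constraint and the two learning rates can be expressed with an exponential parametrization and optimizer parameter groups; a sketch where the head dimensions are placeholders of our choosing:

```python
import torch

# Lambda = exp(L): the precision stays positive for any unconstrained L.
n_filters = 512                                  # number of tied precision components
classifier = torch.nn.Linear(512, 100)           # the retrained task head

L = torch.zeros(n_filters, requires_grad=True)

optimizer = torch.optim.Adam([
    {"params": classifier.parameters(), "lr": 1e-4},  # low lr for the classifier
    {"params": [L], "lr": 1e-2},                      # higher lr for the log-precision
])
# Inside the training loop one would use Lambda = torch.exp(L) in the loss L(w; Lambda).
```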

C.3. Training the MODEL2VEC embedding

As described in the main text, in the MODEL2VEC embedding we aim to learn a vector representation m_j = F_j + b_j of the j-th model in the collection, which represents both the task the model was trained on (through the TASK2VEC embedding F_j) and the particularities of the model (through the learned parameter b_j).

We learn b_j by minimizing a k-way classification loss which, given a task t, aims to select the model that performs best on the task among a collection of k models. Multiple models may perform similarly, and close to optimally: to preserve this information, instead of using a one-hot encoding for the best model, we train using soft labels obtained as follows:

p(y_i) = Softmax( −α · (error_i − mean(error_i)) / std(error_i) ),

where error_{i,j} is the ground-truth test error obtained by training a classifier for task i on top of the j-th model. Notice that for α ≫ 1, the soft label y_i reduces to the one-hot encoding of the index of the best performing model. However, for lower values of α, the vector y_i contains richer information about the relative performance of the models. A sketch of this computation is given below.

¹ https://pytorch.org/docs/stable/torchvision/models.html
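A minimal sketch of the soft-label computation (our code; `errors` is the n_tasks × n_models matrix of ground-truth test errors):

```python
import torch

def soft_labels(errors, alpha=20.0):
    """Standardize each task's error vector across models, then softmax(-alpha * z)."""
    z = (errors - errors.mean(dim=1, keepdim=True)) / errors.std(dim=1, keepdim=True)
    return torch.softmax(-alpha * z, dim=1)   # one distribution over models per task
```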


We obtain our prediction in a similar way: let d_{i,j} = d_asym(t_i, m_j); then we set our model prediction to be

p(y | d_{i,0}, …, d_{i,k}) = Softmax(−γ d_i),

where d_i = (d_{i,0}, …, d_{i,k}) and the scalar γ > 0 is a learned parameter. Finally, we learn both the m_j's and γ using a cross-entropy loss:

L = −(1/N) Σ_i E_{y_i∼p}[ log p(y_i | d_{i,0}, …, d_{i,k}) ],

which is minimized precisely when p(y | d_{i,0}, …, d_{i,k}) = p(y_i). In our experiments we set α = 20 and minimize the loss using Adam with learning rate 0.05, weight decay 0.0005, and early stopping after 81 epochs. We report the leave-one-out error (that is, for each task we train using the ground truth of all other tasks, test on that task alone, and report the average of the test errors obtained in this way). A sketch of this objective follows.
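Putting the pieces together, the meta-learning objective can be sketched as follows (our code; in practice the distances d_{i,j} depend on the learned b_j through m_j = F_j + b_j, so they would be recomputed at every optimization step, and the exp parametrization of γ is our choice to keep it positive):

```python
import torch

def model2vec_loss(d, targets, log_gamma):
    """Cross-entropy between the soft labels and the softmax over (scaled,
    negated) asymmetric distances.
    d:         [n_tasks, n_models] distances d_asym(t_i, m_j)
    targets:   soft labels p(y_i) from above
    log_gamma: learnable scalar, gamma = exp(log_gamma) > 0"""
    log_p = torch.log_softmax(-torch.exp(log_gamma) * d, dim=1)
    return -(targets * log_p).sum(dim=1).mean()
```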

D. Datasets, tasks and meta-tasks

Our two model selection meta-tasks, iNat+CUB and Mixed, are curated as follows. For iNat+CUB, we generated 50 tasks and (the same) experts from iNaturalist and CUB. The 50 tasks consist of 25 iNaturalist tasks and 25 CUB tasks, to provide a balanced mix from two datasets of the same domain. We generated the 25 iNaturalist tasks by grouping species into orders and then choosing the top 25 orders with the most samples. The number of samples per task shows the heavy-tail distribution typical of real data, with the top task having 64,100 samples (the Passeriformes order classification task), while most tasks have around 6,000 samples.

The 25 CUB tasks were generated similarly, with 10 order tasks plus 15 Passeriformes family tasks: after grouping CUB into orders, we determined 11 usable order tasks (the only unusable order task, Gaviiformes, has only one species, so it makes no sense to train on it). However, one of the orders, Passeriformes, dominated all other orders with 134 species, compared to the 3-24 species of the other orders. Therefore, we decided to further subdivide the Passeriformes order task into family tasks (i.e., grouping species into families) to provide a more balanced partition. This resulted in 15 usable family tasks (i.e., those with more than one species) out of 22 family tasks. Unlike iNaturalist, tasks from CUB have only a few hundred samples and hence benefit more from carefully selecting an expert.

In the iNat+CUB meta-task, the classification tasks are the same tasks used to train the experts. To avoid trivial solutions (always selecting the expert trained on the task we are trying to solve), we test in a leave-one-out fashion: given a classification task, we aim to select the best expert that was not trained on the same data.

For the Mixed meta-task, we chose 40 random tasks and 25 curated experts from all datasets. The 25 experts were generated from iNaturalist, iMaterialist and DeepFashion (CUB, having fewer samples than iNaturalist, is more appropriate as a source of tasks). For iNaturalist, we trained 15 experts: 8 order tasks and 7 class tasks (species grouped by class), all with more than 10,000 samples. For DeepFashion, we trained 3 category experts (upper-body, lower-body, full-body). For iMaterialist, we trained 2 category experts (pants, shoes) and 5 multi-label experts by grouping attributes (color, gender, neckline, sleeve, style). For the purposes of clustering attributes into larger groups for training experts (and color coding the dots in Figure 1), we obtained a de-anonymized list of the iMaterialist Fashion attribute names from the FGVC contest organizers.

The 40 random tasks were generated as follows. In order to balance tasks among all datasets, we selected 5 CUB, 15 iNaturalist, 15 iMaterialist and 5 DeepFashion tasks. Within those datasets, we randomly picked tasks with a sufficient number of validation samples and maximum variety. For the iNaturalist tasks, we grouped the order tasks into class tasks, filtered out those with fewer than 100 validation samples, and randomly picked order tasks within each class. For the iMaterialist tasks, we similarly grouped the tasks (e.g., category, style, pattern), filtered out tasks with fewer than 1,000 validation samples, and randomly picked tasks within each group. For CUB, we randomly selected 2 order tasks and 3 Passeriformes family tasks, and for DeepFashion, we selected the tasks uniformly at random. All this ensures that we have a balanced variety of tasks.

For the data efficiency experiment, we trained on a subset of the tasks and experts of the Mixed meta-task: we picked Accipitriformes, Asparagales, Upper-body, and Short Sleeves as the tasks, and Color, Lepidoptera, Upper-body, Passeriformes, and Asterales as the experts. Tasks were selected among those that have more than 30,000 training samples, in order to represent all datasets. The experts were also selected to be representative of all datasets, and include both strong and very weak experts (such as the Color expert).


E. Error matrices

[The two ground-truth error matrices of Figure 6 are rendered as images in the original; their extracted row/column labels and raw cell values are omitted here. In the left matrix, rows and columns are the 50 CUB and iNaturalist tasks/experts of the iNat+CUB meta-task; in the right matrix, rows are the 40 tasks of the Mixed meta-task (from CUB, iNaturalist, iMaterialist and DeepFashion) and columns are the 25 curated experts. See the caption below.]

Figure 6: Meta-tasks ground-truth error matrices (best viewed magnified). (Left) Error matrix for the CUB+iNat meta-task. The number in each cell is the test error obtained by training a classifier on a given combination of task (rows) and expert (columns). The background color represents the Asymmetric TASK2VEC distance between the target task and the task used to train the expert. Numbers in red indicate the selection made by the model selection algorithm based on the Asymmetric TASK2VEC embedding. The (off-diagonal) optimal expert, when different from the one selected by our algorithm, is highlighted in blue. (Right) Same as before, but for the Mixed meta-task.