Distilling Cross-Task Knowledge via Relationship Matching

Han-Jia Ye, Nanjing University
[email protected]

Su Lu, Nanjing University
[email protected]

De-Chuan Zhan, Nanjing University
[email protected]

Abstract

The discriminative knowledge from a high-capacity deep neural network (a.k.a. the "teacher") could be distilled to facilitate the learning efficacy of a shallow counterpart (a.k.a. the "student"). This paper deals with a general scenario reusing the knowledge from a cross-task teacher — the two models target non-overlapping label spaces. We emphasize that the comparison ability between instances acts as an essential factor threading knowledge across domains, and propose the RElationship FacIlitated Local cLassifiEr Distillation (REFILLED) approach, which decomposes the knowledge distillation flow into branches for the embedding and the top-layer classifier. In particular, different from reconciling the instance-label confidence between models, REFILLED requires the teacher to reweight the hard triplets put forward by the student, so that the similarity comparison levels between instances are matched. A local embedding-induced classifier from the teacher further supervises the student's classification confidence. REFILLED demonstrates its effectiveness when reusing cross-task models, and also achieves state-of-the-art performance on standard knowledge distillation benchmarks. The code of the paper can be accessed at https://github.com/njulus/ReFilled.

1. Introduction

Knowledge distillation [6, 20, 64] facilitates the learning efficiency of a deep neural network by reusing the "dark knowledge" from another model. In detail, a strong classifier, e.g., a neural network trained with deeper architectures [43], high-quality images [65], or precise optimization strategies [13, 60], acts as a "teacher" and guides the training of a weaker "student" model. Such model-based knowledge reuse improves the discriminative ability of the target student model, and relieves the burden of model training and storage as well [20, 43, 64, 13]. Its success has been witnessed in a wide range of applications such as model/dataset compression [56, 2, 35, 36, 8], multi-task learning [68, 27], and incremental image classification [69, 24].

Figure 1. An illustration of reusing the knowledge from a cross-task teacher model. In a classification task, a teacher is learned from images with non-overlapping classes, while its learning experience is distilled to facilitate the training of the student model.

The main idea of knowledge distillation is to specify a kind of dark knowledge, based on which the student is asked to align with the teacher. For example, instead of the extreme "black or white" supervision, the teacher specifies how similar an instance is to each candidate category. Although matching predictions enables the knowledge flow across different neural architectures [20, 35], its dependence on the instance-label relationship restricts both teacher and student to the same label space. In this paper, we enable the student to utilize the learning experience from a cross-task teacher, i.e., a neural network with a non-overlapping label space, which generalizes knowledge reuse approaches to more applications.

The label difference between teacher and student impedes the direct transfer of learning experience [22]. The comparison ability of the embeddings — measuring how similar two instances are — captures a kind of invariant nature of the model [1] and is free from the label constraint [47, 33, 22]. For a teacher discerning "Husky vs. Birman" and a student discerning "Poodle vs. Persian", the teacher's discriminative embedding, which encodes "dog-cat" related characteristics, is capable of estimating the similarity relationship of instances in the student's task. Thus, we emphasize the instance-instance relationship to bridge the knowledge transfer across different tasks, and thread the knowledge reuse for both the embedding and the top-layer classifier by taking advantage of the teacher's comparison ability. Figure 1 illustrates the notion of cross-task distillation.

To this end, we propose a two-stage approach, RElationship FacIlitated Local cLassifiEr Distillation (REFILLED). First, the discriminative ability of features is stressed. For those hard triplets determined by the student's embedding, how the teacher compares them acts as the soft supervision. In other words, the teacher promotes the discriminative ability of the student's embedding by specifying, for each object, the proportion by which a dissimilar impostor should be farther away than a target nearest neighbor. Next, the teacher constructs soft supervision for classifying each instance by measuring its similarity to local embedding centers. Specifically, the classification confidences of the student model and the embedding-induced "instance-label" predictions of the teacher are aligned. Empirical results verify that REFILLED effectively transfers the classification ability from a cross-task teacher to a student. The same mechanism obtains state-of-the-art performance on standard knowledge distillation benchmarks as well. We also investigate the middle-shot learning problem, where REFILLED is superior to some popular meta-learning methods.

In summary, we contribute to enhancing the training efficiency of a deep neural network by reusing the knowledge from a cross-task model. The proposed REFILLED approach aligns the high-order comparison relationships between models in a local manner, and works well in both cross-task and same-task distillation problems.

We start by introducing the related literature and the preliminaries in Section 2 and Section 3. We then formalize the REFILLED approach in Section 4. After detailed discussions, we present the experiments and conclude.

2. Related Work

Taking advantage of the learning experience from related pre-trained models is an effective way to facilitate model training in the current task [70]. Different from fine-tuning [16] or parameter regularization [28, 11, 30, 48, 62], knowledge distillation/reuse extracts kinds of dark knowledge/privileged information [53, 51, 52] from a fixed strong model (a.k.a. the "teacher") and enriches the training of the target model (a.k.a. the "student") with more signals. Distilling the knowledge from one model to another has been investigated for model interpretability [71] and compression [6], and has been widely applied between deep neural networks since [20, 44, 35] with the help of soft targets. The teacher is usually set as a high-capacity deep neural network or a previous model generation in the current task [5, 13, 60]. Richer supervision such as hidden layer activations [43, 67, 9, 25], parameter flows [64], and transformations [29] has been explored. Theoretical analyses and empirical studies of knowledge distillation can be found in [40, 15, 19, 8].

Owing to the strong correspondence between classifier and categories, it is difficult to reuse the classification knowledge from a cross-task teacher. Heterogeneous transfer learning and multi-task learning train a joint model on the current and related domains/tasks to fill the gap of label/distribution divergence [27]. Heterogeneous model reuse takes advantage of a model from a related task, which relieves the burden of data storage and thus decreases the risk of privacy leakage [62, 59]. Meta-learning has also been utilized to transfer knowledge across different label spaces, e.g., in few-shot learning [54, 46, 12, 41], but it requires a special training strategy for the teacher.

Different from matching the instance-label predictions between models, embeddings [7, 2], pairwise distances [39, 49], and similarity graphs [31] have been investigated to improve the quality of features towards discriminative embeddings, so that "downstream" cross-task clustering and representation learning tasks can be improved [22, 38, 65]. The proposed REFILLED approach is general for both same-task and cross-task distillation, where the classification ability of the teacher is transferred to the student by matching high-order local comparisons.

Embedding learning improves the feature representation by pulling similar instances together and pushing dissimilar ones away [57, 45, 33, 63]. Various kinds of side information such as pairs [10] and triplets [57] are collected as weak supervision in terms of the instance-wise relationship. Stochastic embeddings [32, 50, 3] learn hidden representations to explain the provided relationships; in REFILLED, the relative instance comparisons measured by a cross-task teacher model are embedded by the student. A local version of the nearest class mean classifier [34, 46] is leveraged to distill the classification ability once good features are obtained.

3. Knowledge Reuse via Distillation

In this section, we first introduce the way to distill knowledge from a high-capacity teacher classifier with soft labels, and then describe the cross-task distillation problem.

3.1. Background and Notations

For a C-class classification task, we denote the training data with N examples as $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^D$ and $\mathbf{y}_i \in \{0,1\}^C$ are the instance and its one-hot label, respectively. The index of the 1 in $\mathbf{y}_i$ indicates the class of $\mathbf{x}_i$. The target is to learn a classifier $f(\mathbf{x}): \mathbb{R}^D \mapsto \{0,1\}^C$ (e.g., a deep neural network) based on $\mathcal{D}$, which maps an instance to its label. $f$ can be decomposed into a feature extractor $\phi: \mathbb{R}^D \mapsto \mathbb{R}^d$ and a linear classifier $W \in \mathbb{R}^{d \times C}$, such that $f(\mathbf{x}) = W^\top \phi(\mathbf{x})$.¹ The objective to learn the model $f$ is

$$\min_{f} \; \sum_{i=1}^{N} \ell\big(f(\mathbf{x}_i), \mathbf{y}_i\big) \qquad (1)$$

where $\ell$ is a loss such as the cross-entropy, which measures the discrepancy between the prediction and the true label.

¹ We omit the bias term for simplicity of the discussion.

3.2. Distill Knowledge from a Same-Task Teacher

To improve the training efficacy of $f$, [20] suggests distilling "dark knowledge" from another pre-trained teacher model by aligning soft targets:

$$\min_{f_S} \; \sum_{i=1}^{N} \ell\big(f_S(\mathbf{x}_i), \mathbf{y}_i\big) + \lambda\, R\big(s_\tau(f_T(\mathbf{x}_i)),\; s_\tau(f_S(\mathbf{x}_i))\big) \qquad (2)$$

The subscripts "T" and "S" denote the model/parameters of the teacher and the student (the current task model), respectively. $\lambda > 0$ is a trade-off parameter. $s_\tau$ transforms the logits into a softened C-way probability:

$$s_\tau(f(\mathbf{x}_i)) = \mathrm{softmax}\big(f(\mathbf{x}_i)/\tau\big) \qquad (3)$$

$\tau$ is a non-negative temperature; the larger the value of $\tau$, the smoother the output. $R(\cdot)$ measures the difference between two distributions, e.g., the Kullback-Leibler divergence. In Eq. 2, the student not only learns the mapping $f$ from an instance to its label over $\mathcal{D}$, but also keeps its predictions consistent with the teacher. Note that the student and the teacher could use different temperatures.

Since the teacher model usually possesses larger capacity [20, 7, 35] or better parameters [13, 60], its predictions encode the relationship between an instance and its candidate classes. Other forms of dark knowledge along the thread of instance-label mapping have also been investigated, such as hidden activations [43] and parameter flows [64].
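As a concrete illustration of Eq. 2 and Eq. 3, the following is a minimal PyTorch-style sketch of the same-task soft-target objective; the function name kd_loss and the default hyper-parameter values are illustrative assumptions, not part of the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, lam=1.0):
    """Sketch of the same-task distillation objective in Eq. 2.

    The KL term compares temperature-softened distributions (Eq. 3);
    tau and lam are illustrative hyper-parameters, not values
    prescribed by the paper.
    """
    # Standard cross-entropy with the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Softened probabilities of teacher and student (Eq. 3).
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    # KL(teacher || student) as the distribution discrepancy R.
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return ce + lam * kl
```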

3.3. Distill Knowledge from a Cross-Task Teacher

The knowledge reuse in Eq. 2 requires the teacher network to target the same labels as the student model, so that their classification results on the same instance can be matched. In a more general scenario, however, it is necessary to borrow the learning experience from a cross-task teacher, i.e., a teacher $f_T$ pre-trained on classes that do not overlap with those of the student $f_S$. Relaxing this condition enables knowledge reuse across related tasks.

4. REFILLED for Cross-Task Distillation

We introduce the main idea of the RElationship FacIlitated Local cLassifiEr Distillation (REFILLED) approach, followed by analyses and discussions of its two stages.

4.1. Main Ideas of REFILLED

Towards reusing the knowledge from a cross-task teacher, REFILLED decomposes the model into two components, i.e., the embedding and the top-layer classifier, so that the knowledge for each component can be distilled separately. There are two stages in REFILLED. First, the discriminative ability of features is distilled by aligning the high-order instance-wise comparisons of the student with those of the teacher, which bridges the gap between non-overlapping label spaces. After that, the teacher's classification confidences based on local embedding centers further facilitate the training of the student's classifier.

4.2. Distill the Embedding

Empirical studies verify that the embedding extracted by the penultimate layer of a deep neural network possesses a discriminative property [58, 18, 1]: similar instances are close and dissimilar ones are far away. Since the instance embedding reveals whether two objects are similar or not and does not rely on the specific label of each class, it can be used across different label spaces [57, 4, 45, 47, 33, 22].

Direct Embedding Distillation. One intuitive way to match the instance-wise relationship between teacher and student is to align their embeddings directly, e.g., minimizing the loss $\|\phi_S(\mathbf{x}) - \phi_T(\mathbf{x})\|_2^2$ over all instances in the current task [7, 14, 25]. This constraint requires both models to have embeddings of the same size, which is too strong to satisfy, especially when there is an architecture gap between the two models. [31, 38, 39, 49] reuse the embedding-based pairwise relationship of the teacher, where the pairwise similarity measured by the student's embedding should take the same value as the teacher's measure. This still suffers from the architecture difference — even if the student captures the right similarity relationship, it could be wrongly rectified by the teacher due to their scale differences. Therefore, considering the discrepancy between the embedding spaces, REFILLED asks the teacher to provide its estimation of relative comparisons among instances in the form of triplets, and requires the student to align such relative similarity judgments to obtain discriminative embeddings.

Align Triplets. A triplet $(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k)$ contains an anchor $\mathbf{x}_i$, its similar target neighbor $\mathbf{x}_j$, and its dissimilar impostor $\mathbf{x}_k$.² The distance between $(\mathbf{x}_i, \mathbf{x}_j)$ based on the embedding $\phi$ is $\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_j) = \|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|_2$. A good embedding makes $\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_j)$ smaller than $\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_k)$. We use the stochastic triplet probability [50] as a kind of "dark knowledge", which encodes how much closer the anchor is to its target neighbor than to its impostor:

$$p_{ijk}(\phi) = \frac{\exp\big(-\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_j)/\tau\big)}{\exp\big(-\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_j)/\tau\big) + \exp\big(-\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_k)/\tau\big)} \qquad (4)$$

Eq. 4 measures the relative instance-wise similarities in a triplet form. If the target neighbor $\mathbf{x}_j$ is close to the anchor while the impostor is far away, $p_{ijk}$ is large; otherwise the probability is small.

² Usually, we regard two instances as similar if they come from the same class, and as dissimilar if they have different labels.

Figure 2. Illustration of the proposed RElationship FacIlitated Local cLassifiEr Distillation (REFILLED), which has two stages: it first distills the discriminative embedding by aligning triplets, e.g., the relative similarities between two impostors (denoted by the red and blue arrows) are specified by the teacher; REFILLED then distills the classification ability via local embedding-based classifiers. With the class prototypes (denoted by stars), the teacher provides a good estimation of the classification confidence. More details can be found in the text.

Different from the vanilla triplets generated from labels with only binary "similar or not" information [57, 45, 47, 33], we take advantage of the stochastic triplet probability to introduce richer similarity comparison information towards more effective embedding learning. With a slight abuse of notation, we also use the temperature $\tau$ to soften the probability in Eq. 4.

In REFILLED, we improve the discriminative ability of the student's embedding $\phi_S$ by distilling the triplet comparison knowledge from the teacher. Defining the Bernoulli distribution $P_{ijk}(\phi) = [p_{ijk}(\phi),\, 1 - p_{ijk}(\phi)]$, we minimize the KL-divergence over all generated triplets:

$$\min_{\phi_S} \; \sum_{ijk} \mathrm{KL}\big(P_{ijk}(\phi_T) \,\|\, P_{ijk}(\phi_S)\big) \qquad (5)$$

By aligning this novel kind of dark knowledge in Eq. 5, the student is expected to acquire a comparison ability as strong as the teacher's. Triplet matching has two main advantages. First, with the help of the teacher, Eq. 5 not only encodes the high-order relationship between instances but also specifies the differences between the generated triplets. For example, although three images of "black tern" are all similar to one "red-winged blackbird" image, the two flying black terns should be closer to each other than either is to the black tern drinking water. Second, aligning the triplet comparisons between different models removes the influence of the scale and embedding-size differences between neural architectures.

One key component remains: collecting the triplets for relationship distillation. In our implementation, we generate "semi-hard" triplets [45] based on the student's embedding (triplets whose $\mathbf{Dist}_{\phi_S}(\mathbf{x}_i, \mathbf{x}_k)$ is relatively small compared with $\mathbf{Dist}_{\phi_S}(\mathbf{x}_i, \mathbf{x}_j)$). Thus, if the student finds some triplets hard to evaluate, it queries the teacher for concrete measures of the similarity proportions. We apply $\ell_2$-normalization to all embeddings before computing their distances, and only apply the temperature in $P_{ijk}(\phi_T)$.
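To make the two ingredients concrete, the following is a minimal PyTorch-style sketch of Eq. 4 and Eq. 5 for a batch of pre-sampled triplets; triplet_kl_loss and the index-tensor interface are illustrative choices under the assumptions stated in the comments, not the official implementation.

```python
import torch
import torch.nn.functional as F

def triplet_kl_loss(emb_s, emb_t, anchors, positives, negatives, tau=1.0):
    """Sketch of the triplet-aligning objective (Eq. 4 and Eq. 5).

    emb_s / emb_t: student / teacher embeddings of a mini-batch;
    anchors, positives, negatives: index tensors of the sampled
    (i, j, k) triplets.  The temperature tau is only applied on the
    teacher side, following the paper; the triplet sampler itself is
    a separate component.
    """
    def triplet_prob(emb, temperature):
        z = F.normalize(emb, dim=1)            # l2-normalize embeddings
        d_ij = (z[anchors] - z[positives]).norm(dim=1)
        d_ik = (z[anchors] - z[negatives]).norm(dim=1)
        # p_ijk: probability that the anchor is closer to its target
        # neighbor than to its impostor (Eq. 4).
        return torch.sigmoid((d_ik - d_ij) / temperature)

    p_t = triplet_prob(emb_t, tau).detach()    # teacher, softened
    p_s = triplet_prob(emb_s, 1.0)             # student, no temperature
    # KL between the two Bernoulli distributions (Eq. 5).
    eps = 1e-8
    kl = p_t * torch.log((p_t + eps) / (p_s + eps)) \
         + (1 - p_t) * torch.log((1 - p_t + eps) / (1 - p_s + eps))
    return kl.mean()
```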

Discussions. Define $\rho_{ijk} = 1 - p_{ijk}(\phi_T)$ and $\iota(x) = \ln(1 + \exp(-x))$ as the logistic loss. We can rethink the objective in Eq. 5 by reformulating

$$\mathrm{KL}\big(P_{ijk}(\phi_T) \,\|\, P_{ijk}(\phi_S)\big) \;\cong\; \rho_{ijk}\big(\mathbf{Dist}_{\phi_S}(\mathbf{x}_i, \mathbf{x}_k) - \mathbf{Dist}_{\phi_S}(\mathbf{x}_i, \mathbf{x}_j)\big) + \iota\big(\mathbf{Dist}_{\phi_S}(\mathbf{x}_i, \mathbf{x}_k) - \mathbf{Dist}_{\phi_S}(\mathbf{x}_i, \mathbf{x}_j)\big) \qquad (6)$$

where $\cong$ neglects constants. In addition to optimizing over the embedding triplets with the loss $\iota$, Eq. 6 adds different weights when minimizing (resp. maximizing) the distance between similar (resp. dissimilar) pairs, based on the teacher's estimation. For example, if $(\mathbf{x}_i, \mathbf{x}_j)$ are not that similar compared with $(\mathbf{x}_i, \mathbf{x}_k)$, the teacher specifies a relatively lower probability $p_{ijk}$ to compensate for the over-emphasis of similarity/dissimilarity in the triplet, and the minimization of $\mathbf{Dist}_{\phi_S}(\mathbf{x}_i, \mathbf{x}_j)$ in $\iota$ is weakened with weight $\rho_{ijk}$.

4.3. Distill the Local Classifier

By aligning the triplet probabilities, the student's embedding depicts the relationship between instances as well as the teacher does, which facilitates the "downstream" task. Considering the transition between label spaces, REFILLED further proposes to distill the classification ability from the teacher via an embedding-based local classifier.

Embedding-Based Local Classifier. During the stochastic optimization of the student model, with a slight abuse of notation, denote by $(X \in \mathbb{R}^{N \times D},\, Y \in \{0,1\}^{N \times C})$ the instances and one-hot labels in a sampled mini-batch, respectively. Note that the batch may not cover all the classes in the data. With the teacher's embeddings $\phi_T(X) \in \mathbb{R}^{N \times d}$ of $X$, we compute the embedding center of each class by

$$P = \mathrm{diag}\big(\mathbf{1} \oslash (Y^\top \mathbf{1})\big)\, Y^\top \phi_T(X) \;\in\; \mathbb{R}^{C \times d} \qquad (7)$$

where $\oslash$ denotes element-wise division. Each row $\mathbf{p}_c \in \mathbb{R}^d$ of $P$ corresponds to the center of the $c$-th class in the batch. The label of an instance in the batch can then be predicted by

$$p_{\phi_T}(\mathbf{y}_i \mid \mathbf{x}_i) = \mathrm{softmax}\big(-\|\phi_T(\mathbf{x}_i) - \mathbf{p}_c\|_2^2 / \tau\big) \qquad (8)$$

which is normalized over the set of negative distances from the instance embedding $\phi_T(\mathbf{x}_i)$ to all class centers in $P$. $p_{\phi_T}(\mathbf{y}_i \mid \mathbf{x}_i)$ is large if $\phi_T(\mathbf{x}_i)$ is close to $\mathbf{p}_c$ in the teacher's embedding space, and $\tau$ is the temperature. Eq. 8 works in the same manner as a local nearest class mean classifier [34], where only the classes in the currently sampled batch are taken into account. It can be applied even to classification tasks across non-overlapping label spaces [46, 61].
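As an illustration, here is a minimal PyTorch-style sketch of the local embedding-based classifier of Eq. 7 and Eq. 8; local_teacher_posterior is a hypothetical helper name, and masking absent classes reflects the "local" restriction to classes present in the batch.

```python
import torch
import torch.nn.functional as F

def local_teacher_posterior(emb_t, labels, num_classes, tau=1.0):
    """Sketch of the local embedding-based classifier (Eq. 7 and Eq. 8).

    emb_t: teacher embeddings of a mini-batch, shape (N, d);
    labels: integer class ids of the batch, shape (N,).
    Only the classes that actually appear in the batch receive a
    prototype; the others are masked out of the softmax.
    """
    one_hot = F.one_hot(labels, num_classes).float()          # Y, (N, C)
    counts = one_hot.sum(dim=0)                                # per-class counts
    present = counts > 0                                       # classes in the batch
    # Class centers P = diag(1 / (Y^T 1)) Y^T phi_T(X)  (Eq. 7).
    centers = one_hot.t() @ emb_t / counts.clamp(min=1).unsqueeze(1)
    # Negative squared distances to every center in the batch.
    dists = torch.cdist(emb_t, centers).pow(2)                 # (N, C)
    logits = -dists / tau
    logits[:, ~present] = float("-inf")                        # local: batch classes only
    return F.softmax(logits, dim=1)                            # p_{phi_T}(y | x)  (Eq. 8)
```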

Local Knowledge Distillation. Equipped with Eq. 8, the classification ability of a cross-task teacher can be further reused to train the student's classifier. Therefore, we incorporate a local knowledge distillation term into Eq. 1:

$$\min_{f_S} \; \sum_{i=1}^{N} \ell\big(f_S(\mathbf{x}_i), \mathbf{y}_i\big) + \lambda\, \mathrm{KL}\big(p_{\phi_T}(\mathbf{y}_i \mid \mathbf{x}_i),\; s_\tau(f_S(\mathbf{x}_i))\big) \qquad (9)$$

Benefiting from the local classifier induced by the teacher's embedding, the student's classifier can be further supervised by a cross-task teacher. In the second term of Eq. 9, rather than aligning the two models' confidences over all classes in the data set, only the posteriors of the classes in the sampled mini-batch are matched. This local knowledge also helps when distilling from a same-class teacher (see Section 5.2), where the two models match predictions over the classes sampled in the mini-batch. In the implementation, we also investigate an exponentially decayed weight for $\lambda$, so that the student relies on the teacher's supervision during its initial learning period while the teacher's guidance is weakened once the student itself is strong enough.
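A minimal sketch of one stage-2 update with Eq. 9 is given below, reusing the hypothetical local_teacher_posterior helper sketched above; the exponential decay schedule for the trade-off weight is an assumed illustrative form, not the paper's exact formula.

```python
import math
import torch
import torch.nn.functional as F

def stage2_step(student, emb_t, images, labels, num_classes,
                step, total_steps, tau=4.0, lam0=1.0):
    """Sketch of one stage-2 update (Eq. 9) with a decayed weight."""
    logits = student(images)
    ce = F.cross_entropy(logits, labels)
    # Teacher's local, embedding-induced posterior (Eq. 8), restricted
    # to the classes present in this mini-batch.
    p_t = local_teacher_posterior(emb_t, labels, num_classes, tau)
    present = torch.zeros(num_classes, dtype=torch.bool, device=logits.device)
    present[labels] = True
    log_p_s = F.log_softmax(logits[:, present] / tau, dim=1)
    kl = (p_t[:, present] * (torch.log(p_t[:, present] + 1e-8) - log_p_s)).sum(1).mean()
    lam = lam0 * math.exp(-5.0 * step / total_steps)   # decayed weight; schedule is assumed
    return ce + lam * kl
```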

Discussions. By decoupling the embedding $\phi$ and the linear classifier $W$, the effect of knowledge distillation can be analyzed through its gradient with respect to the classifier $\mathbf{w}_c$ of the $c$-th class (denoting the objective of Eq. 2 as $O$):

$$\frac{\partial O}{\partial \mathbf{w}_c} = \sum_{\mathbf{x}} \Big[ -p_c + \sum_{c'=1}^{C} p_{c'}\, q_c \Big]\, \phi(\mathbf{x}) \qquad (10)$$

where $q_c$ and $p_c$ are the teacher's and student's posterior probabilities of the $c$-th class given an instance $\mathbf{x}$, respectively.

Algorithm 1 The flow of REFILLED.
Require: pre-trained teacher's embedding $\phi_T$.
Distill the Embedding:
for Iter = 1, ..., MaxIter do
  Sample a mini-batch $\{(\mathbf{x}_i, \mathbf{y}_i)\}$.
  Generate triplets $\{(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k)\}$ with the student's embeddings $\{\phi_S(\mathbf{x})\}$.
  Compute the triplet probabilities $p_{ijk}(\phi_T)$ as in Eq. 4.
  Optimize $\phi_S$ by aligning the triplets with Eq. 5.
end for
Distill the Classifier:
Initialize $f_S$ with $\phi_S$.
Optimize $f_S$ with Eq. 9.

Different from the vanilla loss, when considering the soft supervision from the teacher, not only the instance from the target class but also those from helpful related classes (the ones with large $p_{c'}$) are incorporated to direct the update of the classifier. Since the summation in Eq. 10 is computed over all $C$ classes, the normalized class posterior $q_c$ becomes small when $C$ is large, so the helpful class instances are not emphasized strongly. Therefore, we consider a local version of the knowledge distillation term in Eq. 9, where only the classes in the current mini-batch are considered, i.e., the influence of a helpful related class selected by the teacher is better emphasized in the update of $\mathbf{w}_c$.

The Two-Stage REFILLED Approach. In summary, REFILLED uses two steps to reuse the holistic knowledge of the teacher through its embedding, improving the discerning ability of the student's embedding and classifier, respectively. The whole flow of REFILLED for cross-task distillation is illustrated in Figure 2 and Algorithm 1.
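For concreteness, the following is a minimal PyTorch-style sketch of the two-stage flow in Algorithm 1. It reuses the hypothetical helpers sketched above (triplet_kl_loss, stage2_step) and assumes an external semi-hard triplet sampler and a student model exposing an embedding head; none of these names come from the official implementation.

```python
import torch

def train_refilled(student, teacher_embed, loader, epochs_e, epochs_c,
                   num_classes, sampler, lr=0.1):
    """Sketch of the two-stage flow in Algorithm 1 (illustrative only)."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    # Stage 1: distill the embedding by aligning triplets (Eq. 5).
    for _ in range(epochs_e):
        for images, labels in loader:
            emb_s = student.embed(images)                 # assumed embedding head
            with torch.no_grad():
                emb_t = teacher_embed(images)
            i, j, k = sampler(emb_s, labels)              # semi-hard triplets
            loss = triplet_kl_loss(emb_s, emb_t, i, j, k)
            opt.zero_grad(); loss.backward(); opt.step()
    # Stage 2: distill the classifier with the local KD term (Eq. 9).
    total = epochs_c * len(loader)
    step = 0
    for _ in range(epochs_c):
        for images, labels in loader:
            with torch.no_grad():
                emb_t = teacher_embed(images)
            loss = stage2_step(student, emb_t, images, labels,
                               num_classes, step, total)
            opt.zero_grad(); loss.backward(); opt.step()
            step += 1
```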

5. Experiments

We verify REFILLED on a variety of tasks, namely classification by reusing cross-task models, standard knowledge distillation, and middle-shot learning.

5.1. Cross-Task Knowledge Distillation

REFILLED is able to reuse a cross-task teacher to assist the training of a student model.

Datasets. Caltech-UCSD Birds-200-2011 (CUB) [55] constructs a fine-grained classification problem over 200 different species of birds. We use the first 100 classes to train the teacher, and learn the student model on the remaining 100 classes.

Implementation Details. We use different configurations of MobileNets [21] and adjust the model complexity via the width multiplier (more complex models have more channels). The teacher is trained with the cross-entropy loss and width multiplier 1.0. We change the width multiplier of the student in {1, 0.75, 0.5, 0.25}.

Table 1. The mean accuracy of cross-task distillation on the CUB data set, where teacher and student are trained on non-overlapping sets of 100 classes with MobileNets. The three values in the "Teacher" row correspond to baselines: 1NN based on the teacher's embedding, a linear LR classifier trained on the fixed teacher's embedding, and Fine-Tuning (FT) initialized from the teacher's embedding.

Width Multiplier   1       0.75    0.5     0.25
Teacher            1NN: 45.31,  LR: 53.82,  FT: 65.72
Student            71.25   67.56   66.85   64.48
RKD [38]           70.83   68.80   67.44   63.97
REFILLED           73.38   70.42   69.77   67.10

Evaluations. For each 100-way classification task, we split 70% of the data in each class for training, and the remainder is used for testing. The teacher model is first trained on the first 100 classes until convergence and then used to direct the training of the student model on the non-overlapping classes. The averaged classification accuracy over 3 trials is reported. The neural networks are optimized by SGD with momentum. Detailed configurations are in the supplementary material.

Results. The results of cross-task distillation are shown in Table 1. We first investigate three baselines that adapt the teacher for cross-task classification, i.e., 1NN based on the teacher's embedding, training a linear Logistic Regression (LR) on the fixed teacher's embedding, and Fine-Tuning (FT) the teacher model initialized by the pre-trained embedding. The test accuracy of the student becomes higher when learning the task with more complex models (with larger width multiplier values). We also compare with one representative embedding-based approach, Relational Knowledge Distillation (RKD) [38], and fine-tune the model after obtaining the distilled embedding from the cross-task teacher. RKD sometimes obtains better accuracy than the vanilla student model. Our REFILLED achieves the best classification performance in all cases. Benefiting from reusing the knowledge of the teacher, the classification achieves a further improvement w.r.t. vanilla training.

Will All Components in REFILLED Help? Given the pre-trained weights of the teacher and fixing the student's width multiplier to 1, we investigate three fine-tuning variants in Figure 3 besides training the student model directly (Vanilla): fine-tuning with the distilled embedding after the first stage of REFILLED (REFILLED1st), fine-tuning with Eq. 9 using a fixed λ (REFILLED−), and REFILLED (which uses the exponentially decayed λ). The step-wise improvements of the classification results verify the effectiveness of each component of REFILLED.

Figure 3. Left: the mean accuracy of different variants of REFILLED on CUB for cross-task distillation; right: the change of accuracy when the number of instances per class (shot) varies.

REFILLED with Different Sizes of Target Task Data. To test the limits of the knowledge distillation ability of REFILLED, we construct the target classification task with different sizes of training data. When the number of effective training examples is small, it is more difficult to train the student model, so the help from the teacher becomes more important. We vary the number of instances per class in the student's task from 5 to 30, and the averaged classification accuracies are shown in Figure 3. REFILLED keeps a performance margin over the comparison methods in all cases.

5.2. Standard Knowledge Distillation

REFILLED is a general approach that also helps the training of a student with a same-class teacher.

Datasets. Following [2], we test the knowledge distillation ability of REFILLED on another benchmark, CIFAR-100 [26], besides CUB. CIFAR-100 contains 100 classes with 600 small images per class; in each class, there are 500 images for training and 100 images for testing. We use the standard split to train both teacher and student models. We also evaluate REFILLED on CUB, where all 200 classes are used during training based on the standard split.

Implementation Details. We test the effectiveness of REFILLED across diverse architectures, i.e., ResNet [17], Wide-ResNet [66], and MobileNets [21]. To investigate different capacities of the teacher and student, we change the depth of ResNet (through the number of layers), the width and depth of Wide-ResNet, and the width multiplier of MobileNets. Both teacher and student are trained on the same training set until convergence.

Evaluations. Both teacher and student are trained on the same set with three different initialization seeds, and we report the mean accuracy of the student on the test set.

Table 2. The average classification results of knowledge distillation methods on the CIFAR-100 data set based on Wide-ResNet. We fix the teacher to (depth, width) = (40, 2), and set the student capacity with different (depth, width) values.

(depth, width)   (40, 2)   (16, 2)   (40, 1)   (16, 1)
Teacher          74.44
Student          74.44     70.15     68.97     65.44
KD [20]          75.47     71.87     70.46     66.54
FitNet [43]      74.29     70.89     68.66     65.38
AT [67]          74.76     71.06     69.85     65.31
NST [23]         74.81     71.19     68.00     64.95
VID-I [2]        75.25     73.31     71.51     66.32
KD+VID-I [2]     76.11     73.69     72.16     67.19
RKD [38]         76.62     72.56     72.18     65.22
REFILLED         77.49     74.01     72.72     67.56

Table 3. The average classification results of knowledge distillation methods on CUB based on MobileNets. We fix the teacher's width multiplier to 1.0, and change the student's multiplier.

Width Multiplier   1       0.75    0.5     0.25
Teacher            75.36
Student            75.36   74.87   72.41   69.72
KD [20]            77.61   76.02   74.24   72.03
FitNet [43]        75.10   75.03   72.17   69.09
AT [67]            76.22   76.10   73.70   70.74
NST [23]           76.91   77.05   74.03   71.54
KD+VID-I [2]       77.03   76.91   75.62   72.23
RKD [38]           77.72   76.80   74.99   72.55
REFILLED           78.95   78.01   76.11   73.42

Distillation From Same Architecture Family Models. We first test the case where teacher and student come from the same model family. The results on CIFAR-100 and CUB are shown in Table 2 and Table 3, respectively. On CIFAR-100 we exactly follow the evaluation protocol of [2], which implements teacher and student with Wide-ResNet. We re-implement RKD [38] and cite the results of the other comparison methods from [2]. For CUB, we use MobileNets as the basic model. Since the teacher possesses more capacity, its learning experience assists the training of the student once the knowledge distillation methods are utilized. REFILLED achieves the best classification performance in all settings, which validates that transferring the knowledge of both the embedding and the classifier is one of the key factors for model reuse.

Will Embedding Help for Knowledge Distillation? We use the Normalized Mutual Information (NMI) as a criterion to measure the embedding quality (the larger, the better). In Table 4, we compute the NMI of the student's embedding on CIFAR-100 when trained with and without aligning the teacher's triplets. Figure 4 visualizes the embedding quality over 10 sampled classes using tSNE [50]. Both the quantitative and qualitative results verify the effectiveness of the triplet-aligning step of REFILLED for knowledge distillation.

Table 4. The NMI on CIFAR-100, evaluating the embedding quality before and after the Triplet Aligning (TA) step of REFILLED.

(depth, width)   (40, 2)   (16, 2)   (40, 1)   (16, 1)
w/o TA           56.50     54.91     54.02     51.77
w/ TA            59.63     57.98     57.62     54.39

Table 5. The mean accuracy on CIFAR-100, evaluating the effectiveness of Local Knowledge Distillation (LKD) in REFILLED.

(depth, width)   (40, 2)   (16, 2)   (40, 1)   (16, 1)
w/ KD            77.08     73.57     72.24     67.14
w/ Local KD      77.49     74.01     72.72     67.56

Figure 4. The tSNE [50] visualization of the vanilla student embedding (left, NMI = 51.77) and the improved embedding after the first stage of REFILLED (right, NMI = 54.39) over 10 classes sampled from CIFAR-100.

Will Local Knowledge Distillation Help? The results in Table 5 verify the further improvement of Local Knowledge Distillation (LKD) in Eq. 9 over vanilla Knowledge Distillation (KD) when training on the distilled embedding from the first stage of REFILLED. A local consideration of probability matching helps.

Distillation From Different Model Families. To further evaluate the performance of REFILLED, we use it to distill the knowledge from a cross-family teacher. For CIFAR-100, we set the teacher as ResNet-110 and use MobileNets with different width multipliers as the student model. Table 6 shows the results, and REFILLED keeps its superiority in this case. More results are in the supplementary material.

5.3. Middle-Shot Learning

Training a deep neural network with a middle-shot amount of data is a difficult task, where models are prone to over-fitting. In this subsection, we apply REFILLED to middle-shot learning, where the classification ability of a teacher trained on SEEN classes is used to help train the student model on UNSEEN-class middle-shot tasks.

Table 6. The average classification accuracy of knowledge distillation methods on the CIFAR-100 data set. The teacher is trained with ResNet-110, which achieves 74.09% test accuracy. The student is learned with MobileNets, whose width multiplier is varied.

Width Multiplier   1       0.75    0.5     0.25
Student            68.57   67.92   65.66   60.87
KD [20]            70.34   68.21   66.06   61.38
FitNet [43]        67.99   67.85   65.12   61.01
AT [67]            68.97   67.88   66.44   62.15
NST [23]           70.62   70.49   69.15   61.32
KD+VID-I [2]       71.94   70.13   68.51   62.50
RKD [38]           70.41   68.93   66.24   61.44
REFILLED           73.81   72.88   70.02   63.15

Datasets. We use the popular MiniImageNet data set [54], which contains 100 classes with 600 images per class. Following [54, 42], 64 classes (SEEN classes) are used to train the teacher, 16 classes are used for validation, and we sample tasks from the remaining 20 classes to train the student.

Implementation Details. Following the literature, we investigate two different backbones, a 4-layer ConvNet [54, 46, 12] and a ResNet [37, 61], which output embeddings with 64 and 640 dimensions, respectively. We train a teacher model on the SEEN classes with the ResNet or the ConvNet, and use it to help train the student classifier on tasks composed of UNSEEN classes.

Evaluations. We define a K-shot C-way task as a C-class classification problem with K instances per class. Different from the few-shot learning setting where K ∈ {1, 5}, here we consider a few more instances per class, i.e., K ∈ {10, 30}. Note that even K = 30 is not enough to train a complicated neural network from scratch. We sample 5-way tasks from the 20-class split to train the student model and evaluate its performance by classifying another 15 instances from each of the 5 sampled classes. We report the final performance as the mean accuracy over 600 trials. More results on few-shot learning are in the supplementary material.

Comparison Methods. Meta-learning is a popular way to solve the few-shot classification problem. To mimic the test case, it samples C-way K-shot tasks from the SEEN class set to learn task-level inductive biases such as the embedding [54, 46] or the initialization [12, 41]. However, the computational burden (e.g., the batch size) becomes large when the number of shots increases. Besides, meta-learning needs to specify the way to obtain a meta-model from the SEEN classes. We compare our method with embedding-based meta-learning approaches such as ProtoNet [46] and FEAT [61]. We can also make predictions directly with the teacher's embedding (the penultimate layer of the teacher) by leveraging the nearest neighbor (1NN). Based on the teacher's embedding, we further train linear classifiers such as an SVM, or fine-tune the whole model on the middle-shot training data of the sampled tasks. We tune the hyper-parameters of these methods with middle-shot tasks sampled from the validation split.

Table 7. The mean accuracy over 600 trials of middle-shot tasks. We set the student model as the ConvNet, and investigate both the ResNet and the ConvNet as the teacher model for REFILLED. REFILLED1 denotes the result of reusing a ResNet teacher and REFILLED2 the result of reusing a ConvNet teacher. Detailed results and configurations are in the supplementary material.

Tasks           10-Shot 5-Way   30-Shot 5-Way
1NN             66.56           69.80
SVM             74.24           77.87
Fine-Tune       74.95           78.62
ProtoNet [46]   74.42           78.10
FEAT [61]       74.86           78.84
REFILLED1       76.42           80.33
REFILLED2       75.37           78.94

Results. The results of middle-shot learning are shown in Table 7. When the number of shots becomes large, fine-tuning is a very strong baseline, which gets better results than some meta-learning approaches. Our REFILLED method achieves better results than fine-tuning, which validates the importance of reusing the knowledge of a cross-task teacher when training a classifier.

6. Conclusion

Although knowledge distillation facilitates the transfer of learning experience between heterogeneous models, i.e., neural networks with different architectures, it is still challenging to reuse models across non-overlapping label spaces. In this paper, we focus on matching the comparison ability based on embeddings, which not only removes the label space constraint but also captures the high-order relationships among instances. The proposed RElationship FacIlitated Local cLassifiEr Distillation (REFILLED) approach has two stages, namely embedding aligning and local knowledge distillation. Besides improving learning efficiency by reusing cross-task models, REFILLED also achieves better classification performance on standard knowledge distillation tasks.

Acknowledgments

This work is partially supported by the National Key R&D Program of China (2018YFB1004300), NSFC (61773198, 61632004), and the NSFC-NRF joint research project (61861146001).

References

[1] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50):1–34, 2018.
[2] Sungsoo Ahn, Shell Xu Hu, Andreas C. Damianou, Neil D. Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In CVPR, pages 9163–9171, 2019.
[3] Ehsan Amid and Antti Ukkonen. Multiview triplet embedding: Learning attributes in multiple maps. In ICML, pages 1472–1480, 2015.
[4] Ehsan Amid and Antti Ukkonen. Multiview triplet embedding: Learning attributes in multiple maps. In ICML, pages 1472–1480, 2015.
[5] Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi. Label refinery: Improving ImageNet classification through label progression. CoRR, abs/1805.02641, 2018.
[6] Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, pages 535–541, 2006.
[7] Hanting Chen, Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Learning student networks via feature embedding. CoRR, abs/1812.06597, 2018.
[8] Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In ICCV, pages 4794–4802, 2019.
[9] Wojciech M. Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu. Sobolev training for neural networks. In NeurIPS, pages 4281–4290, 2017.
[10] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In ICML, pages 209–216, 2007.
[11] Simon S. Du, Jayanth Koushik, Aarti Singh, and Barnabas Poczos. Hypothesis transfer learning via transformation functions. In NeurIPS, pages 574–584, 2017.
[12] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
[13] Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born-again neural networks. In ICML, pages 1602–1611, 2018.
[14] Mengya Gao, Yujun Shen, Quanquan Li, Chen Change Loy, and Xiaoou Tang. Feature matters: A stage-by-stage approach for knowledge transfer. CoRR, abs/1812.01819, 2018.
[15] Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. In ICLR, 2019.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[18] Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. Triplet-center loss for multi-view 3D object retrieval. In CVPR, pages 1945–1954, 2018.
[19] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. In ICCV, pages 1921–1930, 2019.
[20] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
[21] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[22] Yen-Chang Hsu, Zhaoyang Lv, and Zsolt Kira. Learning to cluster in order to transfer across domains and tasks. In ICLR, 2018.
[23] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. CoRR, abs/1707.01219, 2017.
[24] Khurram Javed and Faisal Shafait. Revisiting distillation and incremental classifier learning. In ACCV, pages 3–17, 2018.
[25] Animesh Koratana, Daniel Kang, Peter Bailis, and Matei Zaharia. LIT: Learned intermediate representation training for model compression. In ICML, pages 3509–3518, 2019.
[26] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[27] Jogendra Nath Kundu, Nishank Lakkakula, and R. Venkatesh Babu. UM-Adapt: Unsupervised multi-task adaptation using adversarial cross-task distillation. In ICCV, pages 1436–1445, 2019.
[28] Ilja Kuzborskij and Francesco Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, 106(2):171–195, 2017.
[29] Seung Hyun Lee, Dae Ha Kim, and Byung Cheol Song. Self-supervised knowledge distillation using singular value decomposition. In ECCV, pages 339–354, 2018.
[30] Xuhong Li, Yves Grandvalet, and Franck Davoine. Explicit inductive bias for transfer learning with convolutional networks. In ICML, pages 2830–2839, 2018.
[31] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In CVPR, pages 7096–7104, 2019.
[32] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.
[33] R. Manmatha, Chao-Yuan Wu, Alexander J. Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In CVPR, pages 2859–2867, 2017.
[34] Thomas Mensink, Jakob J. Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based image classification: Generalizing to new classes at near-zero cost. TPAMI, 35(11):2624–2637, 2013.
[35] Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. CoRR, abs/1902.03393, 2019.
[36] Gaurav Kumar Nayak, Konda Reddy Mopuri, Vaisakh Shaj, Venkatesh Babu Radhakrishnan, and Anirban Chakraborty. Zero-shot knowledge distillation in deep networks. In ICML, 2019.
[37] Boris N. Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In NeurIPS, pages 719–729, 2018.
[38] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019.
[39] Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In ICCV, pages 5007–5016, 2019.
[40] Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In ICML, pages 5142–5151, 2019.
[41] Hang Qi, Matthew Brown, and David G. Lowe. Low-shot learning with imprinted weights. In CVPR, pages 5822–5830, 2018.
[42] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[43] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
[44] Bharat Bhusan Sau and Vineeth N. Balasubramanian. Deep model compression: Distilling knowledge from noisy teachers. CoRR, abs/1610.09650, 2016.
[45] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
[46] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In NeurIPS, pages 4080–4090, 2017.
[47] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, pages 4004–4012, 2016.
[48] Suraj Srinivas and Francois Fleuret. Knowledge transfer with Jacobian matching. In ICML, pages 4730–4738, 2018.
[49] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In ICCV, pages 1365–1374, 2019.
[50] Laurens van der Maaten and Kilian Q. Weinberger. Stochastic triplet embedding. In MLSP, pages 1–6, 2012.
[51] Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: Similarity control and knowledge transfer. JMLR, 16:2023–2049, 2015.
[52] Vladimir Vapnik and Rauf Izmailov. Learning with intelligent teacher. In COPA, pages 3–19, 2016.
[53] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544–557, 2009.
[54] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In NeurIPS, pages 3630–3638, 2016.
[55] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[56] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A. Efros. Dataset distillation. CoRR, abs/1811.10959, 2018.
[57] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
[58] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
[59] Xi-Zhu Wu, Song Liu, and Zhi-Hua Zhou. Heterogeneous model reuse via optimizing multiparty multiclass margin. In ICML, pages 6840–6849, 2019.
[60] Chenglin Yang, Lingxi Xie, Chi Su, and Alan L. Yuille. Snapshot distillation: Teacher-student optimization in one generation. In CVPR, pages 2859–2868, 2019.
[61] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Learning embedding adaptation for few-shot learning. CoRR, 2018.
[62] Han-Jia Ye, De-Chuan Zhan, Yuan Jiang, and Zhi-Hua Zhou. Rectify heterogeneous models with semantic mapping. In ICML, pages 1904–1913, 2018.
[63] Han-Jia Ye, De-Chuan Zhan, Yuan Jiang, and Zhi-Hua Zhou. What makes objects similar: A unified multi-metric learning approach. TPAMI, 41(5):1257–1270, 2019.
[64] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In CVPR, pages 7130–7138, 2017.
[65] Lu Yu, Vacit Oguz Yazici, Xialei Liu, Joost van de Weijer, Yongmei Cheng, and Arnau Ramisa. Learning metrics from teachers: Compact networks for image embedding. In CVPR, pages 2907–2916, 2019.
[66] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
[67] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
[68] Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. Deep mutual learning. In CVPR, pages 4320–4328, 2018.
[69] Peng Zhou, Long Mai, Jianming Zhang, Ning Xu, Zuxuan Wu, and Larry S. Davis. M2KD: Multi-model and multi-level knowledge distillation for incremental learning. CoRR, abs/1904.01769, 2019.
[70] Zhi-Hua Zhou. Learnware: On the future of machine learning. FCS, 10(4):589–590, 2016.
[71] Zhi-Hua Zhou and Yuan Jiang. NeC4.5: Neural ensemble based C4.5. TKDE, 16(6):770–773, 2004.

Supplemental Material of
Distilling Cross-Task Knowledge via Relationship Matching

Han-Jia Ye, Nanjing University
[email protected]

Su Lu, Nanjing University
[email protected]

De-Chuan Zhan, Nanjing University
[email protected]

Abstract

In this supplemental material of the paper "Distilling Cross-Task Knowledge via Relationship Matching", we provide more discussions, analyses, and experiments for our proposed RElationship FacIlitated Local cLassifiEr Distillation (REFILLED) approach, which reuses the classification knowledge from a cross-task model to facilitate the training of the current task's classifier. There are three parts in the supplementary: first, we provide concrete derivations and discussions to explain the two steps in REFILLED; then, we describe the concrete settings of all the experiments in detail; last are additional experimental results.

1. Discussion on the Two Steps in REFILLED

There are two steps in REFILLED to distill the knowledge from a cross-task teacher, w.r.t. the embedding and the (top-layer) classifier, respectively. In this section, we discuss the advantage of each step, and provide concrete derivations for the main paper.

1.1. Distill the Embedding

REFILLED distills the discriminative ability of the embedding by aligning the stochastic probability over triplets with the teacher [22]. Given a triplet $(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k)$, the anchor $\mathbf{x}_i$ is similar to its target neighbor $\mathbf{x}_j$ and dissimilar to the impostor $\mathbf{x}_k$. Thus a high-quality embedding pulls similar instances together and pushes dissimilar ones far away. The similarity between instances is usually measured by their labels — we regard two instances as similar if they come from the same class, and as dissimilar if they have different labels.

Usually, we apply a loss function $\iota(\cdot)$ over the triplets to force the distances between embeddings to match the relationship indicated by the triplet. Specifically, we can optimize the embedding $\phi$ over the sampled triplets:

$$\min_{\phi} \; \sum_{ijk} \iota\big(\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_k) - \mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_j)\big) \qquad (1)$$

The summation over $ijk$ means we sum the loss over the sampled triplets $\{\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k\}$. The loss function $\iota(x)$ usually acts as an upper bound of the zero-one loss, where the larger the value of the input $x$, the smaller the loss output. By minimizing Eq. 1, the embedding makes the distance between dissimilar instances larger than the distance between similar ones. There are several options for the loss function, such as the hinge loss ($\iota(x) = \max(1 - x, 0)$) [25, 19, 11] and the logistic loss ($\iota(x) = \log(1 + \exp(-x))$).
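For concreteness, a minimal PyTorch-style sketch of the vanilla triplet objective in Eq. 1 with the hinge and logistic choices of $\iota$ is given below; the function name and index-tensor interface are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_losses(emb, anchors, positives, negatives, margin=1.0):
    """Sketch of the vanilla triplet objective in Eq. 1, with the hinge
    and logistic choices of the loss iota; the index tensors select the
    sampled (i, j, k) triplets."""
    z = F.normalize(emb, dim=1)
    d_ij = (z[anchors] - z[positives]).norm(dim=1)   # Dist(x_i, x_j)
    d_ik = (z[anchors] - z[negatives]).norm(dim=1)   # Dist(x_i, x_k)
    gap = d_ik - d_ij                                # should be large
    hinge = F.relu(margin - gap).mean()              # iota(x) = max(1 - x, 0)
    logistic = F.softplus(-gap).mean()               # iota(x) = log(1 + exp(-x))
    return hinge, logistic
```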

As in [22], we define the stochastic probability of a triplet $(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k)$ as a Bernoulli distribution $P_{ijk}$, where the random variable indicates whether the triplet is satisfied (with probability $p_{ijk}$) or not (with probability $1 - p_{ijk}$) based on the current embedding $\phi$:

$$p_{ijk}(\phi) = \frac{\exp\big(-\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_j)/\tau\big)}{\exp\big(-\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_j)/\tau\big) + \exp\big(-\mathbf{Dist}_\phi(\mathbf{x}_i, \mathbf{x}_k)/\tau\big)} \qquad (2)$$

In Eq. 3, we match the Bernoulli distributions of the triplets based on the teacher's and the student's embeddings $\phi_T$ and $\phi_S$ with the KL-divergence, so that the teacher's embedding supervises the optimization of the student's embedding:

$$\min_{\phi_S} \; \sum_{ijk} \mathrm{KL}\big(P_{ijk}(\phi_T) \,\|\, P_{ijk}(\phi_S)\big) \qquad (3)$$

Sampling Semi-Hard Triplets. All triplets are composed of instances of the current task. In other words, when optimizing the student model with stochastic gradient descent, we generate triplets for each sampled mini-batch. We apply $\ell_2$-normalization to all embeddings before computing their distances, and only apply the temperature $\tau$ in $P_{ijk}(\phi_T)$. The triplets are sampled following the semi-hard protocol [19]. In detail, we enumerate all instances in the mini-batch and treat each one as the triplet anchor. For each anchor $\mathbf{x}_i$, we first find all target neighbors $\mathbf{x}_j$ (with the same label) in the given mini-batch. Then, for each pair of $\mathbf{x}_i$ and $\mathbf{x}_j$, we set $\mathbf{x}_k$ as the nearest impostor with a label different from $\mathbf{x}_i$'s that nevertheless lies farther away from $\mathbf{x}_i$ than $\mathbf{x}_j$ does. The distances in this sampling process are measured based on the student's embedding. Therefore, if the student model finds some triplets hard to evaluate, it asks the teacher for concrete measures of the similarity proportions.
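The sampling protocol above can be sketched as follows; this is an assumed minimal implementation over a mini-batch, not the authors' code, and the double loop is kept for clarity rather than speed.

```python
import torch

def sample_semihard_triplets(emb_s, labels):
    """Sketch of the semi-hard triplet sampling described above,
    measured with the (detached) student embedding: for every
    anchor/target-neighbor pair, pick the nearest impostor that is
    still farther from the anchor than the target neighbor."""
    with torch.no_grad():
        z = torch.nn.functional.normalize(emb_s, dim=1)
        dist = torch.cdist(z, z)                      # pairwise distances
    anchors, positives, negatives = [], [], []
    n = labels.size(0)
    for i in range(n):
        for j in range(n):
            if j == i or labels[j] != labels[i]:
                continue                              # x_j: same-class target neighbor
            mask = (labels != labels[i]) & (dist[i] > dist[i, j])
            if not mask.any():
                continue                              # no semi-hard impostor for this pair
            k = torch.where(mask, dist[i], torch.full_like(dist[i], float("inf"))).argmin()
            anchors.append(i); positives.append(j); negatives.append(int(k))
    return (torch.tensor(anchors), torch.tensor(positives), torch.tensor(negatives))
```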

Interpretation of the Matching. We can rethink the objective in Eq. 3 through the following reformulation:

$$
\begin{aligned}
&\mathrm{KL}\big(P_{ijk}(\phi_T)\,\|\,P_{ijk}(\phi_S)\big) \\
&= p_{ijk}(\phi_T)\ln\frac{p_{ijk}(\phi_T)}{p_{ijk}(\phi_S)} + \big(1-p_{ijk}(\phi_T)\big)\ln\frac{1-p_{ijk}(\phi_T)}{1-p_{ijk}(\phi_S)} \\
&= \underbrace{p_{ijk}(\phi_T)\ln p_{ijk}(\phi_T)}_{\text{constant}} - p_{ijk}(\phi_T)\ln p_{ijk}(\phi_S)
   + \underbrace{\big(1-p_{ijk}(\phi_T)\big)\ln\big(1-p_{ijk}(\phi_T)\big)}_{\text{constant}} - \big(1-p_{ijk}(\phi_T)\big)\ln\big(1-p_{ijk}(\phi_S)\big) \\
&\cong -\,p_{ijk}(\phi_T)\ln p_{ijk}(\phi_S) - \ln\big(1-p_{ijk}(\phi_S)\big) + p_{ijk}(\phi_T)\ln\big(1-p_{ijk}(\phi_S)\big) \\
&\cong -\,p_{ijk}(\phi_T)\big(-\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_j)-\ln\Delta\big) + \ln\Delta + \mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k) + p_{ijk}(\phi_T)\big(-\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k)-\ln\Delta\big), \\
&\qquad \text{where } \Delta \triangleq \exp\big(-\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_j)\big)+\exp\big(-\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k)\big) \\
&\cong p_{ijk}(\phi_T)\,\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_j) + \big(1-p_{ijk}(\phi_T)\big)\,\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k) + \ln\Delta \\
&\cong p_{ijk}(\phi_T)\,\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_j) + \big(1-p_{ijk}(\phi_T)\big)\,\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k)
   + \ln\Big(\exp\big(-\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_j)\big)+\exp\big(-\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k)\big)\Big) \\
&\cong \big(p_{ijk}(\phi_T)-1\big)\,\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_j) + \big(1-p_{ijk}(\phi_T)\big)\,\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k)
   + \ln\Big(1+\exp\big(-(\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k)-\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_j))\big)\Big) \\
&\cong \rho_{ijk}\big(\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k)-\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_j)\big) + \iota\big(\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_k)-\mathrm{Dist}_{\phi_S}(\mathbf{x}_i,\mathbf{x}_j)\big)
\end{aligned}
\tag{4}
$$

The notation ≅ neglects the constant terms in the equation. Define ρ_ijk = 1 − p_ijk(φ_T) and ι(x) = ln(1 + exp(−x)) as the logistic loss. From Eq. 4, we find that by matching the stochastic triplet probabilities based on φ_T and φ_S, the objective first optimizes the embedding with a logistic loss, so that the similar instances indicated by the triplets have small distances while dissimilar ones have large distances. Furthermore, Eq. 4 rectifies the minimization/maximization of distances by adding different weights on the distances between similar/dissimilar pairs based on the teacher's estimation. For example, if a pair (x_i, x_j) is similar, then p_ijk(φ_T) is large and ρ_ijk is small, which puts more weight on their distance Dist_{φ_S}(x_i, x_j), and the rectification of the logistic loss has only a minor influence. Otherwise, ρ_ijk is large, and the force to minimize Dist_{φ_S}(x_i, x_j) will not be strong, since x_i and x_j are not that similar as measured by the teacher's embedding. In other words, different from the binary label ("similar" or "dissimilar") indicated by the triplet, the relative similarities between pairs are specified by the teacher's embedding φ_T.

Differences with Related Methods. Several recent methods propose to distill knowledge from the embedding perspective [2, 1, 10, 14, 21], and the effectiveness of the distilled embedding is usually verified on representation learning tasks or standard knowledge distillation tasks. For example, an implementation choice in [13] first constructs angles over triplets and then matches the angles by regression; Qi et al. [15] take advantage of imprinted weights to initialize the classifier for low-shot learning. In our REFILLED approach, we emphasize distilling the classification ability from a cross-task teacher model, and in the embedding distillation stage we take advantage of the stochastic triplet probability, which is more general, as revealed in Eq. 4. The superiority of REFILLED is also validated in the experiments.

1.2. Distill the Local Classifier

In standard knowledge distillation [5], we optimize the student model while matching its prediction confidences with those of a (fixed) teacher:

$$
\min_{f_S}\;\sum_{i=1}^{N}\ell\big(f_S(\mathbf{x}_i),\mathbf{y}_i\big) + \lambda\,R\big(s_\tau(f_T(\mathbf{x}_i)),\,s_\tau(f_S(\mathbf{x}_i))\big). \tag{5}
$$

Denote the vanilla objective with the cross-entropy loss (i.e., the first part in Eq. 5) as O_xent and the whole objective in Eq. 5 as O_kd.

Facilitated by the improved embedding, we propose to utilize a local knowledge distillation term to help distill the classification ability from the teacher:

$$
\min_{f_S}\;\sum_{i=1}^{N}\ell\big(f_S(\mathbf{x}_i),\mathbf{y}_i\big) + \lambda\,\mathrm{KL}\big(p_{\phi_T}(\mathbf{y}_i \mid \mathbf{x}_i),\,s_\tau(f_S(\mathbf{x}_i))\big). \tag{6}
$$

It is notable that p_{φ_T}(y_i | x_i) is the Nearest Center Mean (NCM) confidence of the teacher over the C′ classes in the current mini-batch (recall that there are C classes in total in the data set). f_S(x_i) = W⊤φ_S(x_i) is the prediction of the student model, while s_τ(f_S(x_i)) is the corresponding probability normalized with the softmax. Eq. 6 is a local version of knowledge distillation, since it only considers the classes in the current sampled mini-batch. We denote the objective with the Local Knowledge Distillation (LKD) term as O_LKD.
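To make this term concrete, the sketch below shows one way the teacher's NCM confidence over the classes appearing in a mini-batch could be computed and matched against the student's prediction; the function name local_kd_loss, the use of negative squared distances to class means as logits, and the batchmean reduction are illustrative assumptions rather than the exact implementation of the paper.

import torch
import torch.nn.functional as F

def local_kd_loss(emb_t, logits_s, labels, tau=2.0):
    # emb_t: (B, d) teacher embeddings of the mini-batch; logits_s: (B, C) student logits.
    classes = labels.unique()                          # the C' classes in this mini-batch
    centers = torch.stack([emb_t[labels == c].mean(0) for c in classes])  # class means
    # Teacher's NCM confidence: softmax over negative distances to the class means.
    dist_t = torch.cdist(emb_t, centers).pow(2)        # (B, C')
    p_t = F.softmax(-dist_t / tau, dim=1)
    # Student's prediction restricted to the same local set of classes.
    log_p_s = F.log_softmax(logits_s[:, classes] / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean")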

We analyze the effectiveness of the knowledge distillation by its gradient over the top-layer classifier W ∈ R^{d×C}, where we omit the bias for simplicity. The c-th column of W corresponds to the classifier w_c. In the following, without loss of generality, we analyze the gradient of w_c over one single instance x whose target label is c. Denote p_c and q_c as the c-th elements of the student's and teacher's normalized predictions s_τ(f_S(x)) and s_τ(f_T(x)) (the posterior probability of the c-th class given the student's and the teacher's embedding of the instance), respectively.

In the vanilla learning scenario, the gradient w.r.t. w_c is

$$
\frac{\partial O_{\mathrm{xent}}}{\partial \mathbf{w}_c} = \big[-p_c(1-p_c)\big]\,\phi(\mathbf{x})\,. \tag{7}
$$

With an additional knowledge distillation term (refer to Eq. 5), the gradient w.r.t. w_c changes to

$$
\frac{\partial O_{\mathrm{kd}}}{\partial \mathbf{w}_c} = \Big[-p_c + \sum_{c'=1}^{C} p_{c'}\,q_c\Big]\,\phi(\mathbf{x})\,. \tag{8}
$$

Comparing Eq. 7 and Eq. 8, when the soft supervision from the teacher is considered, not only the instance of the target class but also those of related classes (the relatedness is specified by the student's prediction p_{c'}, weighted by the teacher's prediction q_c) are incorporated to direct the update of the classifier.

The summation in Eq. 8 is computed over all C classes in the data set. If C is large, the normalized class posterior probability q_c is small, so the helpful related-class instances are not weighted prominently. Therefore, the supervision provided by the teacher is weakened a lot.

We verify this claim with an experiment on CIFAR-100. The averaged norm difference between the gradients of the vanilla cross-entropy loss and of the knowledge distillation variants over all top-layer classifiers serves as the measure: the smaller the norm difference, the weaker the additional supervision signal provided by the teacher. Fig. 1 plots the change of the norm difference as we increase the number of randomly sampled classes in a task. All gradients are measured during the initial optimization of the model. As shown in the figure, when the number of classes grows, the gradient norm difference between the vanilla KD loss and the cross-entropy loss decreases faster than that between the local KD loss and the cross-entropy loss, which means the supervision provided by the vanilla KD teacher is weakened more than the supervision provided by the local KD teacher.

Therefore, we consider the local version of the knowledge distillation term in Eq. 6, where only the classes in the current mini-batch are considered, i.e., the posterior probabilities are normalized only over the current set of classes.
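The gradient-norm measurement behind Fig. 1 could be sketched as follows; note that, for brevity, this sketch forms the local term by renormalizing the teacher's softened logits over the mini-batch classes, whereas REFILLED uses the NCM confidence of Eq. 6, and the helper names are ours.

import torch
import torch.nn.functional as F

def grad_wrt(loss, W):
    # Gradient of a scalar loss w.r.t. the top-layer classifier weights W.
    return torch.autograd.grad(loss, W, retain_graph=True)[0]

def kd_gradient_norm_gaps(feat_s, logits_t, labels, W, tau=4.0, lam=1.0):
    # feat_s: (B, d) student features; logits_t: (B, C) teacher logits; W: (d, C), requires_grad.
    logits_s = feat_s @ W
    xent = F.cross_entropy(logits_s, labels)
    # Global KD: match the teacher's softened softmax over all C classes.
    kd = xent + lam * F.kl_div(F.log_softmax(logits_s / tau, dim=1),
                               F.softmax(logits_t / tau, dim=1), reduction="batchmean")
    # Local KD (simplified): renormalize both predictions over the classes in the batch.
    cls = labels.unique()
    lkd = xent + lam * F.kl_div(F.log_softmax(logits_s[:, cls] / tau, dim=1),
                                F.softmax(logits_t[:, cls] / tau, dim=1), reduction="batchmean")
    g_xent, g_kd, g_lkd = grad_wrt(xent, W), grad_wrt(kd, W), grad_wrt(lkd, W)
    return (g_kd - g_xent).norm().item(), (g_lkd - g_xent).norm().item()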

2. Experimental Settings

We will provide detailed settings for all tasks, namely the cross-task knowledge distillation, the standard knowledge distillation, and the middle-shot classification task.

Figure 1. The averaged norm differences between the vanilla cross-entropy loss and the knowledge distillation variants (KD and LKD) over the gradient of all top-layer classifiers. When the number of classes in a task grows, the norm difference decreases fast. However, the decrease is mitigated when using the local KD loss.

2.1. Cross-Task Knowledge Distillation

In the cross-task knowledge distillation task, we would like to verify whether a well-trained cross-task teacher model can help the training of the student model in the current task.

Datasets. We investigate the Caltech-UCSD Birds-200-2011 (CUB) [24] data set, which is a fine-grained classification problem over 200 different species of birds. We do not use the attribute information of the instances. As a pre-processing step, we crop all images based on the provided bounding boxes.

Splits. We implement a cross-task knowledge transfer task by selecting two sets of non-overlapping classes for the teacher and the student, respectively. Specifically, 100 of the 200 classes are selected to train the teacher model, while the remaining 100 classes are used to train the student. The classes in the CUB data set are sorted in alphabetical order, and we consider two different split criteria.

In the "hard" case, the two sets of 100 classes are split based on the given alphabetical order, while in the "easy" case, we randomly select 100 classes from the class pool to train the teacher and use the remaining classes for the student. The main difference between the two cases is the domain gap between the training sets of the teacher and the student. Since the original classes are sorted in alphabetical order, classes with numerically close indexes are more similar. Thus, the teacher is more "distant" from the student in the hard case than in the easy one.
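The two split criteria could be constructed from the 200 alphabetically ordered class indexes as in the sketch below; the function name make_cub_splits and the seed handling are our own illustrative choices.

import random

def make_cub_splits(num_classes=200, mode="hard", seed=0):
    # Class indexes 0..199 follow the alphabetical order of the CUB class names.
    classes = list(range(num_classes))
    if mode == "hard":
        teacher_classes = classes[: num_classes // 2]        # first 100 classes
        student_classes = classes[num_classes // 2 :]        # remaining 100 classes
    else:  # "easy": a random half of the classes for the teacher, the rest for the student
        rng = random.Random(seed)
        teacher_classes = sorted(rng.sample(classes, num_classes // 2))
        student_classes = sorted(set(classes) - set(teacher_classes))
    return teacher_classes, student_classes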

For each 100-way classification task of the teacher or the student, we use 70% of the data in each class for training, and the remaining instances are used for testing.

Implementation of the Teacher. We train a teacher model on the corresponding training set by optimizing a cross-entropy loss with a MobileNets [6] model whose width multiplier is 1.0. We use Stochastic Gradient Descent (SGD) as the default optimizer, with momentum 0.9, batch size 128, a maximum of 200 epochs, and an initial learning rate of 0.1, which is multiplied by 0.2 after every 50 epochs. We hold out a part of the examples from the training set for validation, from which the best set of hyper-parameters is selected. With the selected hyper-parameters, we re-train the teacher model on the whole training set. When training the model, we first resize the images to 224 × 224, and use random crop together with horizontal flip as the data augmentation. The same is done when training the student.
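As a reference, the training recipe above corresponds roughly to the following PyTorch sketch; the StepLR reading of the learning-rate schedule, the crop padding, and the data-loading details are assumptions for illustration, not the authors' released code.

from torch import nn, optim
from torchvision import transforms

# Augmentation used for both teacher and student training, as described above.
train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomCrop(224, padding=8),   # padding size is an assumption
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def train_teacher(model, loader, epochs=200):
    criterion = nn.CrossEntropyLoss()
    opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    sched = optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.2)
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            criterion(model(images), labels).backward()
            opt.step()
        sched.step()
    return model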

Implementation of the Student. We use different configurations of the MobileNets [6] and adjust the model complexity with different width multipliers (more complicated models have larger multipliers) in {1, 0.75, 0.5, 0.25}. There are two stages for the student. In both stages, the temperature τ of the teacher's model is set to 2, and we do not smooth the logits of the student. When distilling the embedding, we set the momentum to 0.9, the batch size to 128, the maximum number of epochs to 200, and the initial learning rate to 0.1, which is multiplied by 0.2 after every 50 epochs. In the second stage, we use the same hyper-parameter values. λ is set to 0.01, and the same λ is used across different experiments on a data set; we find the performance of REFILLED is not very sensitive to λ. Since the weight parameter λ in Eq. 6 is decayed during the optimization, we use λ = 1000 · exp(−0.05) · λ to update the weight of the local knowledge distillation term after each epoch.

Evaluations. The averaged classification accuracy over 3 trials (with different random seeds) is reported.

Comparison Baselines. We consider three kinds of baseline methods in this task.

• Classification based on the teacher's embedding. We extract features with the pre-trained teacher's embedding function φ_T for the instances in the student's split. Based on the teacher's embedding, we can classify with either the nearest neighbor (1NN) classifier or a linear logistic regression (LR). Besides, we fine-tune the teacher's model on the instances in the student's split with a small learning rate (0.0001) and a fixed number of epochs (50). Fine-tuning the teacher model requires the student to have the same architecture as the teacher. Since in this case the student has the same number of training instances as the teacher, using a large learning rate would make the student reach the same weights as training from scratch, so we use a small initial learning rate in our experiments.

• Knowledge distillation baselines. In the cross-task knowledge distillation task, we compare our method with one representative embedding-based distillation approach, Relational Knowledge Distillation (RKD), and fine-tune the whole student model with its distilled embedding. The hyper-parameters of RKD are tuned in the same way as for our REFILLED approach.

• Variants of REFILLED. We investigate the importance of different components in REFILLED.

2.2. Standard Knowledge Distillation

In the standard knowledge distillation, we reuse the knowledge from a same-task teacher model, i.e., both the teacher and the student target the same classification task.

Datasets. Following [1], we test the knowledge distillation ability of our REFILLED on the benchmark data set CIFAR-100 [9]. CIFAR-100 is a small-image data set (with size 32 × 32), which contains 100 classes and 600 images in each class. In addition, we also consider the CUB data set, which has a different split strategy compared with the previous task.

Splits. On CIFAR-100, we follow the standard split, with 500 images per class for training and 100 images per class for testing. On CUB, all 200 classes are used during training, with the standard training and test partitions. It is notable that both the teacher and the student models are investigated on the same training and test sets.

Implementation of the Teacher and the Student. Three different families of neural networks are taken into account to test the ability of REFILLED, namely ResNet [4], Wide-ResNet [27], and MobileNets [6]. To obtain models of different capacity, we change the depth of the ResNet (through the number of layers), the (depth, width) pair of the Wide-ResNet, and the width of the MobileNets (through the width multipliers). We set λ = 0.02 on CIFAR-100.

We mainly consider two sub-tasks in the standard knowledge distillation, i.e., setting the teacher and the student model from either the same or different model families.

• Same-family knowledge distillation. Both the teacher and the student come from the same model family. We use the same configuration as [1]. On CIFAR-100, both the teacher and the student are Wide-ResNets. We set the (depth, width) pair of the teacher to (40, 2), and change the configuration of the student model among (40, 2), (16, 2), (40, 1), and (16, 1). On CUB, we consider the MobileNets: setting the teacher's width multiplier to 1, we vary the width multiplier of the student among {1, 0.75, 0.5, 0.25}.

• Different-family knowledge distillation. The teacher and the student come from different architecture families. We consider the knowledge transfer flow from ResNet to MobileNets. Taking the computational burden into consideration, we choose ResNet-110 as the teacher on CIFAR-100, and ResNet-34 as the teacher on CUB. We only change the width multiplier of the student model in {0.75, 0.5, 0.25} on CUB so that the student model has a smaller capacity than the teacher.

The teacher and the student models in the standard knowledge distillation task are trained in a similar way to the previous section. Here we set the temperature of the teacher's model to 4. For CIFAR-100, we pad 4 pixels on each edge before the random crop.

Evaluations. Both the teacher and the student are trained on the same set with three different seeds of initialization, and we report the mean accuracy of the student on the test set.

2.3. Middle-Shot Classification Task

Similar to the popular few-shot learning (FSL) task [23, 17, 3, 16, 12, 26, 15], we also investigate the problem of learning with a middle number of shots.

Datasets. We use the popular MiniImageNet data set [23], which contains 100 classes in total and 600 images in each class. All images are resized to 84 × 84 before being input to the models.

Splits. Following [23, 17], 64 classes (the SEEN classes) are used to train the teacher (a.k.a. the meta-train set), 16 classes are used for validation (a.k.a. the meta-val set), and we sample tasks from the remaining 20 classes (a.k.a. the meta-test set) to train the student.

Implementation of the Teacher and the Student. We set the student as a 4-layer ConvNet [23, 20, 3], and consider two types of teacher models, i.e., the same 4-layer ConvNet (but trained on the different classes in the meta-train set) and the ResNet [12, 26]. The ConvNet contains 4 identical blocks, and each block is a sequence of a convolution operator, batch normalization [8], ReLU, and max pooling. We add a global max pooling layer after the 4 blocks to reduce the computational burden, which gives rise to a 64-dimensional embedding before the top-layer classifier. For the ResNet, we follow the architecture in [12, 26], which removes the two down-sampling layers of the vanilla ResNet. The ResNet outputs 640-dimensional embeddings.
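For concreteness, a minimal PyTorch sketch of such a 4-layer ConvNet backbone is given below; the 64-channel width, 3×3 kernels, and pooling configuration follow common usage of this backbone in few-shot learning and are assumptions rather than a verbatim copy of the authors' code.

from torch import nn

def conv_block(in_ch, out_ch):
    # One block: convolution -> batch normalization -> ReLU -> max pooling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class ConvNet4(nn.Module):
    def __init__(self, out_dim=64):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, out_dim),
            conv_block(out_dim, out_dim),
            conv_block(out_dim, out_dim),
            conv_block(out_dim, out_dim),
        )
        self.pool = nn.AdaptiveMaxPool2d(1)    # global max pooling -> 64-d embedding

    def forward(self, x):                      # x: (B, 3, 84, 84)
        return self.pool(self.blocks(x)).flatten(1)   # (B, 64)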

We train teacher models on the SEEN-class set (meta-train set) with both the ResNet and the ConvNet. Supervised by the cross-entropy loss, we use random crop and horizontal flip as the data augmentation, SGD with momentum 0.9 as the optimizer, and 128 as the batch size. The student is then trained with the help of the teacher, and the tasks are sampled from the UNSEEN classes (meta-test set).

Evaluations. We define a K-shot C-way task as a C-class classification problem with K training instances in each class. Different from few-shot learning, where C = 5 and K ∈ {1, 5}, here we consider a few more instances per class, i.e., K ∈ {10, 30}. Although the value of K increases in middle-shot learning, it is still too small to train a complicated neural network from scratch. We sample tasks from the 20-class split (meta-test set) to train the student model, and evaluate the results by classifying another 15 instances from each of the C classes. We report the mean accuracy over 600 trials.
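The episode sampling behind this protocol could look like the sketch below; sample_episode and its inputs are illustrative assumptions, with 15 query instances per class as stated above.

import random

def sample_episode(data_by_class, way=5, shot=10, query=15, rng=None):
    # data_by_class: dict mapping a meta-test class id to its list of images.
    rng = rng or random.Random()
    chosen = rng.sample(sorted(data_by_class), way)
    support, query_set = [], []
    for label, c in enumerate(chosen):
        imgs = rng.sample(data_by_class[c], shot + query)
        support += [(img, label) for img in imgs[:shot]]     # K training instances per class
        query_set += [(img, label) for img in imgs[shot:]]   # 15 evaluation instances per class
    return support, query_set

# Mean accuracy over 600 sampled tasks; accuracy_fn is whatever classifier is being evaluated.
# accs = [accuracy_fn(*sample_episode(meta_test_data, way=5, shot=10)) for _ in range(600)]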

Comparison Methods. We compare our methods with two branches of baselines:

• Meta-learning methods. Meta-learning is a popular way to solve the few-shot classification problem. To mimic the test case, it samples C-way K-shot tasks from the SEEN-class set to learn a task-level inductive bias such as the embedding [23, 20]. However, the computational burden (e.g., the batch size) becomes large when the number of shots increases. Besides, meta-learning needs to specify the way to obtain a meta-model from the SEEN classes. We compare our methods with embedding-based meta-learning approaches such as ProtoNet [20] and FEAT [26].

• Embedding-based baselines. We can make predictions directly with the teacher's embedding (the penultimate layer of the teacher model) by leveraging the nearest neighbor classifier. Based on the embedding, we can also train linear classifiers such as an SVM on the current task's middle-shot data, or fine-tune the whole model. It is notable that we tune the hyper-parameters of these methods with sampled middle-shot tasks on the validation split.

3. Additional Experimental Results

This section shows additional experimental results on the knowledge distillation and middle-shot learning tasks.

Table 1. The cross-task distillation mean accuracy on the CUB data set, where the teacher and the student are trained on non-overlapping sets of 100 classes with MobileNets. Two split scenarios are considered, i.e., the "Easy" and "Hard" cases. The three values in the teacher's row correspond to baselines: applying 1NN based on the teacher's embedding, training a linear LR classifier based on the fixed teacher's embedding, and Fine-Tuning (FT) based on the teacher's embedding. More details can be found in the text.

                 Easy                                       Hard
Channel          1       0.75    0.5     0.25               1       0.75    0.5     0.25
Teacher          1NN: 49.23, LR: 56.77, FT: 66.94           1NN: 45.31, LR: 53.82, FT: 65.72
Student          70.04   68.13   66.44   64.63              71.25   67.56   66.85   64.48
RKD [13]         71.10   68.81   67.15   64.28              70.83   68.80   67.44   63.97
Vanilla          71.62   70.27   70.15   66.75              71.90   69.14   68.91   65.38
LKD              71.93   70.73   70.88   67.41              72.53   70.01   69.50   66.42
REFILLED         72.48   71.04   71.35   67.87              73.38   70.42   69.77   67.10

3.1. Cross-Task Knowledge Distillation

The results of cross-task distillation can be found in Table 1. Two splits of the data are considered, where the domain gaps between the teacher and the student differ.

We first adapt the teacher model for cross-task classification via three baselines: 1NN based on the teacher's embedding, a linear Logistic Regression (LR) classifier trained on the fixed teacher's embedding, and Fine-Tuning (FT) the model starting from the teacher's embedding. The teacher has width multiplier 1, so it achieves a test accuracy of about 70.04 when trained directly on the student's split (equal to the student's performance with width multiplier 1). The test accuracy of the student becomes higher with more complicated models (larger width multipliers). The teacher baselines perform better when trained in the easy scenario (the left part of Table 1).

We also compare with one representative embedding-based approach, Relational Knowledge Distillation (RKD) [13], and fine-tune the model after obtaining the distilled embedding from the cross-task teacher. RKD sometimes achieves better accuracy than the vanilla student model (denoted as "Student" in the table).

Since REFILLED distills both the embedding and the classification ability from the teacher, we run two baselines on top of the improved embedding of REFILLED. We test the quality of the embedding by fine-tuning the model with cross-entropy starting from REFILLED's distilled embedding, which is denoted as "Vanilla" in the table. It achieves better classification performance than RKD, which verifies that our approach obtains a high-quality embedding in the first stage.

LKD denotes the variant of REFILLED utilizing the local knowledge distillation term but without decayed weights. It can further improve the vanilla training. Our REFILLED approach achieves the best classification performance in all cases. Benefiting from reusing the knowledge from the teacher, the classification achieves a further improvement w.r.t. the vanilla training.

Table 2. The mean classification accuracy of knowledge distillation methods on the CUB data set. The teacher is trained with ResNet-34, which obtains 75.31% test accuracy. The student is learned with MobileNets, whose width multiplier is varied.

width multiplier    0.75    0.5     0.25
Student             74.87   72.41   69.72
KD [5]              76.02   74.17   71.97
FitNet [18]         75.03   72.17   70.03
AT [28]             76.11   72.94   70.99
NST [7]             75.89   73.82   71.92
KD+VID-I [1]        76.41   74.04   72.20
RKD [13]            76.11   75.24   72.84
REFILLED            78.01   75.90   73.15

3.2. Standard Knowledge Distillation

We use REFILLED to distill the knowledge from a cross-family teacher on CUB. We set the teacher as ResNet-34, and use MobileNets with different width multipliers (from {0.75, 0.5, 0.25}) as the student model. Table 2 shows the results, where we re-implement all the comparison methods for this setting. REFILLED keeps its superiority in this case.

3.3. Few-Shot Learning and Middle-Shot Learning

The same configuration of REFILLED on the middle-shot learning tasks can also be applied to the few-shot learning scenarios. In addition to the middle-shot learning results, we provide few-shot learning results (where the number of shots is 1 or 5) in Table 3. Both ProtoNet and FEAT are meta-learned over the pre-trained embeddings from the SEEN-class set (meta-train set). REFILLED gets better results when reusing a strong teacher (i.e., the ResNet), and works well with larger shots.

Table 3. The mean accuracy over 600 trials of few-shot and middle-shot tasks. We set the student model as the ConvNet, and investigate both ResNet and ConvNet as the teacher model for our REFILLED approach. Detailed results and configurations are in the supplementary. REFILLED1 denotes the results reusing a ResNet teacher, and REFILLED2 stands for the results reusing a ConvNet teacher.

Tasks             1-Shot 5-Way   5-Shot 5-Way   10-Shot 5-Way   30-Shot 5-Way
1NN               49.73          63.11          66.56           69.80
SVM               51.61          69.17          74.24           77.87
Fine-Tune         45.89          68.61          74.95           78.62
MAML [3]          48.70          63.11          -               -
ProtoNet [20]     51.79          70.38          74.42           78.10
FEAT [26]         55.15          71.61          74.86           78.84
REFILLED1         54.82          71.97          76.42           80.33
REFILLED2         53.44          70.60          75.37           78.94

3.4. Ablation Study, 1-Stage vs. 2-Stage Learning

We train REFILLED in a two-stage manner as in [18, 1]. We first discuss the pros and cons of the two ways, and then provide the ablation study results in Table 4 and Table 5. The experiment settings in Table 4 and Table 5 are the same as those in Section 5.1 and Section 5.2 of the main paper. In both tables, REFILLEDγ means training REFILLED in a one-stage manner with balancing hyper-parameter γ, i.e.,

$$
\min_{\phi_S, f_S}\;\gamma\sum_{ijk}\mathrm{KL}\big(P_{ijk}(\phi_T)\,\|\,P_{ijk}(\phi_S)\big)
+ \sum_{i=1}^{N}\ell\big(f_S(\mathbf{x}_i),\mathbf{y}_i\big) + \lambda\,\mathrm{KL}\big(p_{\phi_T}(\mathbf{y}_i \mid \mathbf{x}_i),\,s_\tau(f_S(\mathbf{x}_i))\big). \tag{9}
$$
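A one-stage training step combining the three terms of Eq. 9 is sketched below; the per-batch loss tensors are assumed to be computed elsewhere (e.g., with the illustrative helpers above), and the default γ = 0.1 and λ = 0.01 simply mirror values mentioned in this document.

def one_stage_step(optimizer, triplet_kl, xent, local_kd, gamma=0.1, lam=0.01):
    # Eq. 9 for one mini-batch: gamma * relationship matching (triplet KL)
    # + cross-entropy + lambda * local knowledge distillation.
    loss = gamma * triplet_kl + xent + lam * local_kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()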

From the model design perspective: the two-stage training in REFILLED works well since the distilled discriminative embedding acts as a better initialization and hence improves the discerning ability of the model, while training with a combined objective regularizes the classifier by matching the predictions between student and teacher, so that it relies on a suitable strength of the regularization.

From the implementation perspective: an important issue for the joint training of the combined objective is to set the right balance among the embedding learning (relationship distillation), classification (cross-entropy), and knowledge transfer (local knowledge distillation) losses. In our empirical study, it is a bit hard to tune the weights among them. In the two-stage training strategy, we can first learn a good embedding until convergence, and then use this embedding to initialize the second stage, where the balance between classification and distillation is handled with an annealing strategy.

From the results in Table 4 and Table 5 on two different data sets, the two-stage training makes it easier for REFILLED to achieve higher performance. (The best result in each configuration is in bold.)

Table 4. The mean accuracy of cross-task distillation on the CUB data set. REFILLED0.1 and REFILLED0.2 denote the results of the one-stage training protocol with balancing hyper-parameter 0.1 and 0.2, respectively.

Width Multiplier   1       0.75    0.5     0.25
Teacher            1NN: 45.31, LR: 53.82, FT: 65.72
Student            71.25   67.56   66.85   64.48
RKD [13]           70.83   68.80   67.44   63.97
REFILLED           73.38   70.42   69.77   67.10
REFILLED0.1        72.64   70.13   69.28   66.41
REFILLED0.2        72.42   70.22   68.65   66.85

Table 5. The average classification results of knowledge distillation methods on the CIFAR-100 data set based on the Wide-ResNet. REFILLED0.1 and REFILLED0.2 denote the results of the one-stage training protocol with balancing hyper-parameter 0.1 and 0.2, respectively.

(depth, width)     (40, 2)   (16, 2)   (40, 1)   (16, 1)
Teacher            74.44
Student            74.44     70.15     68.97     65.44
KD [5]             75.47     71.87     70.46     66.54
FitNet [18]        74.29     70.89     68.66     65.38
AT [28]            74.76     71.06     69.85     65.31
NST [7]            74.81     71.19     68.00     64.95
VID-I [1]          75.25     73.31     71.51     66.32
KD+VID-I [1]       76.11     73.69     72.16     67.19
RKD [13]           76.62     72.56     72.18     65.22
REFILLED           77.49     74.01     72.72     67.56
REFILLED0.1        77.03     73.81     72.30     67.14
REFILLED0.2        76.95     73.90     71.64     67.34

References

[1] Sungsoo Ahn, Shell Xu Hu, Andreas C. Damianou, Neil D. Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In CVPR, pages 9163–9171, 2019.
[2] Hanting Chen, Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Learning student networks via feature embedding. CoRR, abs/1812.06597, 2018.
[3] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[5] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
[6] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[7] Zehao Huang and Naiyan Wang. Like what you like: Knowledge distill via neuron selectivity transfer. CoRR, abs/1707.01219, 2017.
[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[9] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[10] Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In CVPR, pages 7096–7104, 2019.
[11] R. Manmatha, Chao-Yuan Wu, Alexander J. Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In CVPR, pages 2859–2867, 2017.
[12] Boris N. Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In NeurIPS, pages 719–729, 2018.
[13] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976, 2019.
[14] Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In ICCV, pages 5007–5016, 2019.
[15] Hang Qi, Matthew Brown, and David G. Lowe. Low-shot learning with imprinted weights. In CVPR, pages 5822–5830, 2018.
[16] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L. Yuille. Few-shot image recognition by predicting parameters from activations. In CVPR, pages 7229–7238, 2018.
[17] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[18] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
[19] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
[20] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. In NeurIPS, pages 4080–4090, 2017.
[21] Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In ICCV, pages 1365–1374, 2019.
[22] Laurens van der Maaten and Kilian Q. Weinberger. Stochastic triplet embedding. In MLSP, pages 1–6, 2012.
[23] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In NeurIPS, pages 3630–3638, 2016.
[24] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[25] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.
[26] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Learning embedding adaptation for few-shot learning. CoRR, 2018.
[27] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
[28] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.