
Knowledge Distillation for Action Anticipation via Label Smoothing

Guglielmo Camporese∗, Pasquale Coscia∗, Antonino Furnari†, Giovanni Maria Farinella†, Lamberto Ballan∗
∗Department of Mathematics “Tullio Levi-Civita”, University of Padova, Italy
†Department of Mathematics and Computer Science, University of Catania, Italy

Abstract—The human capability to anticipate the near future from visual observations and non-verbal cues is essential for developing intelligent systems that need to interact with people. Several research areas, such as human-robot interaction (HRI), assisted living, and autonomous driving, need to foresee future events to avoid crashes or to help visually impaired people. Such a challenging task requires capturing and understanding the underlying structure of the analyzed domain in order to reduce prediction uncertainty. Since the action anticipation task can be seen as a multi-label problem with missing labels, we design and extend the idea of label smoothing, extracting semantics from the target labels. We show that this generalization is equivalent to a knowledge distillation framework in which a teacher injects useful semantic information into the model during training. In our experiments, we implement a multi-modal framework based on long short-term memory (LSTM) networks to anticipate future actions, which is able to summarize past observations while making predictions of the future at different time steps. To validate our soft labeling procedure, we perform extensive experiments on the egocentric EPIC-Kitchens dataset, which includes more than 2500 action classes. The experiments show that label smoothing systematically improves the performance of state-of-the-art models.

I. INTRODUCTION

Human action analysis is a central task in computer vision that has an enormous impact on many applications, such as video content analysis [1], [2], video surveillance [3], [4], and automated driving vehicles [5], [6]. Systems interacting with humans also need the capability to promptly react to context changes and plan their actions accordingly. Most previous works focus on the tasks of action recognition [7], [8], [9] or early-action recognition [10], [11], [12], i.e., recognition of an action after its observation (it happened in the past) or recognition of an ongoing action from its partial observation (only part of the current action is available). A more challenging task is to predict the near future, i.e., to forecast actions that will be performed ahead in time. Predicting future actions before observing the corresponding frames [13], [14] is demanded by many applications which need to anticipate human behaviour. For example, intelligent surveillance systems may support human operators to avoid hazards, and assistive robots may help non-self-sufficient people. Such a task requires analyzing significant spatio-temporal variations among actions performed by different people. For this reason, multiple modalities (e.g., appearance and motion) are typically considered to improve the identification of similar actions. Egocentric scenarios provide useful settings to study early-action recognition and action anticipation tasks.

Fig. 1: Our framework for action anticipation. After an encoding procedure of the video, during the decoding stage our model anticipates the next action that will occur in τ_a seconds. Afterwards, a teacher model distills semantic information via label smoothing into the action anticipation model during training in order to reduce the uncertainty of future predictions.

Indeed, wearable cameras offer an explicit point-of-view to capture human motion and object interaction.

In this work, we address the problem of anticipating egocentric human actions in an indoor scenario at several time steps. More specifically, we anticipate an action by leveraging video segments that precede the action. We disentangle the processing of the video into encoding and decoding stages. In the first stage, the model summarizes the video content, while in the second stage the model predicts the next action at multiple anticipation times τ_a (see Fig. 1). We exploit a recurrent neural network (RNN) to capture temporal correlations between subsequent frames and consider three different modalities for representing the input: appearance (RGB), motion (optical flow), and object-based features. An important aspect to consider when dealing with human action anticipation is that the future is uncertain, which means that different predictions of future actions may be equally likely to occur. For example, the actions “sprinkle over pumpkin seeds” and “sprinkle over sunflower seeds” may be equally performed when preparing a recipe. For this reason, to deal with the uncertainty of future


predictions, we propose to group similar actions by comparing several label smoothing techniques in order to broaden the set of possible futures and reduce the uncertainty caused by one-hot encoded labels. Label smoothing was introduced in [15] as a form of regularization for classification models, since it introduces a positive constant value into the wrong-class components of the one-hot target. A peculiar feature of this method is to make models robust to overfitting, especially when labels in the dataset are noisy, e.g., when the targets are ambiguous. In our work, we extend label smoothing by using soft labels as a bridge for distilling knowledge into the model during training. Our experiments on the large-scale EPIC-Kitchens dataset show that label smoothing increases the performance of state-of-the-art models and yields better generalization on test data.

The main contributions of our work are as follows: 1) we generalize the label smoothing idea by extrapolating semantic priors from the action labels to capture the multi-modal future component of the action anticipation problem. We show that label smoothing, in this context, can be seen as a knowledge distillation process where the teacher gives semantic prior information to the action anticipation model; 2) we perform extensive experiments on egocentric videos, proving that our label smoothing techniques systematically improve the action anticipation results of state-of-the-art models.

II. RELATED WORK

Action anticipation requires the interpretation of the current activity using a number of observations in order to foresee the most likely actions. For this reason, we briefly review three related research areas: action recognition, early-action recognition, and action anticipation.

Action recognition. Action recognition is the task of recognizing the action contained in an observed trimmed video. Classic approaches to action recognition have leveraged hand-designed features, coupled with machine learning algorithms, to recognize actions from video [7], [16], [17]. More recent works have investigated the use of deep learning to obtain representations suitable for action recognition directly from video in an end-to-end fashion. Among these approaches, a line of works has investigated ways to exploit standard 2D CNNs for action recognition, often relying on optical flow as a means to represent motion [8], [18], [9], [19], [20], [21]. Other works have focused on the extension of 2D CNNs to 3D CNNs able to process spatio-temporal volumes [2], [22], [23]. Some approaches have used recurrent networks to model the temporal relationships between per-frame observations [24], [25], [14]. All of these works investigate in depth how to represent and leverage the input video, but little or no importance is given to the representation of the action labels.

Early-action recognition. Early action recognition consists in recognizing ongoing actions from partial observations of streaming video [12]. Classic works have addressed the task using integral histograms of spatio-temporal features [10], sparse coding [11], Structured Output SVMs [26], and Sequential Max-Margin event detectors [27]. Another line of research has leveraged LSTMs [28], [29], [30], [14] to account for the sequential nature of the task.

Action anticipation. Action anticipation deals with forecasting actions that will happen in the future. Previous studies have investigated different approaches such as hierarchical representations [31], auto-regressive HMMs [32], regressing future representations [33], encoder-decoder LSTMs [34], and inverse reinforcement learning [35]. Other approaches propose to perform long-term predictions focusing only on appearance features [36], [37]. However, differently from this work, very little or no attention has been paid either to the knowledge distillation of action semantics or to label smoothing for action anticipation.

Knowledge distillation and label smoothing. Knowledge distillation [38] is the procedure of transferring the information extracted by a teacher network (with high learning capacity) to a student network (with low learning capacity) in order to allow the latter to reach similar performance. This is usually obtained by training the student via a distillation loss which takes into account both the ground truth and the prediction of the pre-trained teacher. Since this procedure can distill useful information from the teacher to the student, we perform semantic distillation via label smoothing.

Label smoothing, introduced in [15], is the procedure of softening the distribution of the target labels by reducing the most confident value of the one-hot vector and assigning a uniform value to all the zero components. Although this procedure improves results for classification problems by reducing overfitting, no previous works investigate design approaches other than uniform smoothing. In our work, we both generalize this idea and show a systematic improvement of state-of-the-art models.

III. PROPOSED APPROACH

Anticipating human actions is essential for developing intelligent systems able to avoid accidents or guide people to correctly perform their actions. We study the suitability of label smoothing techniques to address this issue.

A. Label Smoothing

As investigated in [39], there is an inherent uncertainty in predicting future actions. In fact, starting from the current observation of an action, there can be multiple, yet still plausible, future scenarios. Hence, the problem can be reformulated as a multi-label task with missing labels where, of all the valid future realizations, only one is sampled from the dataset. All previous models designed for action anticipation are trained with cross entropy using one-hot labels, leveraging only one of the possible future scenarios as the ground truth. A major drawback of using hard labels is to favour the logits of correct classes while weakening the importance of other plausible ones. In fact, given the one-hot encoded vector $y^{(k)}$ for a class $k$, the prediction of the model $p$, and the logits of the model $z$ such that $p(i) = e^{z(i)} / \sum_j e^{z(j)}$, the cross entropy is minimized only if $z(k) \gg z(i)\ \forall i \neq k$.


Fig. 2: Action anticipation protocol based on the encoding and decoding stages. We summarize past observations by processing video snippets sampled every α = 0.25 seconds in the encoding stage. After 6 steps, we start making predictions every α seconds for 8 steps to anticipate the future action.

Fig. 3: Encoding (left) and decoding (right) stages of our architecture. The encoding block takes as input the video snippet represented by three modalities (optical flow, RGB, and objects). Each modality is separately fed into an LSTM. The decoding block is composed of three LSTM branches that take as input the processed video snippets. We apply dropout to each branch. Finally, all the branches are concatenated and passed to a fully connected layer with softmax activation.

This fact encourages the model to be over-confident about its predictions, since during training it tries to focus all the energy on one single logit, leading to overfitting and scarce adaptivity [15]. To this purpose, we smooth the target distribution, enabling negative classes to be plausible. However, the usual label smoothing procedure introduces a uniform positive component among all the classes, without capturing the differences between actions. To this end, we propose several ways of designing the smoothing procedure by encoding semantic priors into the labels and weighting the actions according to their feature representation. We can connect our soft labels approach to the knowledge distillation framework [38], where the teacher provides useful information to the student model during training. What differs is that the teacher does not depend on the input data but solely on the target, i.e., it distills information using ground-truth data. Since the teacher's prediction is constant w.r.t. the input, such information can be encoded before training into the target via label smoothing.

As a form of regularization, [15] introduces the idea of smoothing hard labels by averaging one-hot encoded targets with constant vectors as follows:

$$y_{\mathrm{soft}}(i) = (1 - \alpha)\, y(i) + \alpha / K, \qquad (1)$$

where $y$ is the one-hot encoding, $\alpha$ is the smoothing factor ($0 \leq \alpha \leq 1$), and $K$ is the number of classes. Since the cross entropy is linear w.r.t. its first argument, it can be written as follows:

$$\mathrm{CE}[y_{\mathrm{soft}}, p] = \sum_i -y_{\mathrm{soft}}(i)\, \log p(i) = (1 - \alpha)\,\mathrm{CE}[y, p] + \alpha\,\mathrm{CE}[1/K, p]. \qquad (2)$$

The optimization based on the above loss can be seen as a knowledge distillation procedure [38] where the teacher randomly predicts the output, i.e., $p_{\mathrm{teacher}}(i) = 1/K\ \forall i$. Hence, the connection with the distillation loss shows that the second term of Eq. (1) can be seen as prior knowledge, given by an agnostic teacher, for the target $y$. Although using an agnostic teacher seems an unusual choice, uniform label smoothing can be seen as a form of regularization [15] and thus it can improve the model's generalization capability. Taking this into account, we extend the idea of smoothing labels by modeling the second term of Eq. (1), i.e., the prior knowledge of the targets, as follows:

$$y_{\mathrm{soft}}(i) = (1 - \alpha)\, y(i) + \alpha\, \pi(i), \qquad (3)$$

where $\pi \in \mathbb{R}^K$ is the prior vector such that $\sum_i \pi(i) = 1$ and $\pi(i) \geq 0\ \forall i$.

Therefore, the resulting cross entropy with soft labels is written as follows:

$$\mathrm{CE}[y_{\mathrm{soft}}, p] = (1 - \alpha)\,\mathrm{CE}[y, p] + \alpha\,\mathrm{CE}[\pi, p]. \qquad (4)$$

This loss not only penalizes errors related to the correct class but also errors related to the positive entries of the prior. Starting from this formulation, we introduce Verb-Noun, GloVe, and Temporal priors for smoothing labels in the knowledge distillation procedure. In the following, we detail our label smoothing techniques.
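To make Eqs. (3)-(4) concrete, the following minimal NumPy sketch (our illustration, not the paper's code) builds a soft target from a one-hot vector and an arbitrary prior and evaluates the decomposed cross entropy; the uniform prior used here as a placeholder recovers Eq. (1).

```python
import numpy as np

def smooth_labels(y_onehot, prior, alpha):
    """Eq. (3): y_soft = (1 - alpha) * y + alpha * pi."""
    assert np.isclose(prior.sum(), 1.0) and (prior >= 0).all()
    return (1.0 - alpha) * y_onehot + alpha * prior

def soft_cross_entropy(y_soft, probs, eps=1e-12):
    """Eq. (4): decomposes into (1-alpha)*CE[y, p] + alpha*CE[pi, p]."""
    return -np.sum(y_soft * np.log(probs + eps))

K = 5                                  # number of classes (toy example)
y = np.eye(K)[2]                       # one-hot target for class k = 2
pi = np.full(K, 1.0 / K)               # uniform prior recovers Eq. (1)
y_soft = smooth_labels(y, pi, alpha=0.1)
p = np.full(K, 1.0 / K)                # a dummy model prediction
print(soft_cross_entropy(y_soft, p))
```

The sections below replace the uniform placeholder with the semantic priors of Eqs. (7), (9), and (10).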

Verb-Noun label smoothing. EPIC-KITCHENS [13] contains action labels structured as verb-noun pairs, like “cut onion” or “dry spoon”. More formally, if we define $\mathcal{A}$ as the set of actions, $\mathcal{V}$ as the set of verbs, and $\mathcal{N}$ as the set of nouns, then an action is represented by a tuple $a = (v, n)$ where $v \in \mathcal{V}$ and $n \in \mathcal{N}$. Let $\mathcal{A}_v(v)$ be the set of actions sharing the same verb $v$ and $\mathcal{A}_n(n)$ be the set of actions sharing the same noun $n$, defined as follows:

$$\mathcal{A}_v(v) = \{(v, n) \in \mathcal{A}\ \forall\, n \in \mathcal{N}\}, \qquad (5)$$

$$\mathcal{A}_n(n) = \{(v, n) \in \mathcal{A}\ \forall\, v \in \mathcal{V}\}, \qquad (6)$$

where $n \in \mathcal{N}$ and $v \in \mathcal{V}$. We define the prior of the $k$-th ground-truth action class as

$$\pi_{VN}^{(k)}(i) = \mathbb{1}\!\left[ a^{(i)} \in \mathcal{A}_v(v^{(k)}) \cup \mathcal{A}_n(n^{(k)}) \right] \frac{1}{C_k}, \qquad (7)$$

where $a^{(i)}$ is the $i$-th action, $v^{(k)}$ and $n^{(k)}$ are the verb and the noun of the $k$-th action, $\mathbb{1}[\cdot]$ is the indicator function, and $C_k = |\mathcal{A}_v(v^{(k)})| + |\mathcal{A}_n(n^{(k)})| - 1$ is a normalization term. Using this encoding rule, the cross entropy not only penalizes the error related to the correct class but also the errors with respect to all the other “similar” actions sharing either the same verb or the same noun¹.
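As an illustration of Eq. (7), the sketch below builds the Verb-Noun prior matrix row by row; the helper name and the toy action list are ours, and actions are assumed to be given as (verb, noun) tuples.

```python
import numpy as np

def build_vn_prior(actions):
    """Return a matrix P where P[k, i] = pi_VN^(k)(i) as in Eq. (7)."""
    K = len(actions)
    P = np.zeros((K, K))
    for k, (v_k, n_k) in enumerate(actions):
        # Indicator of Eq. (7): action i shares the verb or the noun of action k.
        mask = np.array([v == v_k or n == n_k for (v, n) in actions], dtype=float)
        # mask.sum() equals C_k = |A_v(v_k)| + |A_n(n_k)| - 1.
        P[k] = mask / mask.sum()
    return P

actions = [("cut", "onion"), ("cut", "carrot"), ("dry", "spoon"), ("wash", "onion")]
print(build_vn_prior(actions)[0])      # prior for "cut onion": [1/3, 1/3, 0, 1/3]
```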

GloVe-based label smoothing. An important aspect to consider when dealing with actions represented by verbs and/or nouns is their semantic meaning. In Verb-Noun label smoothing, we define the prior considering a rough yet still meaningful semantics, where actions that share either the same verb or the same noun are considered similar. To extend this idea, we extrapolate the prior from the word embedding of the action. One of the most important properties of word embeddings is to place words with similar semantic meanings close together and dissimilar ones far apart, as opposed to hard labels, which cannot capture any similarity between classes since $y^{(i)\top} y^{(j)} = 0\ \forall i \neq j$.

Using such an action representation, we enable the distillation of useful information into the model during training, since the cross entropy not only penalizes the error related to the correct class but also the errors related to all other similar actions. In order to compute the word embeddings of the actions, we use the GloVe model [40] pretrained on the Wikipedia 2014 and Gigaword 5 datasets. We use the GloVe model since it does not rely just on local statistics of words, but incorporates global statistics to obtain word vectors. Since the model takes as input only single words, we encode the action as follows:

$$\phi^{(k)} = \mathrm{Concat}\!\left[ \mathrm{GloVe}(v^{(k)}),\ \mathrm{GloVe}(n^{(k)}) \right], \qquad (8)$$

¹It can be proved that, in terms of scalar product, two different classes sharing the same noun or verb and encoded with Verb-Noun label smoothing are closer than two classes encoded with hard labels.

         Temporal   Uniform   Verb-Noun   GloVe   GloVe+Verb-Noun
α∗       0.6        0.1       0.45        0.6     0.5

TABLE I: Results of a grid search procedure to detect the best smoothing factor α∗ for the proposed label smoothing techniques.

where $\phi^{(k)}$ is the obtained representation of the action $a^{(k)} = (v^{(k)}, n^{(k)})$ and $\mathrm{GloVe}(\cdot) \in \mathbb{R}^{300}$ is the output of the GloVe model. We finally compute the prior probability for smoothing the labels as the similarity between two action representations, computed as follows:

$$\pi_{GL}^{(k)}(i) = \frac{\left| \phi^{(k)\top} \phi^{(i)} \right|}{\sum_j \left| \phi^{(k)\top} \phi^{(j)} \right|}. \qquad (9)$$

Hence, $\pi_{GL}^{(k)}(i)$ in Eq. (9) represents the similarity between the $k$-th and the $i$-th action.
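The GloVe prior of Eqs. (8)-(9) can be sketched as follows; the `glove` lookup table is a random stand-in for the pretrained Wikipedia 2014 + Gigaword 5 embeddings, whose loading code is omitted here.

```python
import numpy as np

# Stand-in embedding table; in practice, load the pretrained 300-d GloVe vectors.
glove = {w: np.random.randn(300) for w in ["cut", "dry", "onion", "spoon"]}

def encode_action(verb, noun):
    """Eq. (8): phi = Concat[GloVe(verb), GloVe(noun)]."""
    return np.concatenate([glove[verb], glove[noun]])

def build_glove_prior(actions):
    """Eq. (9): pi_GL^(k)(i) = |phi_k . phi_i| / sum_j |phi_k . phi_j|."""
    phi = np.stack([encode_action(v, n) for (v, n) in actions])
    sim = np.abs(phi @ phi.T)          # pairwise |phi_k^T phi_i|
    return sim / sim.sum(axis=1, keepdims=True)

actions = [("cut", "onion"), ("dry", "spoon"), ("cut", "spoon")]
print(build_glove_prior(actions)[0])   # similarity-based prior for "cut onion"
```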

Temporal label smoothing. Some actions are more likely to co-occur than others. Furthermore, only specific action sequences may be considered plausible. For this reason, it is reasonable to focus on the most frequent action sequences, since they may reveal valid paths in the action space. In this case, we build the prior probability by considering subsequent actions of length two, i.e., we estimate from the training set the transition probability from the $i$-th to the $k$-th action as follows:

$$\pi_{TE}^{(k)}(i) = \frac{\mathrm{Occ}\!\left[ a^{(i)} \rightarrow a^{(k)} \right]}{\sum_j \mathrm{Occ}\!\left[ a^{(j)} \rightarrow a^{(k)} \right]}, \qquad (10)$$

where $\mathrm{Occ}[a^{(i)} \rightarrow a^{(k)}]$ is the number of times that the $i$-th action is followed by the $k$-th action. Using this representation, we reward both the correct class and the most frequent actions that precede the correct class.
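A possible implementation of Eq. (10) is sketched below, assuming the training annotations are available as per-video sequences of action indices (an assumption of ours; the actual data pipeline is not specified at this level of detail).

```python
import numpy as np

def build_temporal_prior(train_sequences, num_actions, eps=1e-12):
    """Return P where the prior for target action k is the column P[:, k]."""
    occ = np.zeros((num_actions, num_actions))
    for seq in train_sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            occ[prev, nxt] += 1.0      # Occ[a_prev -> a_nxt]
    # Eq. (10): normalize each column over all actions preceding action k.
    return occ / (occ.sum(axis=0, keepdims=True) + eps)

train_sequences = [[0, 2, 1], [0, 1, 2], [2, 1, 0]]   # toy action-index sequences
P = build_temporal_prior(train_sequences, num_actions=3)
print(P[:, 1])                          # prior over actions preceding action k = 1
```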

B. Action Anticipation Architecture

For our experiments, we consider a learning architecture based on recurrent neural networks. Following [14], our approach uses the protocol depicted in Fig. 2 for anticipating future actions. We process the frames preceding the action that we want to anticipate, grouping them into video snippets of length 5. A video snippet is collected every α = 0.25 seconds and processed considering three different modalities: RGB features computed using a Batch Normalized Inception CNN [41] trained for action recognition, object features computed using Fast R-CNN [42], and optical flow computed as in [19] and processed through a Batch Normalized Inception CNN trained for action recognition. Our multi-modal architecture processes the above inputs and encompasses two building blocks: an encoder which recognizes and summarizes past observations, and a decoder which predicts future actions at different anticipation time steps. As shown in Fig. 3, during the encoding stage each modality is separately processed by an LSTM layer. During the decoding stage, the streams are merged with late fusion and fed into a fully connected layer with softmax activation.
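A simplified tf.keras sketch of the encoder-decoder of Fig. 3 is reported below. Feature dimensions, hidden size, and dropout rate are illustrative assumptions, and the 6 encoding + 8 anticipation steps are handled as a single unrolled sequence for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_ACTIONS = 2513                      # EPIC-KITCHENS action classes
FEAT_DIMS = {"rgb": 1024, "flow": 1024, "obj": 352}   # assumed feature sizes
STEPS, HIDDEN = 14, 256                 # 6 encoding + 8 anticipation steps

inputs, branches = [], []
for name, dim in FEAT_DIMS.items():
    x = layers.Input(shape=(STEPS, dim), name=name)
    inputs.append(x)
    # Encoding: one LSTM per modality summarizes the observed snippets.
    h = layers.LSTM(HIDDEN, return_sequences=True, name=f"enc_{name}")(x)
    # Decoding: a second LSTM per branch keeps emitting a hidden state at
    # every anticipation step; dropout is applied to each branch (Fig. 3).
    h = layers.LSTM(HIDDEN, return_sequences=True, name=f"dec_{name}")(h)
    h = layers.Dropout(0.3)(h)
    branches.append(h)

# Late fusion: concatenate the branches and classify at each time step.
fused = layers.Concatenate()(branches)
out = layers.Dense(NUM_ACTIONS, activation="softmax")(fused)
model = tf.keras.Model(inputs, out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy")   # also accepts soft targets
model.summary()
```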


IV. EXPERIMENTS

A. Dataset and Evaluation Measures

Dataset. Our experiments are performed on the EPIC-KITCHENS [13] dataset. This is a large-scale collection of egocentric videos containing 39,596 action annotations divided into 2513 unique actions, 125 verbs, and 352 nouns. We use the same split as [14], producing 23,493 segments for training and 4,979 segments for validation.

Evaluation Measures. To assess the quality of predictions and compare all methods, we use the Top-k accuracy, i.e., we assume a prediction to be correct if the ground-truth label falls within the top-k predictions. As reported in [39], [43], this measure is one of the most appropriate given the uncertainty of future predictions. More specifically, we use the Top-5 accuracy for method comparison. For the test set, we also use the Top-1 accuracy, the Macro Average Class Precision, and the Macro Average Class Recall. The last two metrics are computed only on many-shot nouns, verbs, and actions, as explained in [13].
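The Top-k accuracy can be computed as in the following sketch (a generic implementation of the metric, not tied to the paper's evaluation code):

```python
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """probs: (N, K) class scores; labels: (N,) ground-truth indices."""
    topk = np.argsort(probs, axis=1)[:, -k:]        # indices of the k best scores
    hits = (topk == labels[:, None]).any(axis=1)    # true label among the top-k?
    return hits.mean()

probs = np.random.rand(4, 10)
labels = np.array([3, 7, 1, 0])
print(top_k_accuracy(probs, labels, k=5))
```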

B. Models and Baselines

In the comparative analysis, we exploit the architecture proposed in Sec. III-B, employing the different label smoothing techniques defined in Sec. III-A. In our experiments, we consider models trained using one-hot vectors (One Hot), uniform smoothing (Smooth Uniform), temporal soft labels (Smooth TE), Verb-Noun soft labels (Smooth VN), GloVe-based soft labels (Smooth GL), and GloVe + Verb-Noun soft labels (Smooth GL+VN). We design the last method (Smooth GL+VN) by smoothing hard labels with the average of the two related priors. We also evaluate the effect of the proposed label smoothing techniques on the state-of-the-art approach RU-LSTM [14]. In this case, we choose GL+VN as the label smoothing technique.

To implement the architecture, we used TensorFlow 2. Each model is trained for 100 epochs with a batch size of 256 using the Adam optimizer with a learning rate of 0.001. The best model is selected using early stopping by monitoring the Top-5 accuracy at anticipation time τ_a = 1 s on the validation set.
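A sketch of this training setup with standard tf.keras components is shown below; the model and data are dummy stand-ins (in practice, the targets are the smoothed labels and the monitored metric is restricted to τ_a = 1 s), and the patience value is our assumption since only the early-stopping criterion is stated.

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins so the sketch runs end to end.
model = tf.keras.Sequential([tf.keras.Input(shape=(32,)),
                             tf.keras.layers.Dense(10, activation="softmax")])
x = np.random.randn(256, 32).astype("float32")
y = np.eye(10)[np.random.randint(0, 10, 256)].astype("float32")

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=[tf.keras.metrics.TopKCategoricalAccuracy(k=5, name="top5")])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_top5", mode="max",
                                              patience=10,
                                              restore_best_weights=True)
model.fit(x, y, validation_split=0.2, epochs=100, batch_size=256,
          callbacks=[early_stop])
```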

Results. For each label smoothing method, we select the best smoothing factor α∗ with a grid search between 0 and 1 with step size ∆α = 0.05. The smoothing factors used for the results discussed in this section are shown in Table I.

We notice that our label smoothing procedures, such as Verb-Noun, GloVe, and Temporal, perform well with higher smoothing factors (α∗ ≈ 0.5) compared to Uniform label smoothing (α∗ = 0.1). This suggests that, when soft labels encode semantic information, the prior becomes more relevant, assuming roughly the same importance as the ground truth given the similar smoothing factors (α ≈ 1 − α). During training, we fix α = α∗ for each method and, in order to have a robust performance estimation, we repeat all the experiments ten times.
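The grid search over α can be summarized by the following sketch, where `train_and_validate` is a hypothetical helper returning the Top-5 validation accuracy at τ_a = 1 s for a given α:

```python
import numpy as np

def grid_search_alpha(train_and_validate, step=0.05):
    """Sweep alpha over [0, 1] with the paper's step size and keep the best."""
    alphas = np.arange(0.0, 1.0 + step, step)
    scores = [train_and_validate(alpha) for alpha in alphas]
    return alphas[int(np.argmax(scores))]

# Toy objective standing in for the real validation score.
best = grid_search_alpha(lambda a: -(a - 0.5) ** 2)
print(best)  # 0.5 for this toy objective
```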

Table II reports our results on the EPIC-KITCHENS validation set. We notice that all our proposed label smoothing methods improve model performance compared with training using one-hot encoded labels. Soft labels based on GloVe + Verb-Noun attain the best performance, improving the Top-5 accuracy from +1.17% to +2.13% compared to hard labels. To validate our results, we trained [14], the state-of-the-art model for action anticipation, with label smoothing on the EPIC-KITCHENS validation and test sets. Table III reports results of the RU-LSTM architecture using both one-hot encoding (i.e., the baseline) and smoothed labels on the validation set. We select GloVe + Verb-Noun soft labels since they show the highest performance increase. As shown by the obtained results, label smoothing improves the Top-5 accuracy for all the anticipation times by a margin from +0.58% to +1.59%. Such behavior highlights the systematic effect of smoothed labels on model performance.

In Table IV, we report the results obtained on the test set of EPIC-KITCHENS, where several approaches are considered for comparison. It is worth noting that the method which considers label smoothing together with RU-LSTM improves the performance, obtaining the best Top-1 and Top-5 accuracy for anticipating verbs, nouns, and actions. Label smoothing also helps to improve Precision for verbs and obtains comparable results in anticipating actions and nouns. Results in terms of Recall point out that label smoothing helps for noun anticipation while maintaining comparable results for the anticipation of verbs and actions. Finally, Fig. 4 shows some qualitative results obtained with our framework, whereas Fig. 5 depicts the prior components of the proposed label smoothing procedures.

Fig. 4: Qualitative results of our label smoothing procedure. We show the comparison between the top-10 predictions of our model at τ_a = 1 second, trained either with smoothed labels or with one-hot vectors. These examples are from the validation set, where both models predict the upcoming action correctly under the top-5 accuracy. The model trained on one-hot labels is more confident and tries to concentrate all the prediction energy on a very restricted set of actions, without capturing the uncertainty of the future. By contrast, smoothed labels shape the prediction distribution by considering actions similar to the ground truth.

(a) GloVe Smoothing Matrix. (b) Verb-Noun Smoothing Matrix. (c) Temporal Smoothing Matrix.

Fig. 5: Smooth label matrices. Each matrix represents the prior component of the label smoothing procedure reported in Eq. (3), and each row corresponds to a single label prior. Squared structures can be recognized on the diagonals of the GloVe and Verb-Noun priors because the labels are alphabetically ordered. The Temporal prior has sparse row entries since there is no major occurrence trend or structure in the training set.

V. CONCLUSION

This study proposed a knowledge distillation procedure via label smoothing for leveraging the multi-modal future component of the action anticipation problem. We generalized the idea of label smoothing by designing semantic priors of actions that are used during training as ground-truth labels. We implemented an LSTM baseline model that anticipates actions at multiple time steps starting from a multi-modal representation of the input video. Experimental results corroborate our findings compared to state-of-the-art models, highlighting that label smoothing systematically improves performance when dealing with future uncertainty.

Acknowledgements: Research at the University of Padova is partially supported by the MIUR PRIN-2017 PREVUE grant. The authors from the University of Padova gratefully acknowledge the support of NVIDIA for the donation of GPUs, and the UNIPD CAPRI Consortium for its support and access to computing resources. Research at the University of Catania is supported by Piano della Ricerca 2016-2018 linea di Intervento 2 of DMI and MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP E64118002540007.


Top-5 Action Accuracy % @ different anticipation times [s]

Method               2           1.75        1.5         1.25        1           0.75        0.5         0.25
One Hot (baseline)   27.71±0.33  28.69±0.34  29.84±0.24  30.90±0.48  31.93±0.45  33.14±0.36  34.10±0.44  35.16±0.35
Smooth TE            27.94±0.24  28.90±0.27  30.06±0.24  31.13±0.19  32.19±0.28  33.21±0.36  34.17±0.37  35.10±0.25
Smooth Uniform       28.16±0.27  29.06±0.26  30.23±0.24  31.25±0.27  32.41±0.28  33.64±0.27  34.69±0.19  35.75±0.13
Smooth VN            28.43±0.30  29.41±0.31  30.68±0.28  31.85±0.21  33.08±0.18  34.35±0.19  35.38±0.34  36.46±0.26
Smooth GL            28.61±0.26  29.87±0.25  30.97±0.34  31.94±0.34  33.12±0.36  34.40±0.37  35.51±0.37  36.87±0.25
Smooth GL+VN         28.88±0.20  29.94±0.19  31.23±0.32  32.54±0.31  33.56±0.28  34.92±0.25  36.06±0.33  37.29±0.30
Improv.              +1.17       +1.25       +1.39       +1.64       +1.63       +1.78       +1.96       +2.13

TABLE II: Top-5 accuracy for the action anticipation task at different anticipation time steps on the EPIC-KITCHENS validation set. We report results of several label smoothing techniques and show a systematic performance improvement compared to one-hot encoded labels.

Top-5 Action Accuracy % @ different anticipation times [s]

Method                  2       1.75    1.5     1.25    1       0.75    0.5     0.25
RU-LSTM                 29.44   30.73   32.24   33.41   35.32   36.34   37.37   38.39
RU-LSTM Smooth GL+VN    30.37   31.64   33.17   34.86   35.90   37.07   38.96   39.74
Improv.                 +0.93   +0.91   +0.93   +1.45   +0.58   +0.73   +1.59   +1.35

TABLE III: Top-5 accuracy for the action anticipation task at different anticipation time steps on the EPIC-KITCHENS validation set. We introduce the GL+VN smoothing technique into the state-of-the-art RU-LSTM [14] model.

                            Top-1 Accuracy %      Top-5 Accuracy %      Avg Class Precision %  Avg Class Recall %
                            VERB  NOUN  ACT.      VERB  NOUN  ACT.      VERB  NOUN  ACT.       VERB  NOUN  ACT.
S1  DMR [33]                26.53 10.43 01.27     73.30 28.86 07.17     06.13 04.67 00.33      05.22 05.59 00.47
    2SCNN [13]              29.76 15.15 04.32     76.03 38.56 15.21     13.76 17.19 02.48      07.32 10.72 01.81
    ATSN [13]               31.81 16.22 06.00     76.56 42.15 28.21     23.91 19.13 03.13      09.33 11.93 02.39
    MCE [39]                27.92 16.09 10.76     73.59 39.32 25.28     23.43 17.53 06.05      14.79 11.65 05.11
    ED [34]                 29.35 16.07 08.08     74.49 38.83 18.19     18.08 16.37 05.69      13.58 14.62 04.33
    Miech et al. [44]       30.74 16.47 09.74     76.21 42.72 25.44     12.42 16.67 03.67      08.80 12.66 03.85
    RU-LSTM [14]            33.04 22.78 14.39     79.55 50.95 33.73     25.50 24.12 07.37      15.73 19.81 07.66
    RU-LSTM Smooth GL+VN    35.04 23.03 14.43     79.56 52.90 34.99     30.65 24.29 06.64      14.85 20.91 07.61

S2  DMR [33]                24.79 08.12 00.55     64.76 20.19 04.39     09.18 01.15 00.55      05.39 04.03 00.20
    2SCNN [13]              25.23 09.97 02.29     68.66 27.38 09.35     16.37 06.98 00.85      05.80 06.37 01.14
    ATSN [13]               25.30 10.41 02.39     68.32 29.50 06.63     07.63 08.79 00.80      06.06 06.74 01.07
    MCE [39]                21.27 09.90 05.57     63.33 25.50 15.71     10.02 06.88 01.99      07.68 06.61 02.39
    ED [34]                 22.52 07.81 02.65     62.65 21.42 07.57     07.91 05.77 01.35      06.67 05.63 01.38
    Miech et al. [44]       28.37 12.43 07.24     69.96 32.20 19.29     11.62 08.36 02.20      07.80 09.94 03.36
    RU-LSTM [14]            27.01 15.19 08.16     69.55 34.38 21.10     13.69 09.87 03.64      09.21 11.97 04.83
    RU-LSTM Smooth GL+VN    29.29 15.33 08.81     70.71 36.63 21.34     14.66 09.86 04.48      08.95 12.36 04.78

TABLE IV: Results of action anticipation on the test sets of EPIC-KITCHENS. The test set is divided into kitchens already seen (S1) or unseen (S2) by the model. The table confirms, also on the test set, that our label smoothing procedure can improve the performance of the state-of-the-art model RU-LSTM.

REFERENCES

[1] L. Ballan, M. Bertini, A. Del Bimbo, L. Seidenari, and G. Serra, “Event detection and recognition for semantic annotation of video,” Multimedia Tools and Applications, vol. 51, no. 1, pp. 279–302, 2011.
[2] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[3] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
[4] W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[5] A. Bhattacharyya, M. Fritz, and B. Schiele, “Long-term on-board prediction of people in traffic scenes under uncertainty,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[6] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, “Event-based vision meets deep learning on steering prediction for self-driving cars,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[7] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[8] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems (NeurIPS), 2014.
[9] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] M. S. Ryoo, “Human activity prediction: Early recognition of ongoing activities from streaming videos,” in IEEE International Conference on Computer Vision (ICCV), 2011.
[11] Y. Cao, D. Barrett, A. Barbu, S. Narayanaswamy, H. Yu, A. Michaux, Y. Lin, S. Dickinson, J. Mark Siskind, and S. Wang, “Recognize human activities from partially observed videos,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[12] R. De Geest, E. Gavves, A. Ghodrati, Z. Li, C. Snoek, and T. Tuytelaars, “Online action detection,” in European Conference on Computer Vision (ECCV), 2016.
[13] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, “Scaling egocentric vision: The EPIC-Kitchens dataset,” in European Conference on Computer Vision (ECCV), 2018.


[14] A. Furnari and G. M. Farinella, “What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention,” in IEEE International Conference on Computer Vision (ICCV), 2019.
[15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[16] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, “Dense trajectories and motion boundary descriptors for action recognition,” International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013.
[17] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in IEEE International Conference on Computer Vision (ICCV), 2013.
[18] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal multiplier networks for video action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[19] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European Conference on Computer Vision (ECCV), 2016.
[20] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” in European Conference on Computer Vision (ECCV), 2018.
[21] J. Lin, C. Gan, and S. Han, “TSM: Temporal shift module for efficient video understanding,” in IEEE International Conference on Computer Vision (ICCV), 2019.
[22] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), 2015.
[23] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[24] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[25] S. Sudhakaran, S. Escalera, and O. Lanz, “LSTA: Long short-term attention for egocentric action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[26] M. Hoai and F. De la Torre, “Max-margin early event detectors,” International Journal of Computer Vision, vol. 107, no. 2, pp. 191–202, 2014.
[27] D. Huang, S. Yao, Y. Wang, and F. De La Torre, “Sequential max-margin event detectors,” in European Conference on Computer Vision (ECCV), 2014.
[28] M. S. Aliakbarian, F. S. Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson, “Encouraging LSTMs to anticipate actions very early,” in IEEE International Conference on Computer Vision (ICCV), 2017.
[29] F. Becattini, T. Uricchio, L. Seidenari, A. Del Bimbo, and L. Ballan, “Am I done? Predicting action progress in videos,” arXiv preprint arXiv:1705.01781, 2017.
[30] R. De Geest and T. Tuytelaars, “Modeling temporal structure with LSTM for online action detection,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
[31] T. Lan, T.-C. Chen, and S. Savarese, “A hierarchical representation for future action prediction,” in European Conference on Computer Vision (ECCV), 2014, pp. 689–704.
[32] A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena, “Car that knows before you do: Anticipating maneuvers via learning temporal driving models,” in IEEE International Conference on Computer Vision (ICCV), 2015.
[33] C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual representations from unlabeled video,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[34] J. Gao, Z. Yang, and R. Nevatia, “RED: Reinforced encoder-decoder networks for action anticipation,” in British Machine Vision Conference (BMVC), 2017.
[35] K.-H. Zeng, W. B. Shen, D.-A. Huang, M. Sun, and J. C. Niebles, “Visual forecasting by imitating dynamics in natural sequences,” in IEEE International Conference on Computer Vision (ICCV), 2017.
[36] T. Mahmud, M. Hasan, and A. K. Roy-Chowdhury, “Joint prediction of activity labels and starting times in untrimmed videos,” in IEEE International Conference on Computer Vision (ICCV), 2017.
[37] Y. Abu Farha, A. Richard, and J. Gall, “When will you do what? Anticipating temporal occurrences of activities,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[38] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015.
[39] A. Furnari, S. Battiato, and G. Maria Farinella, “Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation,” in European Conference on Computer Vision Workshops (ECCVW), 2018.
[40] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014.
[41] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning (ICML), 2015.
[42] R. B. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015.
[43] H. S. Koppula and A. Saxena, “Anticipating human activities using object affordances for reactive robotic response,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 14–29, 2016.
[44] A. Miech, I. Laptev, J. Sivic, H. Wang, L. Torresani, and D. Tran, “Leveraging the present to anticipate the future in videos,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.