arXiv:2010.10019v2 [cs.CV] 3 Jan 2021

Noname manuscript No. (will be inserted by the editor)

Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

Thao Minh Le · Vuong Le · Svetha Venkatesh · Truyen Tran

Received: date / Accepted: date

Abstract Video Question Answering (Video QA) challenges modelers on multiple fronts. Modeling video necessitates building not only spatio-temporal models for the dynamic visual channel but also multimodal structures for associated information channels such as subtitles or audio. Video QA adds at least two more layers of complexity – selecting relevant content for each channel in the context of the linguistic query, and composing spatio-temporal concepts and relations hidden in the data in response to the query. To address these requirements, we start with two insights: (a) content selection and relation construction can be jointly encapsulated into a conditional computational structure, and (b) video-length structures can be composed hierarchically. For (a) this paper introduces a general-purpose reusable neural unit dubbed Conditional Relation Network (CRN) that takes as input a set of tensorial objects and translates them into a new set of objects encoding relations of the inputs. The generic design of CRN helps ease the commonly complex model-building process of Video QA by simple block stacking and rearrangements, with flexibility in accommodating diverse input modalities and conditioning features across both visual and linguistic domains. As a result, we realize insight (b) by introducing Hierarchical Conditional Relation Networks (HCRN) for Video QA. The HCRN primarily aims at exploiting intrinsic properties of the visual content of a video as well as its accompanying channels in terms of compositionality, hierarchy, and near-term and far-term relation. HCRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content of a video, and long-form, where an additional associated information channel, such as movie subtitles, is presented. Our rigorous evaluations show consistent improvements over state-of-the-art methods on well-studied benchmarks including large-scale real-world datasets such as TGIF-QA and TVQA, demonstrating the strong capabilities of our CRN unit and the HCRN for complex domains such as Video QA. To the best of our knowledge, the HCRN is the very first method attempting to handle long- and short-form multimodal Video QA at the same time.

Thao Minh Le, Vuong Le, Svetha Venkatesh and Truyen Tran
Applied AI Institute, Deakin University, Australia
E-mail: [email protected]

Keywords Video QA · Relational networks · Conditional modules · Hierarchy

1 Introduction

Answering natural questions about a video is a powerful demonstration of cognitive capability. The task involves acquisition and manipulation of spatio-temporal visual, acoustic and linguistic representations from the video guided by the compositional semantics of linguistic cues [1,2,3,4,5,6]. As questions are potentially unconstrained, Video QA requires deep modeling capacity to encode and represent crucial multimodal video properties such as linguistic content, object permanence, motion profiles, prolonged actions, and varying-length temporal relations in a hierarchical manner. For Video QA, the visual and textual representations should ideally be question-specific and answer-ready.

The common approach toward modeling videos for QA is to build neural architectures specially designed for a particular data format and modality. In this perspective, two common variants of Video QA emerge: short-form Video QA, where relevant information is contained in the visual content of short video snippets of a single event (see Fig. 1 for examples), and long-form Video QA (also called Video Story QA), where the keys to the answer are carried in the mixed visual-textual data of longer multi-shot video sequences (see Fig. 2 for an example). Because of this specificity, such hand-crafted architectures tend to be non-optimal for changes in


(1) Question: What does the girl do 9 times?
Choices: 1. walk; 2. blocks a person’s punch; 3. step; 4. shuffle feet; 5. wag tail
Baseline: walk. HCRN: blocks a person’s punch. Ground truth: blocks a person’s punch.

(2) Question: What does the man do before turning body to left?
Choices: 1. run across a ring; 2. pick up the man’s hand; 3. flip cover face with hand; 4. raise hand; 5. breath
Baseline: pick up the man’s hand. HCRN: breath. Ground truth: breath.

Fig. 1 Examples of short-form Video QA for which frame relations are key to correct answers. (1) Near-term frame relations are required for counting fast actions. (2) Far-term frame relations connect actions across a long transition. HCRN, with its ability to model hierarchical conditional relations, handles these successfully, while the baseline struggles.

Subtitle:
00:00:0.395 --> 00:00:1.896 (Keith:) I’m not gonna stand here and let you accuse me
00:00:1.897 --> 00:00:4.210 (Keith:) of killing one of my best friends, all right?
00:00:8.851 --> 00:00:10.394 (Castle:) You hear that sound?

Question: What did Keith do when he was on the stage?
Choices: 1. Keith drank beer; 2. Keith played drum; 3. Keith sing to the microphone; 4. Keith played guitar; 5. Keith got off the stage and walked out
Baseline: Keith played guitar. HCRN: Keith got off the stage and walked out. Ground truth: Keith got off the stage and walked out.

Fig. 2 Example of long-form Video QA. This is a typical question where a model needs to collect sufficient relevant cues from both the visual content of a given video and the textual subtitles to give the correct answer. In this particular example, our baseline is likely to suffer from linguistic bias (“stage” and “played guitar”) while our model successfully arrives at the correct answer by connecting the linguistic information from the first shot with the visual content in the second.

data modality [2], varying video length [7] or question types (such as frame QA [3] versus action count [8]). This has resulted in a proliferation of heterogeneous networks.

We wish to build a family of effective models for both long-form and short-form Video QA from reusable units that are more homogeneous and easier to construct, maintain and comprehend. We start by realizing that Video QA involves two sub-tasks: (a) selecting relevant content for each channel in the context of the linguistic query, and (b) composing spatio-temporal concepts and relations hidden in the data in response to the query. Much of sub-task (a) can be abstracted into a conditional computational structure that computes multi-way interactions between the query and several objects. With this ability, solving sub-task (b) can be approached by composing a hierarchical structure of such abstractions from the ground up.

Toward this goal, we propose a general-purpose reusable neural unit called Conditional Relation Network (CRN) that encapsulates and transforms an array of objects into a new array of relations conditioned on a contextual feature. The unit computes sparse high-order relations between the input objects, and then modulates the encoding through a specified context (See Fig. 4). The flexibility of CRN and its encapsulating design allow it to be replicated and layered to form deep hierarchical conditional relation networks (HCRN) in a straightforward manner (See Fig. 5 and Fig. 6). Within the scope of this work, the HCRN is a two-stream network of visual content and textual subtitles, in which the two sub-networks share a similar design philosophy but are customized to suit the properties of each input modality. Whilst the visual stream is built up by stacked CRN units at different granularities, the textual stream is composed of a single CRN


unit taking linguistic segments as inputs. The stacked units in the visual stream thus provide contextualized refinement of relational knowledge from visual objects – in a stage-wise manner it combines appearance features with clip activity flow and linguistic context, and afterwards incorporates the context information from the whole video motion and linguistic features. On the textual side, the CRN unit functions in the same manner but on textual objects. The resulting HCRN is homogeneous, agreeing with the design philosophy of networks such as InceptionNet [9], ResNet [10] and FiLM [11].

The hierarchy of the CRNs for each input modality is as follows. At the lowest level of the visual stream, the CRNs encode the relations between frame appearances in a clip and integrate the clip motion as context; this output is processed at the next stage by CRNs that now integrate the linguistic context. In the following stage, the CRNs capture the relations between the clip encodings and integrate video motion as context. In the final stage the CRN integrates the video encoding with the linguistic feature as context (See Fig. 5). As for the textual stream, due to its high-level abstraction as well as the diversity of nuances of expression compared to its visual counterpart, we only use the CRN to encode relations between linguistic segments in a given dialogue extracted from textual subtitles and leverage well-known techniques in sequential modeling, such as LSTM [12] or BERT [13], to understand sequential relations at the word level. By allowing the CRNs to be stacked hierarchically, the model naturally supports modeling hierarchical structures in video and relational reasoning. Likewise, by allowing appropriate context to be introduced in stages, the model handles multimodal fusion and multi-step reasoning. For long videos, further levels of hierarchy can be added, enabling the encoding of relations between distant frames.

We demonstrate the capability of HCRN in answering questions in major Video QA datasets, including both short-form and long-form videos. The hierarchical architecture with four layers of CRN units achieves favorable answer accuracy across all Video QA tasks. Notably, it performs consistently well on questions involving either appearance, motion, state transition, temporal relations, or action repetition, demonstrating that the model can analyze and combine information in all of these channels. Furthermore, the HCRN scales well to longer videos simply with the addition of an extra layer. Fig. 1 and Fig. 2 show several representative cases that were difficult for the baseline of flat visual-question interaction but can be handled by our model. Our model and results demonstrate the impact of building general-purpose neural reasoning units that support native multimodal interaction in improving the robustness and generalization capacities of Video QA models.

This paper advances the preliminary work [14] in three main aspects: (1) devising a new network architecture that leverages the design philosophy of HCRN in handling long-form Video QA, demonstrating the flexibility and generic applicability of the proposed CRN unit to various input modalities; (2) providing a comprehensive theoretical analysis of the complexity of the CRN computational unit as well as the resulting HCRN architectures; (3) conducting more rigorous experiments and ablation studies to fully examine the capability of the proposed network architectures, especially for long-form Video QA. These experiments thoroughly assure the consistency in behavior of the CRN unit under different settings and input modalities. We also simplify the sampling procedure in the main algorithm, reducing the run time significantly.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 details our main contributions – the CRN, and the HCRNs for Video QA on both short-form videos and long-form videos that include subtitles – as well as a complexity analysis. The next section describes the results of the experimental suite. Section 5 provides further discussion and concludes the paper.

2 Related work

Our proposed HCRN model advances the development of Video QA by addressing three key challenges: (1) efficiently representing videos as an amalgam of complementary factors including appearance, motion and relations; (2) effectively allowing the interaction of such visual features with the linguistic query; and (3) allowing the integration of different input modalities in Video QA with one unique building block.

Long/short-form Video QA is a natural extension of image QA and has been gathering rising attention in recent years, with the release of a number of large-scale short-form Video QA datasets, such as TGIF-QA [15,16], as well as long-form Video QA datasets with an accompanying textual modality, such as TVQA [2], MovieQA [5] and PororoQA [17]. Prior studies on Video QA treat short-form and long-form Video QA as two separate problems, in which the proposed methods [8,1,15,17,18,2] are devised to handle either one of the two. Different from those works, this paper takes up the challenge of designing a generic method that covers both long-form and short-form Video QA with simple exercises of block stacking and rearrangement, switching one unique model between the two problems depending on the availability of input modalities for each of them.

Spatio-temporal video representation is traditionally done by variations of recurrent networks (RNNs), many of which were used for Video QA, such as recurrent encoder-decoders [19,20], bidirectional LSTM [17] and two-staged LSTM [21]. To increase the memorizing ability, external memory can be added to these networks [1,21]. This technique is more useful for videos that are longer [16] and have more complex structures, such as movies [5] and TV programs [2], with extra accompanying channels such as speech or subtitles. In these cases, memory networks [17,7,22] were used to store multimodal features [6] for later retrieval. Memory-augmented RNNs can also compress video into heterogeneous sets [8] of dual appearance/motion features. While in RNNs appearance and motion are modeled separately, 3D and 2D/3D hybrid convolutional operators [23,24] intrinsically integrate spatio-temporal visual information and are also used for Video QA [15,3]. Multiscale temporal structure can be modeled by either mixing short- and long-term convolutional filters [25] or combining pre-extracted frame features with non-local operators [26,27]. Within the second approach, the TRN network [28] demonstrates the role of temporal frame relations as another important visual feature for video reasoning and Video QA [29]. Relations of pre-detected objects were also considered in a separate processing stream [30] and combined with other modalities in late fusion [31]. Our HCRN model emerges on top of these trends by allowing all three channels of video information, namely appearance, motion and relations, to iteratively interact and complement each other in every step of a hierarchical multi-scale framework.

Earlier attempts at generic multimodal fusion for visual reasoning include bilinear operators, either applied directly [32] or through attention [32,33]. While these approaches treat the input tensors equally in a costly joint multiplicative operation, HCRN separates conditioning factors from refined information, hence it is more efficient and also more flexible in adapting operators to conditioning types.

Temporal hierarchy has been explored for video analysis [34], most recently with recurrent networks [35,36] and graph networks [37]. However, we believe we are the first to consider hierarchical interaction of multiple modalities including linguistic cues for Video QA.

Linguistic query–visual feature interaction in Video QA has traditionally been formed as a visual information retrieval task in a common representation space of independently transformed question and referred video [21]. The retrieval is more convenient with heterogeneous memory slots [8]. On top of information retrieval, co-attention between the two modalities provides a more interactive combination [15]. Developments along this direction include attribute-based attention [38], hierarchical attention [39,40,41], multi-head attention [42,43], multi-step progressive attention memory [18] and combining self-attention with co-attention [3]. For higher-order reasoning, the question can interact iteratively with video features via episodic memory or through a switching mechanism [44]. Multi-step reasoning for Video QA is also approached by [45] and [4] with refined attention.

Unlike these techniques, our HCRN model supports conditioning video features with linguistic cues as a context factor at every stage of the multi-level refinement process. This allows the representation of linguistic cues to be involved earlier and more deeply in video representation construction than in any available method.

Fig. 3 Overall multimodal Video QA architecture. The two streams handle visual and textual modalities in parallel, followed by an answer decoder for feature joining and prediction.

Neural building blocks - Beyond the Video QA domain, the CRN unit shares the ideal of uniformity in neural architecture with other general-purpose neural building blocks such as the block in InceptionNet [9], the Residual Block in ResNet [10], the Recurrent Block in RNNs, the conditional linear layer in FiLM [11], and the matrix-matrix block in neural matrix nets [46]. Our CRN departs significantly from these designs by assuming an array-to-array block that supports conditional relational reasoning and can be reused to build networks for other purposes in vision and language processing. As a result, our HCRN is a perfect fit not only for short-form video, where questions are all about the visual content of a video snippet, but also for long-form video (Movie QA), where a model has to look at both visual cues and textual cues (subtitles) to arrive at correct answers. Due to the great challenges posed by long-form Video QA and the diversity in terms of model building, current approaches in Video QA mainly spend effort on handling the visual part while leaving the textual part to common techniques such as LSTM [2] or the latest advances in natural language processing such as BERT [47]. To the best of our knowledge, HCRN is the first model that can solve both short-form and long-form Video QA with models built up from a generic neural block.

3 Method

The goal of Video QA is to deduce an answer a from a video V and optionally from additional information channels such as subtitles S in response to a natural question q. The answer a can be found in an answer space A, which is a pre-defined set of possible answers for open-ended questions or a list of answer candidates in the case of multi-choice questions. Formally, Video QA can be formulated as follows:

a = argmax_{a ∈ A} F_θ(a | q, V, S),   (1)


Algorithm 1: CRN Unit
Input: Array X = {x_i}_{i=1}^n, conditioning feature c
Output: Array R
Metaparams: {k_max, t | k_max < n}

Initialize R ← {}
for k ← 2 to k_max do
    Q_k ← randomly select t subsets of size k from X
    for each subset q_i ∈ Q_k do
        g_i = g^k(q_i)
        h_i = h^k(g_i, c)
    end
    r_k = p^k({h_i})
    add r_k to R
end

where θ denotes the parameters of the scoring function F. Within the scope of this work, we address two common settings of Video QA: (a) short-form Video QA, where the visual content of a given single video shot suffices to answer the questions, and (b) long-form Video QA, where the essential information is dispersed among the visual content of a multi-shot video sequence and the conversational content of accompanying textual subtitles.
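As a minimal illustration of Eq. 1, the snippet below selects the highest-scoring answer from a candidate set; the scorer passed in is a hypothetical stand-in for F_θ with (q, V, S) fixed, not part of any released code.

# Minimal sketch of the answer selection rule in Eq. 1.
# 'score' is a hypothetical stand-in for F_theta(a | q, V, S) with (q, V, S) fixed.
from typing import Callable, List

def predict_answer(score: Callable[[str], float], candidates: List[str]) -> str:
    return max(candidates, key=score)

# Toy usage with an arbitrary scorer over multi-choice candidates.
print(predict_answer(lambda a: len(a), ["walk", "blocks a person's punch"]))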

HCRN was designed in the endeavor for a homogeneous neural architecture that can adapt to solve both problems. Its overall workflow is depicted in Fig. 3. For long-form videos, when both visual and textual streams are present, HCRN distills relevant information from the visual stream and the textual stream, both conditioned on the question. Eventually, it combines them in an answer decoder for final prediction in a late-fusion multimodal integration. In short-form cases, where only video frames are available, the visual stream alone is active, working with a single-input answer decoder. One of the ambitions of the design is to build each processing stream as a hierarchical network simply by stacking common core processing units of the same family. Similar to all previous deep-learning-based approaches in the literature [15,3,8,1,2,29], our HCRN operates on top of the feature embeddings of multiple input modalities, making use of the powerful feature representations extracted by either common visual recognition models pre-trained on large-scale datasets such as ResNet [10] and ResNeXt [48,49] or pre-trained word embeddings such as GloVe [50] and BERT [13].

In the following subsections, we present the design of the core unit in Sec. 3.1, the hierarchical designs tailored to each modality in Sec. 3.2, and the answer decoder in Sec. 3.3. A theoretical analysis of the computational complexity of the models follows in Sec. 3.4.

3.1 Conditional relation network unit

We introduce a general computation unit, termed Conditional Relation Network (CRN), which takes as input an array of n objects X = {x_i}_{i=1}^n and a conditioning feature c serving as global context. Objects are assumed to live either in the same vector space R^d or in a tensor space, for example R^{W×H×d} in the case of images (or video frames). CRN generates an output array of objects of the same dimensions containing high-order object relations of the input features given the global context. The global context is problem-specific, serving as a modulator to the formation of the relations. When used for Video QA, CRN's input array is composed of features at either frame level, short-clip level, or textual features. Examples of global context include the question and motion profiles at a given hierarchical level.

Fig. 4 Conditional Relation Network. An input array X of n objects is first processed to model k-tuple relations from t sub-sampled size-k subsets by sub-network g^k(.). The outputs are further conditioned on the context c via sub-network h^k(., .) and finally aggregated by p^k(.) to obtain a result vector r_k which represents the k-tuple conditional relations. Tuple sizes can range from 2 to (n − 1), yielding an (n − 2)-dimensional output array.

Table 1 Notations of CRN unit operations.
Notation    Role
X           Input array of n objects (e.g., frames, clips)
c           Conditioning feature (e.g., query, motion feature)
k_max       Maximum subset (also tuple) size considered
k           Each subset size from 2 to k_max
t           Number of size-k subsets of X randomly sampled
Q_k         Set of t size-k subsets sampled from X
g^k(.)      Sub-network processing each size-k subset
h^k(., .)   Conditioning sub-network
p^k(.)      Aggregating sub-network
R           Result array of the CRN unit on X given c
r_k         Member result vector of k-tuple relations


Given an input set of objects X, CRN first considers the set of subsets Q = {Q_k}_{k=2}^{k_max}, where each set Q_k contains t size-k subsets randomly sampled from X, with t the sampling frequency, t < C(n, k). On each collection Q_k, CRN then uses member relational sub-networks to infer the joint representation of k-tuple object relations. In videos, due to temporal coherence, the objects {x_i}_{i=1}^n share a great amount of mutual information; therefore, it is reasonable to use a subsampled set of t combinations instead of considering all possible combinations for Q_k. This is inspired by [28]; however, we sample t size-k subsets directly from the original X rather than randomly taking subsets from a pool of all possible size-k subsets as in their method. By doing this, we reduce the computational complexity of the CRN, as it is much cheaper to sample elements out of a set than to build all combinations of subsets, which costs O(2^n) time. We provide more analysis of the complexity of the CRN unit later in Sec. 3.4.

Regarding relational modeling, each member subset of Q_k goes through a set-based function g^k(.) for relational modeling. The relations across objects are then further refined with a conditioning function h^k(., c) in the light of the conditioning feature c. Finally, the k-tuple relations are summarized by the aggregating function p^k(.). Similar operations are performed across the range of subset sizes from 2 to k_max. Regarding the choice of k_max, we use k_max = n − 1 in later experiments, resulting in an output array of size n − 2 if n > 2 and an array of size 1 if n = 2.

The detailed operation of the CRN unit is presented formally as pseudo-code in Alg. 1 and visually in Fig. 4. Table 1 summarizes the notations used across these presentations.
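To make Alg. 1 concrete, the following is a minimal PyTorch sketch of a CRN unit, assuming vector-shaped objects, average pooling for g^k(.) and p^k(.), and the additive conditioning of Eq. 2 for h^k(., .); the class name CRNUnit and the sharing of h across tuple sizes are our simplifications, not the released implementation.

import random
import torch
import torch.nn as nn

class CRNUnit(nn.Module):
    """Minimal sketch of Alg. 1 for vector-shaped objects (an assumption)."""
    def __init__(self, dim, k_max, t=2):
        super().__init__()
        self.k_max, self.t = k_max, t
        # h^k: additive conditioning of Eq. 2, shared across tuple sizes k.
        self.h = nn.Sequential(nn.Linear(2 * dim, dim), nn.ELU())

    def forward(self, X, c):
        # X: list of n tensors of shape (dim,); c: conditioning feature of shape (dim,)
        R, n = [], len(X)
        for k in range(2, min(self.k_max, n - 1) + 1):
            conditioned = []
            for _ in range(self.t):                        # t random size-k subsets
                subset = random.sample(X, k)
                g = torch.stack(subset).mean(dim=0)        # g^k: average pooling
                conditioned.append(self.h(torch.cat([g, c])))
            R.append(torch.stack(conditioned).mean(dim=0))  # p^k: average pooling
        return R                                           # array of k-tuple relations

With k_max = n − 1, the returned list contains n − 2 relation vectors, matching the output array size stated above.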

3.1.1 Networks implementation

Set aggregation: The set functions g^k(.) and p^k(.) can be implemented as any aggregation sub-networks that join a random set into a single representation. As a choice in our implementation, the function g^k(.) is either average pooling or a simple concatenation operator, while p^k(.) is average pooling.

Conditioning function: The design of the conditioning sub-network that implements h^k(., c) depends on the relationship between the input set X and the conditioning feature c as well as the properties of the channels themselves. Here we present four forms of neural operation implementing this function.

Additive form: A simple form of h^k(., c) is feature concatenation followed by an MLP that models the non-linear relationships between multiple input modalities:

h^k(., c) = ELU(W_{h1}[.; c]),   (2)

where [ ; ] denotes the tensor concatenation operation and ELU is the non-linear activation function introduced in [51]. Eq. 2 is sufficient when x and c are additively complementary.

Multiplicative form: To support a more complex relationship between the input x and the conditioning feature c, a more sophisticated joining operation is warranted. For example, when c implies a selection criterion to modulate the relationship between elements in x, the multiplicative relation between them can be represented in the conditioning function by:

h^k(x, c) = ELU(W_{h1}[x; x ⊙ c; c]),   (3)

where ⊙ denotes the Hadamard product.

Sequential form: As aforementioned, how to properly choose the sub-network h^k(., c) is also driven by the properties of the input set X. Given the context of the later use of the CRN unit for Video QA, where elements in X may contain strong temporal relationships, we additionally integrate a sequential modeling capability, a BiLSTM network in this paper, along with the conditioning sub-network presented in Eq. 2 and Eq. 3. Formally, h^k(., c) is defined as:

s = [x; x ⊙ c; c],   (4)
s' = BiLSTM(s),   (5)
h^k(., c) = maxpool(s').   (6)

Dual conditioning form: In the later use of the CRN in Video QA, it can happen that two concurrent signals c_1, c_2 are used as conditioning features. We simply extend Eq. 2, Eq. 3 and Eq. 4 as:

h^k(x, c) = ELU(W_{h1}[x; c_1; c_2]),   (7)
h^k(x, c) = ELU(W_{h1}[x; x ⊙ c_1; x ⊙ c_2; c_1; c_2]),   (8)
s = [x; x ⊙ c_1; x ⊙ c_2; c_1; c_2],   (9)

respectively.
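A hedged sketch of the multiplicative (Eq. 3) and sequential (Eqs. 4–6) conditioning forms is given below; the module names and the use of a single projection per form are our simplifications.

import torch
import torch.nn as nn

class MultiplicativeCond(nn.Module):
    # h^k(x, c) = ELU(W [x; x*c; c])  -- Eq. 3
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(3 * dim, dim), nn.ELU())

    def forward(self, x, c):
        return self.proj(torch.cat([x, x * c, c], dim=-1))

class SequentialCond(nn.Module):
    # s = [x; x*c; c]; s' = BiLSTM(s); h = maxpool(s')  -- Eqs. 4-6
    def __init__(self, dim):
        super().__init__()
        self.bilstm = nn.LSTM(3 * dim, dim, bidirectional=True, batch_first=True)

    def forward(self, x_seq, c):
        # x_seq: (batch, steps, dim); c: (batch, dim) broadcast over time steps
        c_seq = c.unsqueeze(1).expand_as(x_seq)
        s = torch.cat([x_seq, x_seq * c_seq, c_seq], dim=-1)
        s_prime, _ = self.bilstm(s)
        return s_prime.max(dim=1).values   # (batch, 2*dim) after temporal max-pooling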

3.2 Hierarchical conditional relation network for multimodal Video QA

We use CRN blocks to build a deep network architecture that supports a wide variety of Video QA settings. In particular, two variations are specifically designed to work on short-form and long-form Video QA. For each setting, the network design adapts to exploit inherent characteristics of a video sequence, namely temporal relations, motion, linguistic conversation and the hierarchy of video structure, and to support reasoning guided by linguistic questions. We term the proposed network architecture Hierarchical Conditional Relation Networks (HCRN). The design of the HCRN by stacking reusable core units is partly inspired by modern CNN architectures, of which InceptionNet [9] and ResNet [10] are the most well-known examples. In the general form of Video QA, HCRN is a multi-stream end-to-end differentiable neural network in which one stream handles visual content and the other handles textual dialogues in subtitles. The network is modular and each network stream plays a plug-and-play role adaptively to the presence of input modalities.

3.2.1 Preprocessing

With Video QA defined as in Eq. 1, HCRN takes the input video and question, represented as sets of visual or textual objects, and computes the answer. In this section, we describe the preprocessing of raw videos into appropriate input sets for HCRN.

Visual representation: We begin by dividing the video V of L frames into N equal-length clips C = {C_1, ..., C_N}. For short-form videos, each clip C_i of length T = ⌊L/N⌋ is represented by two sources of information: frame-wise appearance feature vectors V_i = {v_{i,j} | v_{i,j} ∈ R^2048}_{j=1}^T, and a motion feature vector at clip level f_i ∈ R^2048. Appearance features are vital for video understanding as the visual saliency of objects/entities in the video is usually of interest to human questions. In short clips, moving objects and events are the primary video artifacts that capture our attention. Hence, it is common to see motion features coupled with appearance features to represent videos in the video understanding literature. On the contrary, in long-form videos such as those in movies and TV programs, the concerns are less about specific motions and more about the movie plot or film grammar. As a result, we use the frame-wise appearance features as the only visual feature for long-form Video QA. In our experiments, v_{i,j} are the pool5 outputs of ResNet [10] and f_i are derived by ResNeXt-101 [48,49].

Subsequently, linear feature transformations are applied to project {v_{i,j}} and f_i into a standard d-dimensional feature space to obtain {v_{i,j} | v_{i,j} ∈ R^d} and f_i ∈ R^d, respectively.
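A small sketch of this preprocessing step, assuming pre-extracted 2048-d features and an illustrative model dimension d, is shown below.

import torch
import torch.nn as nn

d = 512                                   # assumed model dimension
proj_app = nn.Linear(2048, d)             # projects frame appearance features
proj_mot = nn.Linear(2048, d)             # projects clip motion features

# Hypothetical pre-extracted features for one video split into N clips of T frames.
N, T = 8, 16
appearance = torch.randn(N, T, 2048)      # v_{i,j}: ResNet pool5 per frame
motion = torch.randn(N, 2048)             # f_i: ResNeXt-101 per clip

v = proj_app(appearance)                  # (N, T, d)
f = proj_mot(motion)                      # (N, d)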

Linguistic representation: Linguistic objects are built from the question, the answer choices and long-form videos' subtitles. We explore two options of representation learning for them, using BiLSTM and BERT.

– Sequential embedding with BiLSTM:

All words in the linguistic cues, including those in questions, answer choices and subtitles, are first embedded into vectors of 300 dimensions by pre-trained GloVe word embeddings [50].

For the question and answer choices, we further pass these context-independent embedding vectors through a BiLSTM. The output hidden states of the forward and backward LSTM passes are finally concatenated to form the overall query representation q ∈ R^d for the question and a ∈ R^d for the answer choices, if available (multi-choice question-answer pairs).

For the accompanying subtitles provided in long-form Video QA, instead of treating them as one big passage as in prior works [2,18], we dissect the subtitle passage S into a fixed number of M overlapping segments U = {U_1, .., U_M}. The number of words in sibling segments is identical, T = length(S)/M, but varies from one video to another depending on the overall length of the given subtitle passage S.

Similarly to the processing of questions, we process each segment with pre-trained word embeddings followed by a BiLSTM. The hidden states of the BiLSTM, which is also of size T, are then used as textual objects:

U_i = BiLSTM(U_i).   (10)

{U_i}_{i=1}^M are the final representations ready to be used by the textual stream, which will be described in Sec. 3.2.3.
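The segmentation of a subtitle passage into M equal-length, overlapping segments can be sketched as follows; the overlap ratio and the helper name split_subtitle are our assumptions, since the text fixes only the number of segments M.

import math

def split_subtitle(words, M, overlap=0.25):
    """Split a word (or embedding) sequence into M roughly equal, overlapping segments.
    The overlap ratio is an assumption; the segment length follows the passage length."""
    n = len(words)
    seg_len = max(1, math.ceil(n / M * (1 + overlap)))     # slightly longer than n / M
    step = (n - seg_len) / max(1, M - 1)                   # spacing between segment starts
    starts = [max(0, round(i * step)) for i in range(M)]
    return [words[s:s + seg_len] for s in starts]           # each segment then feeds Eq. 10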

– Contextual embedding using BERT:

As an alternative option for linguistic representation, we utilize the pre-trained BERT network [13] to extract contextual word embeddings. Instead of encoding words independently as in GloVe, BERT embeds each word in the context of its surrounding words using a self-attention mechanism.

For short-form Video QA, we tokenize the given question and the answer choices of multi-choice questions and subsequently feed the tokens into the pre-trained BERT network. The averaged embeddings of the words in a sentence are used as a unified representation for that sentence. This applies to generating both the question representation q and the answer choice representations {a_i}_{i=1,...,A}.

For long-form Video QA, with each answer choice, we form a long string l_i by stacking it with the subtitles S and the question sentence. We then tokenize and embed each string l_i with BERT into a contextual hidden matrix H of size m × d, where m is the maximum number of tokens in the input and d is the hidden dimension of BERT. This tensor H is then split up into the corresponding embedding vectors of the subtitles H_s, of the question H_q and of the answer choice H_{a_i}:

(H_s, H_q, H_{a_i}) = BERT(l_i).   (11)

Eventually, we compress the contextual token representations of the question and the answer choices into their respective single representations by mean pooling:

q = mean(H_q);  a_i = mean(H_{a_i}),   (12)

while keeping those of the subtitles H_s as a set of textual objects. {U_i}_{i=1}^M are obtained by sliding overlapping windows of the same length over H_s.


3.2.2 Visual stream

An effective model for Video QA needs to distill the visual content in the context of the question and filter out the usually large portion of the data that is not relevant to the question. Drawing inspiration from the hierarchy of video structure, we boil down the problem of Video QA into a process of video representation in which a given video is encoded progressively at different granularities, including short clip (a sequence of video frames) and entire video levels (a sequence of clips). It is crucial that the whole process conditions on linguistic cues.

With that in mind, we design a two-level structure to represent a video, one at clip level and the other at video level, as illustrated in Fig. 5. At each hierarchy level, we use two stacked CRN units, one conditioned on motion features followed by one conditioned on linguistic cues. Intuitively, the motion feature serves as a dynamic context shaping the temporal relations found among frames (at the clip level) or clips (at the video level). It also provides a saliency indicator of which relations are worth attention [52].

As the shaping effect is applied to all relations in a complementary way, a selective (multiplicative) relation between the relations and the conditioning feature is not needed, and thus a simple concatenation operator followed by an MLP suffices. On the other hand, linguistic cues are by nature selective, that is, not all relations are equally relevant to the question. Thus we utilize the multiplicative form of feature fusion as in Eq. 3 for the CRN units that condition on the question representation.

With this particular design of network architecture, the input array at clip level consists of frame-wise appearance feature vectors {v_{i,j}}, while that at video level is the output of the clip level. The motion conditioning feature of the clip-level CRNs is the corresponding clip motion feature vector f_i. These are further passed to an LSTM, whose final state is used as the video-level motion feature. Note that this particular implementation is not the only option. We believe we are the first to progressively incorporate multiple modalities of input in such a hierarchical manner, in contrast to the typical approach of treating appearance features and motion features as a two-stream network.

Deeper hierarchy: To handle a video of longer size, up to thousands of frames, which is equivalent to dozens of short-term clips, there are two options to reduce the computational cost of the CRN in handling large sets of subsets {Q_k | k = 2, ..., k_max} given an input array X: (i) limit the maximum subset size k_max, or (ii) extend the visual stream networks to a deeper hierarchy. For the former option, this choice of sparse sampling has the potential to lose critical relation information of specific subsets. The latter, on the other hand, is able to densely sample subsets for relation modeling. Specifically, we can group N short-term clips into N_1 × N_2 hyper-clips, of which N_1 is the number of hyper-clips and N_2 is the number of short-term clips in one hyper-clip. By doing this, the visual stream becomes a 3-level hierarchical network architecture. See Sec. 3.4 for the effect of going deeper on running time and Sec. 4.4.3 on accuracy.

Computing the output: At the end of the visual stream, we compute a weighted average visual feature driven by the question representation q. Assuming that the outputs of the last CRN unit at video level form an array O = {o_i | o_i ∈ R^{H×d}}_{i=1}^{N−4}, we first stack them together, resulting in an output tensor o ∈ R^{(N−4)×H×d}. We further vectorize this output tensor to obtain the final output o' ∈ R^{H'×d}, H' = (N − 4) × H. The weighted average information is given by:

I = [W_{o'} o'; W_{o'} o' ⊙ W_q q],   (13)
I' = ELU(W_I I + b),   (14)
γ = softmax(W_{I'} I' + b),   (15)
o = Σ_{h=1}^{H'} γ_h o'_h,  o ∈ R^d,   (16)

where [., .] denotes the concatenation operation and ⊙ is the Hadamard product. In the case of multi-choice questions, which are coupled with answer choices {a_i}, Eq. 13 becomes I = [W_{o'} o'; W_{o'} o' ⊙ W_q q; W_{o'} o' ⊙ W_a a_i].
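The question-driven pooling of Eqs. 13–16 can be sketched as follows; the layer names are ours, and the multi-choice extension simply appends the answer-modulated term to Eq. 13.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedPooling(nn.Module):
    """Sketch of Eqs. 13-16: attention-weighted average of CRN outputs given q."""
    def __init__(self, d):
        super().__init__()
        self.W_o = nn.Linear(d, d)
        self.W_q = nn.Linear(d, d)
        self.W_I = nn.Linear(2 * d, d)
        self.W_g = nn.Linear(d, 1)

    def forward(self, o_prime, q):
        # o_prime: (H', d) vectorized CRN outputs; q: (d,) question representation
        o_p, q_p = self.W_o(o_prime), self.W_q(q)
        I = torch.cat([o_p, o_p * q_p], dim=-1)        # Eq. 13
        I = F.elu(self.W_I(I))                         # Eq. 14
        gamma = F.softmax(self.W_g(I), dim=0)          # Eq. 15: weights over the H' rows
        return (gamma * o_prime).sum(dim=0)            # Eq. 16: o in R^d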

3.2.3 Textual stream

The HCRN architecture is readily applicable to the accompanying textual subtitles in a similar bottom-up fashion as in the visual stream. The HCRN textual stream also consists of a two-level hierarchical structure that processes the textual objects of each segment and joins segments at the passage level (See Fig. 6).

The input of the stream is the preprocessed subtitles, question, and answer choices (Sec. 3.2.1). The subtitles S are represented by their overall representation H_s and also as a sequence of equal-length segments, each of which has been encoded into a set of textual objects U_i = {u_i^t}_{t=1,..,T} ∈ R^{T×d}. Meanwhile, the question is encoded into a single vector q ∈ R^d.

Question-relevant pre-selection: Unlike video frames, the subtitles S contain irregularly timed conversations between movie characters. Furthermore, while relevant visual features are abundant throughout the video, only a small portion of the subtitles is relevant to the query and reflective of the answers a. To assure such relevance, antecedent to the CRN unit, we modulate the representation of the passage H_s and those of the M segments {U_i}_{i=1}^M with both the question and the answer choice. This is done with the pre-selection operator described below.


Fig. 5 Visual stream. The CRNs are stacked in a hierarchy, embedding the video input at different granularities including frame, short-clip and entire-video levels. At each level of granularity, the video feature embedding is conditioned on the respective level-wise motion feature and the universal linguistic cue.

Fig. 6 Textual stream. Both segment-level and passage-level textual objects are modulated with the question by a pre-selection module. They then serve as input and conditioning features for a CRN modeling long-term relationships between segments.

At the segment level, the modulated representation Ū_i ∈ R^{T×d} of each segment i of T objects is produced by

Ū_i = W_u [U_i; U_i ⊙ q].   (17)

Similarly, at the video level, the modulated subtitle representation H̄_s ∈ R^{S×d} is built as

H̄_s = W_h [H_s; H_s ⊙ q].   (18)

CRN unit: A single CRN unit of the stream operates at the passage level and models the relationships between segments. The modulated {Ū_i}_{i=1}^M are passed as input objects to a CRN unit. The conditioning feature of the CRN is a max-pooled vector of the modulated representation of the whole subtitle passage H̄_s. At the end of the textual stream, a temporal max-pooling operator is applied over the outputs of the CRN to obtain a single vector.
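A sketch of the textual stream is given below, reusing the CRNUnit sketch from Sec. 3.1; to keep that vector-only sketch usable, each modulated segment is collapsed to a single vector by max-pooling, whereas the paper keeps segments as sets of textual objects.

import torch
import torch.nn as nn

class PreSelection(nn.Module):
    """Sketch of Eqs. 17-18: modulate textual objects with the query q."""
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(2 * d, d)

    def forward(self, U, q):
        # U: (T, d) textual objects; q: (d,) query representation
        return self.W(torch.cat([U, U * q], dim=-1))

def textual_stream(segments, H_s, q, pre_select, crn_unit):
    # segments: list of M tensors (T, d); H_s: (S, d) whole passage; q: (d,)
    seg_vecs = [pre_select(U, q).max(dim=0).values for U in segments]   # simplification
    context = pre_select(H_s, q).max(dim=0).values                      # max-pooled passage context
    outputs = crn_unit(seg_vecs, context)                               # segment-level relations
    return torch.stack(outputs).max(dim=0).values                       # temporal max-pooling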

3.2.4 Adaptation & implementation

Short-form Video QA: Recall that short-form Video QA in this paper refers to QA about single-shot videos of a few seconds without accompanying textual data. For these cases, we employ the standard visual stream as described in Sec. 3.2.2 to distill the video/question joint representation. This representation is ready to be used by the answer decoder (see Sec. 3.3) for generating the final output.

Long-form Video QA: Different from short-form Video QA, long-form Video QA involves reasoning on both visual information from video frames and textual content from subtitles. Compared to short snippets, where local low-level motion saliency plays a critical role, discovering high-level semantic concepts associated with movie characters is more important [53]. Such semantics interleave in the data of both modalities. A further difference comes from the fact that long-form videos are of greater duration and hence require appropriate treatment.

Although long-form videos share with short-form ones a hierarchical structure, they are distinctive in terms of semantic compositionality and length. We employ the visual and textual streams as described in Sec. 3.2.2 and Sec. 3.2.3 with some adaptations for better suitability to the data format and structure, and simply use the joint representation of the two streams for answer prediction. Ideally, long-form Video QA requires modeling interactions between a question, visual content and textual content in subtitles. Whilst both the visual stream and the textual stream described above involve early integration of the question representation into the visual representation and the textual representation of subtitles, we do not opt for early interaction between visual content and textual content in subtitles in this work. Pairwise joint semantic representation between vision and language has proven to be useful in Video QA and closely related tasks [54]; however, it assumes the existence of pairs of multimodal sequence data that are highly semantically compatible. Those pairs are either a video sequence and a textual description of the video, or a video sequence and positive/negative answer choices to a question about the visual content in the video. This is not always the case for the visual content and the textual content in subtitles in long-form videos such as those taken from movies. Although the visual content and textual content may complement each other to some extent, in many cases they may be greatly distant from each other. Take the following scenario as an example: in a movie scene, two characters are standing and chatting with each other. While the visual content may provide information about where the conversation takes place, it hardly contains any information about the topic of their conversation. As a result, combining the visual information and textual information at an early stage has the potential to cause information distortion and make information retrieval difficult. In addition, treating the visual stream and the textual stream separately makes it easier to justify the benefits of using CRN units in modeling the relational information in each modality, and hence easier to assess the generic use of the CRN unit.

Note that in the visual stream of the HCRN architecture for long-form Video QA, we use CRN units to handle a subset of sub-sampled frames in each clip at the clip level. At the video level, another CRN gathers long-range dependencies between the clip-level information sent up from the lower-level CRN outputs.

All CRN units at both levels take the question representation as conditioning features (See Fig. 7). Compared to the standard architecture introduced in Sec. 3.2.2 and Fig. 5, we drop all CRN units that condition on motion features. This adaptation accommodates the fact that low-level motion is less relevant than the overall semantic flow at both the clip and video levels.

3.3 Answer decoders

Following previous works [8,15,4], we adopt different answer decoders depending on the task, as follows:

– QA pairs with open-ended answers are treated as multi-class classification problems. For these, we employ a classifier which takes as input the combination of the retrieved information from the visual stream o_v, the retrieved information from the textual stream o_t and the question representation q, and computes answer probabilities p ∈ R^{|A|}:

  y = ELU(W_o [o_v; o_t; W_q q + b] + b),   (19)
  y' = ELU(W_y y + b),   (20)
  p = softmax(W_{y'} y' + b).   (21)

– In the case of multi-choice questions where answer choices are available, we iteratively treat each answer choice {a_i}_{i=1,...,A} as a query in exactly the manner that we handle the question. Eq. 17 and Eq. 18 therefore take each answer choice's representation a_i as a conditioning feature along with the question representation. Eq. 19 is replaced by:

  y = ELU(W_o [o_{vqa}; o_{tqa}; W_q q + b; W_a a + b] + b),   (22)

  where o_{vqa} is the output of the visual stream with respect to the question and answer choices as queries, and o_{tqa} is the output of the textual stream counterpart.


Fig. 7 The adapted architecture of the visual stream (Fig. 5) for long-form video. At each level, only question-conditioned CRNs are employed. The motion-conditioned CRNs are unnecessary as the low-level motion features are less relevant in long-form media.

– For short-form Video QA, we simply drop the retrieved information from the textual stream o_t in Eq. 19 and Eq. 22. We also use the popular hinge loss, as presented in [15], for pairwise comparisons, max(0, 1 + s_n − s_p), between the scores of incorrect answers s_n and correct answers s_p to train the network:

  s = W_{y'} y' + b.   (23)

  For long-form Video QA, we use cross entropy as the training loss for fair comparison with prior works.

– For the repetition count task, we use a linear regression function taking y' in Eq. 20 as input, followed by a rounding function for integer count results. The loss for this task is Mean Squared Error (MSE). A code sketch of these decoders and losses follows this list.
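The sketch below covers the open-ended decoder (Eqs. 19–21), the hinge loss and the count loss; names and the use of a single bias per layer are simplifications.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenEndedDecoder(nn.Module):
    """Sketch of Eqs. 19-21 for open-ended (classification) answers."""
    def __init__(self, d, num_answers):
        super().__init__()
        self.W_q = nn.Linear(d, d)
        self.W_o = nn.Linear(3 * d, d)
        self.W_y = nn.Linear(d, d)
        self.W_p = nn.Linear(d, num_answers)

    def forward(self, o_v, o_t, q):
        y = F.elu(self.W_o(torch.cat([o_v, o_t, self.W_q(q)], dim=-1)))  # Eq. 19
        y = F.elu(self.W_y(y))                                           # Eq. 20
        return F.softmax(self.W_p(y), dim=-1)                            # Eq. 21

def hinge_loss(s_pos, s_neg):
    # max(0, 1 + s_n - s_p) for multi-choice short-form Video QA (scores from Eq. 23)
    return torch.clamp(1.0 + s_neg - s_pos, min=0.0).mean()

def count_loss(y_pred, y_true):
    # Mean Squared Error for the repetition count task
    return F.mse_loss(y_pred, y_true)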

3.4 Complexity analysis

We now provide a theoretical analysis of the running time of CRN units and HCRN. We will show that adding layers can save computation time, providing a justification for deep hierarchies.

3.4.1 CRN units

For clarity, let us recall the notations introduced in our CRN units: k_max is the maximum subset (also tuple) size considered from a given input array of n objects, subject to k_max < n; t is the number of size-k subsets, k = 2, 3, ..., k_max, randomly sampled from the input set; g^k(.), h^k(., .) and p^k(.) are sub-networks for relation modeling, conditioning and aggregating, respectively. In our implementation, g^k(.) and p^k(.) are chosen to be set functions and h^k(., .) is a nonlinear transformation that fuses modalities.

Each input object in CRN is arranged into a matrix of size K × F, where K is the number of object elements and F is the embedding size of each element. All the operations involving input objects are element-wise, that is, they are linear in time w.r.t. K. Assume that the set function of order k in the CRN's operation in Alg. 1 has linear time complexity in k. This holds true for most aggregation functions such as mean, max, sum or product. With the relation orders ranging over k = 2, 3, ..., k_max and sampling frequency t, the inference time cost of a CRN is:

cost_CRN(t, k_max, K, F) = cost(g) + cost(h),   (24)

where

cost(g) = O((t/2) k_max (k_max − 1) K F),
cost(h) = O((4t + 2)(k_max − 1) K F^2).

Here cost(g) is quadratic in k_max because each g(.) that takes k objects as input costs k time, for k = 2, 3, ..., k_max. The running time of cost(h) is quadratic in F due to the feature transformation operation, which costs F^2 time. When k_max ≪ F, cost(h) will dominate the time complexity. However, since the function h(.) involves matrix operations only, it is usually fast.


The unit produces an output array of length k_max − 1, where each output object is of the same size as the input objects.

3.4.2 HCRN models

We restrict the complexity analysis to the visual stream, whose complexity increases linearly with video length. The overall complexity of HCRN depends on the design choice for each CRN unit and the specific arrangement of CRN units. For clarity, let t = 2 and k_max = n − 1, which are found to work well in experiments. Let L be the video length, organized into N clips of length T each, i.e., L = NT.

2-level HCRN: Consider, for example, the 2-level HCRN architecture, representing clips and video. Each level is a stack of two CRN layers, one for motion conditioning followed by the other for linguistic conditioning. The clip-level CRNs cost N × cost_CRN(2, T − 1, 1, F) time for motion conditioning and N × cost_CRN(2, T − 3, 1, F) time for question conditioning, where cost_CRN is the cost estimator in Eq. (24). This adds up to roughly O(2TLF) + O(10LF^2) time.

Now the output array of size (T − 4) × F of the question-conditioned clip-level CRN becomes one of the N input objects of the video-level CRNs. The CRNs at the video level therefore cost cost_CRN(2, N − 1, T − 4, F) time for the motion-conditioned one and cost_CRN(2, N − 3, T − 4, F) time for the question-conditioned one, respectively, totaling O(2NLF) + O(10LF^2). Here we have made use of the identity L = NT. The total cost is therefore of the order O(2(T + N)LF) + O(20LF^2).

3-level HCRN: Let us now analyze a 3-level HCRN architecture that generalizes the 2-level HCRN. The N clips are organized into P sub-videos, each with Q clips, i.e., N = PQ. Since the CRNs at clip level remain the same, the first level costs 2TLF time to compute, as before. Moving to the next level, each sub-video CRN takes as input an array of length Q, whose elements have size (T − 4) × F. Applying the same logic as before, the set of sub-video-level CRNs costs roughly P × cost_CRN(2, Q − 1, T − 4, F) time, or approximately O(2(N/P)LF) + O(10LF^2). Here we have used the identities N = PQ and L = NT.

A stack of two sub-video CRNs now produces an output array of size (Q − 4)(T − 4) × F, serving as an input object in an array of length P for the video-level CRNs. Thus the video-level CRNs cost roughly cost_CRN(2, P − 1, (Q − 4)(T − 4), F) time, or approximately O(2PLF) + O(10LF^2). Here we again have used the identities L = NT and N = PQ. Thus the total cost is of the order O(2(T + N/P + P)LF) + O(30LF^2).

Deeper models might save time for long videos: Recall that the 2-level HCRN has a time cost of O(2(T + N)LF) + O(20LF^2), and the 3-level HCRN a cost of O(2(T + N/P + P)LF) + O(30LF^2). Here the cost that is linear in F is due to the g functions, and the quadratic cost in F is due to the h functions.

When going from the 2-level to the 3-level architecture, the cost for the g functions drops by O(2(N − N/P − P)LF), and the cost for the h functions increases by O(10LF^2). Now assume N ≫ max{P, N/P}, for example P ≈ √N and the number of clips N > 20. Then the linear drop can be approximated further as O(2NLF). As N = L/T, this can be written as O(2(L^2/T)F). In practice the clip size T is often fixed, thus the drop in the g functions scales quadratically with video length L, whereas the increase in the h functions scales linearly with L. This suggests that going deeper in the hierarchy could actually save running time for long videos.

See Sec. 4.4.3 for empirical validation of the saving.
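For a rough sense of the trade-off, the helper below evaluates the two cost expressions with their leading constants kept only for relative comparison; the parameter values are illustrative, and the crossover point at which the 3-level variant becomes cheaper depends on the clip size T and feature size F.

import math

def hcrn2_cost(T, N, L, F):
    # 2-level HCRN: O(2(T + N) L F) + O(20 L F^2)
    return 2 * (T + N) * L * F + 20 * L * F * F

def hcrn3_cost(T, N, L, F, P=None):
    # 3-level HCRN: O(2(T + N/P + P) L F) + O(30 L F^2), with P ~ sqrt(N) as in the text
    P = P or round(math.sqrt(N))
    return 2 * (T + N / P + P) * L * F + 30 * L * F * F

T, N, F = 16, 24, 512            # hypothetical clip size, clip count, feature size
L = N * T
print(hcrn2_cost(T, N, L, F), hcrn3_cost(T, N, L, F))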

4 Experiments

4.1 Datasets

We evaluate the effectiveness of the proposed CRN unit and the HCRN architecture on a series of short-form and long-form Video QA datasets. In particular, we use three different datasets as benchmarks for short-form Video QA, namely TGIF-QA [15], MSVD-QA [45] and MSRVTT-QA [16]. All three datasets are collected from real-world videos. We also evaluate HCRN on long-form Video QA using one of the largest publicly available datasets, TVQA [2]. Details of each benchmark are given below.

TGIF-QA: This is currently the most prominent dataset for Video QA, containing 165K QA pairs and 72K animated GIFs. The dataset covers four tasks addressing unique properties of video, of which the first three require strong spatio-temporal reasoning abilities: Repetition Count – retrieving the number of occurrences of an action; Repeating Action – a multi-choice task to identify the action that is repeated a given number of times; State Transition – multi-choice tasks regarding the temporal order of events. The last task, Frame QA, is akin to image QA, where the answer to a given question can be found in one particular video frame.

MSVD-QA: This is a small dataset of 50,505 question-answer pairs annotated from 1,970 short clips. Questions are of five types, namely what, who, how, when and where; 61% of the questions are used for training, whilst 13% and 26% are used as the validation and test sets, respectively.


MSRVTT-QA: The dataset contains 10K videos and 243K question-answer pairs. Similar to MSVD-QA, questions are of five types. The train, validation and test splits account for 65%, 5%, and 30% of the data, respectively. Compared to the other two datasets, videos in MSRVTT-QA contain more complex scenes. They are also much longer, ranging from 10 to 30 seconds, equivalent to 300 to 900 frames per video.

TVQA: This is one of the largest long-form Video QA datasets, annotated from 6 different TV shows: The Big Bang Theory, How I Met Your Mother, Friends, Grey’s Anatomy, House, and Castle. There are in total 152.5K question-answer pairs, each associated with 5 answer choices, collected from 21.8K long clips of 60/90 seconds; they break down into 122K, 15.25K and 15.25K pairs for the train, validation and test sets, respectively. The dataset also provides start and end timestamps for each question to limit the video portion where one can find the corresponding answer.

Regarding the evaluation metric, we mainly use accuracy in all experiments, except for repetition count on the TGIF-QA dataset, where Mean Square Error (MSE) is used.
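For reference, a minimal sketch of the two metrics (illustrative helpers, not taken from the released code):

```python
# Illustrative helpers (not from the released code) for the two metrics:
# accuracy for the classification-style tasks and MSE for repetition count.
import numpy as np

def accuracy(pred_labels, true_labels):
    return float(np.mean(np.asarray(pred_labels) == np.asarray(true_labels)))

def count_mse(pred_counts, true_counts):
    diff = np.asarray(pred_counts, dtype=float) - np.asarray(true_counts, dtype=float)
    return float(np.mean(diff ** 2))
```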

4.2 Implementation details

4.2.1 Feature extraction

For short-form Video QA datasets, each video is preprocessed into N short clips of fixed length, 16 frames each. In detail, we first locate N equally spaced anchor frames. Each clip is then defined as the sequence of 16 consecutive video frames taking a pre-computed anchor as its middle frame. For the first and the last clip, where frame indices may exceed the video boundaries, we repeat the first or the last frame of the video multiple times until the clip is filled.
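The following is a minimal sketch of this segmentation step, assuming 0-based frame indices and a clip size of 16; the exact boundary handling in the released code may differ.

```python
# Minimal sketch of the clip segmentation described above: N equally spaced
# anchors, 16-frame clips centered on each anchor, edge frames repeated by
# clamping the indices.
import numpy as np

def segment_clips(frames, num_clips=8, clip_len=16):
    L = len(frames)
    anchors = np.linspace(0, L - 1, num_clips).round().astype(int)
    clips = []
    for a in anchors:
        idx = np.arange(a - clip_len // 2, a + clip_len // 2)
        idx = np.clip(idx, 0, L - 1)   # repeat the first/last frame at the boundaries
        clips.append([frames[i] for i in idx])
    return clips                        # num_clips lists of clip_len frames each
```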

Given the segmented clips, we extract motion features for each clip using a pre-trained ResNeXt-101 model1 [48, 49]. Regarding the appearance feature used in the experiments, we take the pool5 output of ResNet [10] as the feature representation of each frame. This means we completely ignore the 2D spatial structure of video frames, which is likely to be beneficial for answering questions specifically about an object's appearance, such as those in the Frame QA task of the TGIF-QA dataset. We are aware of this but deliberately opt for lightweight extracted features, keeping the main focus of our work on the significance of temporal relations, motion, and the hierarchy inherent in video data.

1 https://github.com/kenshohara/video-classification-3d-cnn-pytorch

Note that most of the videos in the short-form Video QA datasets, except those in the MSRVTT-QA dataset, are short. Hence, we intentionally divide each video into 8 clips (8×16 frames) to produce partially overlapping frames between clips and avoid temporal discontinuity. Longer videos in MSRVTT-QA are additionally segmented into 24 clips of 16 frames each, primarily aiming at evaluating the model's ability to handle very long sequences.

For the TVQA long-form dataset, we follow a similar strategy as for short-form videos and divide each video into N clips. However, as TVQA videos are longer and recorded at only 3 fps, we adapt by choosing N = 6, with each clip containing 8 frames, based on empirical experience. The pool5 output of ResNet features is also used as the feature representation of each frame.

For the subtitles, the maximum subtitle length is set to 256. We simply truncate subtitles that are longer and zero-pad those that are shorter. Subtitles are further divided into 6 segments overlapping by half a segment size.
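A hedged sketch of this subtitle preprocessing is given below; the exact segment length and boundary handling are our assumptions, chosen so that 6 half-overlapping segments roughly tile the 256-token window.

```python
# Hedged sketch of the subtitle preprocessing: pad/truncate to 256 tokens and
# split into 6 segments overlapping by half a segment.
def preprocess_subtitle(token_ids, max_len=256, num_segments=6):
    tokens = token_ids[:max_len] + [0] * max(0, max_len - len(token_ids))
    seg_len = 2 * max_len // (num_segments + 1)   # ~73 tokens per segment (assumed)
    stride = seg_len // 2                          # half-segment overlap
    return [tokens[i * stride: i * stride + seg_len] for i in range(num_segments)]
```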

4.2.2 Network training

HCRN and its variations are implemented in Python 3.6 with PyTorch 1.2.0. Common settings include d = 512 and t = 2 for both visual and textual streams. For all experiments, we train the model using the Adam optimizer with a batch size of 32, starting at a learning rate of $10^{-4}$ and decaying it by half after every 5 epochs for the counting task in the TGIF-QA dataset and after every 10 epochs for the others. All experiments are terminated after 25 epochs, and the reported results are at the epoch giving the best validation accuracy. Depending on the amount of training data and the hierarchy depth, training takes around 4-30 hours on a single NVIDIA Tesla V100 GPU. A PyTorch implementation of the model is publicly available2.
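The optimization schedule can be summarized by the following sketch; the model and data-loader objects are placeholders, not the authors' code.

```python
# Sketch of the schedule above: Adam at 1e-4, halved every 5 epochs (counting)
# or 10 epochs (other tasks), 25 epochs in total with a batch size of 32.
import torch

def train(model, train_loader, is_counting_task=False, epochs=25):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    step = 5 if is_counting_task else 10
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step, gamma=0.5)
    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(batch)       # assumed to return the training loss
            loss.backward()
            optimizer.step()
        scheduler.step()
```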

As for experiments with the large-scale language representation model BERT, we use the latest pre-trained model provided by Hugging Face3. We fine-tune the BERT model during training with a learning rate of $2\times 10^{-5}$ in all experiments.
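A hedged sketch of this setup is shown below, using the Hugging Face transformers API; restricting training to the last two encoder layers follows the observation reported in Sec. 4.3.1, and the layer-selection code is our assumption rather than the released implementation.

```python
# Hedged sketch: load a pre-trained BERT from Hugging Face, freeze all but the
# last two encoder layers and fine-tune them at a learning rate of 2e-5.
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
for name, param in bert.named_parameters():
    # Keep only the last two encoder layers trainable.
    param.requires_grad = ("encoder.layer.10" in name) or ("encoder.layer.11" in name)

bert_optimizer = torch.optim.Adam(
    [p for p in bert.parameters() if p.requires_grad], lr=2e-5)
```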

4.3 Quantitative results

We compare our proposed model with state-of-the-art methods (SOTAs) on the aforementioned datasets. By default, we use pre-trained GloVe embeddings [50] for word embedding, followed by a BiLSTM for sequential modeling. Experiments using contextual embeddings from a pre-trained BERT network [13] are explicitly marked with “(BERT)”. Detailed experiments for short-form and long-form Video QA are as follows.

2 https://github.com/thaolmk54/hcrn-videoqa
3 https://github.com/huggingface/transformers


4.3.1 Short-form Video QA

For TGIF-QA, we compare with the most recent SOTAs, including [8, 1, 15, 3], over four tasks. These works, except for [3], make use of motion features extracted from either optical flow or 3D CNNs and their variants.

The results are summarized in Table 2 for TGIF-QA, and in Fig. 8 for MSVD-QA and MSRVTT-QA. Results of the previous works are taken from either the original papers or those reported by [8]. It is clear that our model consistently outperforms or is competitive with SOTA models on all tasks across all datasets. The improvements are particularly noticeable when strong temporal reasoning is required, i.e., for questions involving actions and transitions in TGIF-QA. These results confirm the significance of considering both near-term and far-term temporal relations toward finding correct answers. In addition, results on the TGIF-QA dataset over 10 runs with different random seeds confirm the robustness of the CRN units against the randomness in input subset selection. More analysis of how relations affect the model's performance is provided in the later ablation studies.

Regarding results with the recent advance in language representation models, BERT does not have much effect on the tasks requiring strong temporal reasoning in TGIF-QA. This can be explained by the fact that questions in this dataset are relatively short and all questions are created from a limited number of patterns; hence, contextual embeddings bring little benefit. In contrast, the Frame QA task relies on a much richer vocabulary, thanks to its free-form questions, so contextual embeddings extracted by BERT greatly boost the model's performance on this task. We also empirically find that fine-tuning only the last two layers of the BERT network gives more favorable performance compared to fine-tuning all layers.

The MSVD-QA and MSRVTT-QA datasets represent highly challenging benchmarks for machines compared to TGIF-QA, owing to their open-ended nature. Our HCRN model outperforms existing methods on both datasets, achieving 36.8% and 35.4% accuracy, improvements of 2.4 points and 0.4 points on MSVD-QA and MSRVTT-QA, respectively. This suggests that the model can handle both small and large datasets better than existing methods. We also fine-tune the contextual embeddings by BERT on these two datasets. Results show that the contextual embeddings bring great benefits on both MSVD-QA and MSRVTT-QA, achieving 39.3% and 38.3% accuracy, respectively. These figures are consistent with the result on the Frame QA task of the TGIF-QA dataset. Note that we again fine-tune only the last two layers of the BERT network in these experiments, based on empirical results.

Finally, we provide a justification for the competitive performance of our HCRN against existing rivals by comparing model features in Table 3. Whilst it is not straightforward to compare head-to-head on internal model designs, it is evident that effective video modeling necessitates handling motion, temporal relations and hierarchy at the same time. We support this hypothesis with further detailed studies in Sec. 4.4 (for motion, temporal relations, shallow hierarchy) and Sec. 4.4.3 (deep hierarchy).

4.3.2 Long-form Video QA

For TVQA, we compare our method to several baselines as well as state-of-the-art methods (see Table 4). The dataset is relatively new and, due to the challenges of long-form Video QA, there have been only a few attempts at benchmarking it. Comparisons of the proposed method with the baselines are made on the validation set, while results of compared methods are taken as reported in their original publications. Experiments using timestamp information are indicated by w/ ts and those with full-length subtitles by w/o ts. If not indicated explicitly, “V.”, “S.” and “Q.” are short for visual features, subtitle features and question features, respectively. The evaluated settings are as follows:

– (B) Q.: This is the simplest baseline, where we only use question features to predict the correct answer. Results in this setting reveal how much our model relies on linguistic bias to arrive at correct answers.

– (B) S. + Q. (w/o pre-selection): In this baseline, we use question and subtitle features for prediction. Both question and subtitle features are simply obtained by concatenating the final output hidden states of the forward and backward LSTM passes, and are further passed to a classifier as in Sec. 3.3.

– (B) S. + Q. (w/ pre-selection): This baseline shows the significance of pre-selection in the textual stream as in Eq. 18. Specifically, question features and the selective output between the question and the subtitles, as explained in Eq. 18, are fed into a classifier for prediction.

– (B) V. + Q.: In this baseline, we simply apply average pooling to collapse the visual features of an entire video into a vector and further combine it with question features for prediction (see the sketch after this list).

– (B) S. + V. + Q. (w/ pre-selection): We combine the two baselines “(B) S. + Q. (w/ pre-selection)” and “(B) V. + Q.” right above. That is, subtitle features are extracted with pre-selection as in Eq. 18 and video features are pooled over space-time before going through a classifier.

– (HCRN) S. + Q.: This is to evaluate the effect of the textual stream alone in our proposed network architecture, as described in Fig. 6.

– (HCRN) V. + Q.: This is to evaluate the effect of the visual stream counterpart alone in Fig. 7.

– (HCRN) S. + V. + Q.: Our full proposed architecture with both modalities present.
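As a concrete illustration of the simplest visual baseline in the list above, the following sketch shows the “(B) V. + Q.” setting; the dimensions and classifier head are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Hedged sketch of the "(B) V. + Q." baseline: average-pool visual features over
# space-time, concatenate with the question vector, and classify over the 5
# answer choices.
import torch
import torch.nn as nn

class AvgPoolVQBaseline(nn.Module):
    def __init__(self, vis_dim=2048, q_dim=512, num_answers=5):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(vis_dim + q_dim, 512), nn.ELU(), nn.Linear(512, num_answers))

    def forward(self, video_feats, question_feat):
        # video_feats: (batch, num_frames, vis_dim); question_feat: (batch, q_dim)
        pooled = video_feats.mean(dim=1)   # smash the whole video into one vector
        return self.classifier(torch.cat([pooled, question_feat], dim=-1))
```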


As shown in Table 4, using BERT for contextual word embeddings significantly improves performance compared to the GloVe embedding and BiLSTM counterpart. The results are consistent with those reported by [47]. Although our model only achieves competitive performance with [47] in the setting using timestamps (71.3 vs. 72.1), we significantly outperform them by 3.0 absolute points in the more challenging setting where we do not have access to timestamp indicators of where answers are located. This clearly shows that our bottom-up approach is promising for solving long-form Video QA in its general setting.

Even though BERT is designed for contextual embedding, which pre-computes some word relations, HCRN still shows additional benefit in modeling relations between segments in the passage. However, this benefit is not as clearly demonstrated as in the case of using GloVe embeddings coupled with a BiLSTM (Table 4 - Exp. 9 vs. Exp. 7 and Exp. 13 vs. Exp. 14). To better understand the behavior of HCRN, we concentrate our further analysis on comparisons with GloVe + LSTM.

The results clearly show that using the output hidden states of a BiLSTM to represent subtitle features directly for classification has very limited effect (see Table 4 - Exp. 3 vs. Exp. 4). This can be explained by the LSTM failing to handle such long subtitles of hundreds of words. Meanwhile, pre-selection plays a critical role in finding information in the subtitles relevant to a question, which boosts performance from 42.3% to 57.9% when using full subtitles and from 41.8% to 62.1% when using timestamps (see Exp. 5). Our HCRN further improves by approximately 1.0 point (without timestamp annotation) and 1.5 points (with timestamp annotation) compared to the baseline of GloVe embeddings and BiLSTM with pre-selection in the textual-stream-only setting. As for the effects on the visual stream, our HCRN gains around 1.0 point over the baseline of averaging visual features. This leads to an improvement of more than 1.0 point when leveraging both visual and textual streams together (see Exp. 12 vs. Exp. 15), both with and without timestamp annotation. Even though our results are behind [2], we wish to point out that their context matching model with bi-directional attention flow (BiDAF) [55] contributes significantly to their performance, whereas our model only uses a vanilla BiLSTM for sequence modeling. Therefore, a direct comparison between the two methods is unfair. We instead mainly compare our method with a simple variant of [2] obtained by dropping the BiDAF, to show the contribution of our CRN unit as well as the HCRN network. This is equivalent to the baseline (B) S. + V. + Q. (w/ pre-selection) in this paper.

Fig. 8 Performance comparison on the MSVD-QA and MSRVTT-QA datasets with state-of-the-art methods: Co-mem [1], HME [8], HRA [56], and AMU [45].

4.4 Ablation studies

To provide more insights about the roles of HCRN's components, we conduct extensive ablation studies on the TGIF-QA and TVQA datasets with a wide range of configurations. The results are reported in Table 5 for short-form Video QA and in Table 6 for long-form Video QA.

4.4.1 Short-form Video QA

Overall we find that our sampling method does not hurt HCRN performance much while being far more efficient than the one used by [28] in terms of complexity. We also find that ablating any of the design components or CRN units degrades performance on the temporal reasoning tasks (action, transition and action counting). The effects are detailed as follows.

Effect of relation order kmax and resolution t: Without relations (kmax = 1, t = 1) the performance suffers, specifically on action and event reasoning, whereas counting tends to be better. This is expected since those questions often require putting actions and events in relation with a larger context (e.g., what happens before something else), while motion flow, rather than far-term relations, is critical for counting. In this case, most of the tasks benefit from increasing the sampling resolution t (t > 1) because of the better chance of finding a relevant frame, as well as the benefit of the far-term temporal relations learned by the aggregating sub-network pk(.) of the CRN unit. However, when taking relations into account (kmax > 1), we find that HCRN is robust against the sampling resolution t but depends critically on the maximum relation order kmax. The relative independence w.r.t. t can be attributed to visual redundancy between frames, so that resampling may capture almost the same information. On the other hand, when considering only low-order object relations, the performance drops significantly in action and transition tasks while being slightly better for counting and Frame QA.


Model | Action | Trans. | Frame | Count
ST-TP [15] | 62.9 | 69.4 | 49.5 | 4.32
Co-mem [1] | 68.2 | 74.3 | 51.5 | 4.10
PSAC [3] | 70.4 | 76.9 | 55.7 | 4.27
HME [8] | 73.9 | 77.8 | 53.8 | 4.02
HCRN* | 75.2±0.4 | 81.3±0.2 | 55.9±0.3 | 3.93±0.03
HCRN (embeddings with BERT) | 69.8 | 79.8 | 57.9 | 3.96

Table 2 Comparison with the state-of-the-art methods on the TGIF-QA dataset. For Count, lower is better (MSE); for the others, higher is better (accuracy). *Mean with standard deviation over 10 runs.

Model | Appear. | Motion | Hiera. | Relation
ST-TP [15] | X | X | |
Co-mem [1] | X | X | |
PSAC [3] | X | | |
HME [8] | X | X | |
HCRN | X | X | X | X

Table 3 Model design choices and input modalities in comparison. See Table 2 for corresponding performance on the TGIF-QA dataset.

These results confirm that high-order relations are required for temporal reasoning. As the Frame QA task requires reasoning on only a single frame, incorporating temporal information might confuse the model. Similarly, when the model makes use of only the high-order relations (fix k = kmax, kmax = n − 1, t = 2), the performance suffers, suggesting that combining both low-order and high-order object relations is considerably more effective.

Effect of hierarchy: We design two simpler models with only one CRN layer:

– 1-level, video CRN on key frames only: Using only one CRN at the video level, whose input array consists of the key frames of the clips. Note that video-level motion features are still maintained.

– 1.5-level, clip CRNs → pooling: Only the clip-level CRNs are used, and their outputs are mean-pooled to represent a given video. The pooling operation represents a simplistic relational operation across clips.

The results confirm that a hierarchy is needed for high performance on temporal reasoning tasks.

Effect of motion conditioning: We evaluate the following settings:

– w/o short-term motions: Remove all CRN units that condition on the short-term motion features (clip level) in the HCRN.

– w/o long-term motions: Remove the CRN unit that conditions on the long-term motion features (video level) in the HCRN.

– w/o motions: Remove motion features from being used by HCRN.

We find that motion, in agreement with prior art, is critical for detecting actions and hence for computing action counts. Long-term motion is particularly significant for the counting task, as this task requires maintaining a global temporal context during the entire process. For other tasks, short-term motion is usually sufficient; e.g., in the action task, where one action is repeatedly performed throughout the video, long-term context contributes little. Not surprisingly, motion does not play a positive role in answering questions about single frames, as only appearance information is needed.

Effect of linguistic conditioning and multiplicative relation: Linguistic cues represent a crucial context for selecting relevant visual artifacts. For that we test the following ablations:

– w/o question @clip level: Remove the CRN unit that conditions on the question representation at the clip level.

– w/o question @video level: Remove the CRN unit that conditions on the question representation at the video level.

– w/o linguistic condition: Exclude all CRN units conditioning on the linguistic cue, while the linguistic cue is still used in the answer decoder.

Likewise, the multiplicative relation form offers a selection mechanism, so we study its effect as follows:

– w/o MUL. in all CRNs: Exclude the use of multiplicative relations in all CRN units.

– w/ MUL. relation for question & motion: Leverage multiplicative relations in all CRN units.

We find that the conditioning question provides an important context for encoding video. Conditioning features (motion and language), through the multiplicative relation as in Eq. 3, offer a further performance gain in all tasks other than Frame QA, possibly by selectively passing question-relevant information up the inference chain.

Effect of subset sampling: We conduct experiments with the full model of Fig. 5 (a) with kmax = n − 1, t = 2. The experiment “Sampled from pre-computed superset” refers to our CRN units using the same sampling trick as in [28], where the set Qk is sampled from a pre-computed collection of all possible size-k subsets. “Directly sampled”,


Exp. | Model | Val. Acc. (w/o ts) | Val. Acc. (w/ ts)

State-of-the-art methods
1 | TVQA S.+Q.+V. [2][47] | 64.7 | 67.7
2 | VideoQABERT [47] | 63.1 | 72.1

Textual stream only
3 | (B) Q. | 41.6 | 41.6
4 | (B) S. + Q. (w/o pre-selection) | 42.3 | 41.8
5 | (B) S. + Q. (w/ pre-selection) | 57.9 | 62.1
6 | (B) Q. (BERT) | 44.2 | 44.2
7 | (B) S. + Q. (BERT) | 65.1 | 71.1
8 | (HCRN) S. + Q. | 59.0 | 63.6
9 | (HCRN) S. + Q. (BERT) | 66.0 | 71.1

Visual stream only
10 | (B) V. + Q. | 42.2 | 42.2
11 | (HCRN) V. + Q. | 43.1 | 43.1

Two streams
12 | (B) S. + V. + Q. (w/ pre-selection) | 58.5 | 63.2
13 | (B) S. + V. + Q. (BERT) | 65.7 | 71.4
14 | (HCRN) S. + V. + Q. (BERT) | 66.1 | 71.3
15 | (HCRN) S. + V. + Q. | 59.1 | 64.7

Table 4 Comparison with baselines and state-of-the-art methods on the TVQA dataset. w/o ts: without using timestamp annotation to limit the search space of where to find answers; w/ ts: making use of the timestamp annotation.

Model | Action | Transition | FrameQA | Count

Relations (kmax, t)
kmax = 1, t = 1 | 72.8 | 80.7 | 56.3 | 3.87
kmax = 1, t = 3 | 73.4 | 81.1 | 56.0 | 3.94
kmax = 1, t = 5 | 74.0 | 80.3 | 56.4 | 3.86
kmax = 1, t = 7 | 74.8 | 80.0 | 56.5 | 3.85
kmax = 1, t = 9 | 73.7 | 81.1 | 56.4 | 3.82
kmax = 2, t = 2 | 72.8 | 80.8 | 56.3 | 3.80
kmax = 2, t = 9 | 73.1 | 81.4 | 56.0 | 3.85
kmax = 4, t = 2 | 74.2 | 81.5 | 56.8 | 3.88
kmax = 4, t = 9 | 73.4 | 81.8 | 56.5 | 3.83
kmax = ⌊n/2⌋, t = 2 | 74.7 | 81.1 | 55.7 | 3.85
kmax = ⌊n/2⌋, t = 9 | 74.7 | 81.0 | 55.4 | 3.94
kmax = n−1, t = 1 | 74.2 | 80.8 | 55.3 | 3.98
kmax = n−1, t = 3 | 75.2 | 81.1 | 56.2 | 4.00
kmax = n−1, t = 5 | 75.0 | 81.3 | 55.6 | 3.95
kmax = n−1, t = 7 | 75.3 | 81.9 | 55.8 | 4.00
kmax = n−1, t = 9 | 74.9 | 81.1 | 55.6 | 3.94
Fix k = kmax, kmax = n−1, t = 2 | 72.9 | 80.2 | 56.5 | 3.90

Hierarchy
1-level, video CRN only | 72.7 | 81.4 | 57.2 | 3.88
1.5-level, clips→pool | 73.1 | 81.2 | 57.2 | 3.88

Motion conditioning
w/o motion | 69.8 | 78.4 | 57.9 | 4.38
w/o short-term motion | 74.1 | 80.9 | 56.4 | 3.87
w/o long-term motion | 75.8 | 80.8 | 56.8 | 3.97

Linguistic conditioning
w/o linguistic condition | 68.7 | 80.5 | 56.6 | 3.92
w/o question @clip level | 75.0 | 81.0 | 56.0 | 3.85
w/o question @video level | 74.2 | 81.0 | 55.3 | 3.90

Multiplicative relation
w/o MUL. in all CRNs | 74.0 | 81.7 | 55.8 | 3.86
w/ MUL. for both question & motion | 75.4 | 80.2 | 55.1 | 3.98

Subset sampling
Sampled from pre-computed superset | 75.2 | 81.3 | 55.9 | 3.90
Directly sampled | 75.3 | 81.8 | 55.3 | 3.89

Table 5 Ablation studies on the TGIF-QA dataset. For Count, lower is better (MSE); for the others, higher is better (accuracy). When not explicitly specified, we use kmax = n − 1, t = 2 for the relation order and sampling resolution.


Exp. | Model | Val. Acc. (w/o ts) | Val. Acc. (w/ ts)

Textual stream only
1 | S. + Q. (w/ MUL. + w/ LSTM) | 59.0 | 63.6
2 | S. + Q. (w/ MUL. + w/o LSTM) | 57.1 | 63.6
3 | S. + Q. (w/o MUL. + w/ LSTM) | 59.1 | 63.9
4 | S. + Q. (w/o MUL. + w/o LSTM) | 57.4 | 63.5

Visual stream only
5 | V. + Q. (w/ MUL. + w/ LSTM) | 42.7 | 42.7
6 | V. + Q. (w/ MUL. + w/o LSTM) | 43.1 | 43.1
7 | V. + Q. (w/o MUL. + w/ LSTM) | 42.2 | 42.2
8 | V. + Q. (w/o MUL. + w/o LSTM) | 42.1 | 42.1

Two streams
9 | S. + V. + Q. (S. w/ MUL. + LSTM, V. w/ MUL.) | 59.1 | 64.7

Table 6 Ablation studies on TVQA dataset.

in contrast, refers to our sampling method as described in Alg. 1. The empirical results show that directly sampling size-k subsets from an input set does not degrade the performance of the HCRN much for short-form Video QA, suggesting it is a better choice when dealing with large input sets, as it reduces the complexity of the CRN units.
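The two strategies can be contrasted with the following sketch (illustrative code, not Alg. 1 itself).

```python
# Illustrative contrast between the two sampling strategies: enumerating all
# size-k subsets before sampling versus sampling size-k subsets directly.
import random
from itertools import combinations

def sample_from_superset(objects, k, m):
    # Pre-compute all C(n, k) subsets, then draw m of them (costly for large n).
    superset = list(combinations(objects, k))
    return random.sample(superset, min(m, len(superset)))

def sample_directly(objects, k, m):
    # Draw m size-k subsets without materializing the superset.
    return [random.sample(objects, k) for _ in range(m)]
```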

4.4.2 Long-form Video QA

We focus on studying the effect of different options for the conditioning sub-network hk(., .) in Sec. 3.1, which reflects the flexibility of our model when dealing with different forms of input modalities. The options are:

– S. + Q. (w/ MUL. + w/ LSTM): Only the textual stream is used. The conditioning sub-network hk(., .) uses the multiplicative relation between tuples of segments and the conditioning feature, and is coupled with a BiLSTM as formulated in Eqs. (4, 5 and 6).

– S. + Q. (w/ MUL. + w/o LSTM): Remove the BiLSTM network from the experiment S. + Q. (w/ MUL. + w/ LSTM) to evaluate the effect of sequential modeling in the textual stream.

– S. + Q. (w/o MUL. + w/ LSTM): The multiplicative relation in the experiment S. + Q. (w/ MUL. + w/ LSTM) is replaced with a simple concatenation of the conditioning feature and the tuples of segments.

– S. + Q. (w/o MUL. + w/o LSTM): Remove the BiLSTM network from the experiment S. + Q. (w/o MUL. + w/ LSTM) right above to evaluate the effect when both the selective relation and sequential modeling are missing.

– V. + Q. (w/ MUL. + w/ LSTM): Only the visual stream is under consideration. We use the multiplicative form as in Eq. 3 for the conditioning sub-network instead of the simple concatenation operation in Eq. 2. We additionally use a BiLSTM network for sequential modeling in the same way as for the textual stream, since both visual and textual content are temporal sequences by nature.

– V. + Q. (w/ MUL. + w/o LSTM): Similar to the above experiment V. + Q. (w/ MUL. + w/ LSTM) but without the BiLSTM network.

– V. + Q. (w/o MUL. + w/ LSTM): The conditioning sub-network is simply a tensor concatenation operation. This is to compare against the one using the multiplicative form, V. + Q. (w/ MUL. + w/ LSTM).

– V. + Q. (w/o MUL. + w/o LSTM): Similar to V. + Q. (w/o MUL. + w/ LSTM) but without the BiLSTM network for sequential modeling.

– S. + V. + Q. (S. w/ MUL. + LSTM, V. w/ MUL.): Both streams are present for prediction. We combine the best option of each stream, S. + Q. (w/ MUL. + w/ LSTM) for the textual stream and V. + Q. (w/ MUL. + w/o LSTM) for the visual stream.

The results empirically show that the simple concatenation as in Eq. 2 is insufficient to combine the conditioning features and the output of the relation network on this dataset. Meanwhile, the multiplicative relation between the conditioning features and those relations is a better fit, as it greatly improves the model's performance, especially in the visual stream. This shows the consistency in empirical results between long-form and short-form Video QA, where the relation between visual content and the query is more multiplicative (selection) than simply additive.
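For illustration, the two conditioning options can be sketched as follows; this is a generic rendering of the idea rather than Eqs. 2 and 3 verbatim, and the layer shapes are assumptions.

```python
# Generic sketch of the two conditioning options: concatenation versus a
# multiplicative, selection-like interaction between the relation output and
# the conditioning feature.
import torch
import torch.nn as nn

class ConcatCondition(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, relation, condition):
        return self.proj(torch.cat([relation, condition], dim=-1))

class MultiplicativeCondition(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, relation, condition):
        # The element-wise product lets the condition act as a selector.
        return self.proj(relation * condition)
```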

On the other hand, sequential modeling with a BiLSTM is more beneficial for the textual stream than for the visual stream, even though both streams are sequential by nature. This aligns well with our analysis in Sec. 3.1.

4.4.3 Deepening model hierarchy saves time

We test the scalability of the HCRN on long videos in the MSRVTT-QA dataset, which are organized into 24 clips (3 times longer than in the other two datasets). We consider two settings:


Depth of hierarchy | Overall Acc. (Val.) | Overall Acc. (Test)
2-level, 24 clips → 1 vid | 35.4 | 35.5
3-level, 24 clips → 4 sub-vids → 1 vid | 35.1 | 35.4

Table 7 Results for a deeper hierarchy on the MSRVTT-QA dataset. Run time is reduced by a factor of 4 when going from the 2-level to the 3-level hierarchy.

– 2-level hierarchy, 24 clips→1 vid: The model is as illustrated in Fig. 5, where 24 clip-level CRNs are followed by a video-level CRN.

– 3-level hierarchy, 24 clips→4 sub-vids→1 vid: Starting from the 24 clips as in the 2-level hierarchy, we group the 24 clips into 4 sub-videos, each being a group of 6 consecutive clips, resulting in a 3-level hierarchy.

These two models are designed to have a similar number of parameters, approx. 44M.

The results are reported in Table 7. Unlike existing methods, which usually struggle to handle long videos, our method scales to them by offering a deeper hierarchy, as analyzed theoretically in Sec. 3.4. The theory suggests that using a deeper hierarchy can reduce the training and inference time of HCRN when the video is long. This is validated in our experiments, where we achieve a 4-fold reduction in training and inference time by going from the 2-level HCRN to its 3-level counterpart whilst maintaining competitive performance.

5 Discussion

HCRN presents a new class of neural architectures for multimodal Video QA, pursuing ease of model construction through reusable uniform building blocks. Different from temporal attention based approaches, which put effort into selecting objects, HCRN concentrates on modeling relations and hierarchy in the different input modalities of Video QA. In the scope of this work, we study how to deal with visual content and subtitles where applicable. The visual and text streams share a similar structure in terms of near-term and far-term relations and information hierarchy. The difference in methodology and design choices between ours and the existing works leads to distinctive benefits in different scenarios, as empirically demonstrated.

Within the scope of this paper we did not consider the audio channel or early fusion between modalities, leaving these open for future work. Since the CRN is generic, we envision an HCRN for the audio stream similar to the visual stream. The modalities can be combined at any hierarchical level, or at any step within the same level. For example, as the audio, subtitle and visual content are partly synchronized, we can use a sub-HCRN to represent the three streams per segment. Alternatively, we can fuse the modalities into the same CRN as long as the feature representations are projected onto the same tensor space. Future work will also include object-oriented representations of videos, as these are native to CRNs thanks to their generality. As the CRN is a relational model, both cross-object relations and cross-time relations can be modeled. These are likely to improve the interpretability of the model and get closer to how humans reason across multiple modalities. CRN units can be further augmented with attention mechanisms for better object selection, so that related tasks such as Frame QA in the TGIF-QA dataset can be further improved.

5.1 Conclusion

We introduced a general-purpose neural unit called the Conditional Relation Network (CRN) and a method to construct hierarchical networks for multimodal Video QA using CRNs as building blocks. A CRN is a relational transformer that encapsulates and maps an array of tensorial objects into a new array of relations, all conditioned on a contextual feature. In the process, high-order relations among input objects are encoded and modulated by the conditioning feature. This design allows flexible construction of sophisticated structures such as stacks and hierarchies, and supports iterative reasoning, making it suitable for QA over multimodal and structured domains like video. The HCRN was evaluated on multiple Video QA datasets covering both short-form Video QA, where questions are mainly about activities/events happening in a short snippet (TGIF-QA, MSVD-QA, MSRVTT-QA), as well as long-form Video QA, where answers are located in either visual cues or textual cues such as movie subtitles (TVQA). HCRN demonstrates competitive reasoning capability in a wide range of settings against state-of-the-art methods. The examination of CRN in Video QA highlights the importance of building a generic neural reasoning unit that supports native multimodal interaction in improving the robustness of visual reasoning.

References

1. J. Gao, R. Ge, K. Chen, and R. Nevatia, “Motion-appearance co-memory networks for video question answering,” CVPR, 2018.

2. J. Lei, L. Yu, M. Bansal, and T. L. Berg, “TVQA: Localized, compositional video question answering,” Conference on Empirical Methods in Natural Language Processing, 2018.

3. X. Li, J. Song, L. Gao, X. Liu, W. Huang, X. He, and C. Gan, “Beyond RNNs: Positional self-attention with co-attention for video question answering,” in AAAI, 2019.

4. X. Song, Y. Shi, X. Chen, and Y. Han, “Explore multi-step reasoning in video question answering,” in 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 2018, pp. 239–247.

5. M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “MovieQA: Understanding stories in movies through question-answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4631–4640.

6. B. Wang, Y. Xu, Y. Han, and R. Hong, “Movie question answering: Remembering the textual cues for layered visual contents,” AAAI'18, 2018.

7. S. Na, S. Lee, J. Kim, and G. Kim, “A read-write memory network for movie story understanding,” in International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017.

8. C. Fan, X. Zhang, S. Zhang, W. Wang, C. Zhang, and H. Huang, “Heterogeneous memory enhanced multimodal attention model for video question answering,” in CVPR, 2019, pp. 1999–2007.

9. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

10. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CVPR, 2016.

11. E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” in AAAI, 2018.

12. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

13. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” NAACL, 2019.

14. T. M. Le, V. Le, S. Venkatesh, and T. Tran, “Hierarchical conditional relation networks for video question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9972–9981.

15. Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, “TGIF-QA: Toward spatio-temporal reasoning in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2758–2766.

16. J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A large video description dataset for bridging video and language,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5288–5296.

17. K.-M. Kim, M.-O. Heo, S.-H. Choi, and B.-T. Zhang, “DeepStory: Video story QA by deep embedded memory networks,” International Joint Conferences on Artificial Intelligence, pp. 2016–2022, 2017.

18. J. Kim, M. Ma, K. Kim, S. Kim, and C. D. Yoo, “Progressive attention memory network for movie story question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8337–8346.

19. L. Zhu, Z. Xu, Y. Yang, and A. G. Hauptmann, “Uncovering the temporal context for video question answering,” International Journal of Computer Vision, vol. 124, no. 3, pp. 409–421, 2017.

20. Z. Zhao, Z. Zhang, S. Xiao, Z. Xiao, X. Yan, J. Yu, D. Cai, and F. Wu, “Long-form video question answering via dynamic hierarchical reinforced networks,” IEEE Transactions on Image Processing, vol. 28, no. 12, pp. 5939–5952, 2019.

21. K.-H. Zeng, T.-H. Chen, C.-Y. Chuang, Y.-H. Liao, J. C. Niebles, and M. Sun, “Leveraging video descriptions to learn video question answering,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.

22. A. Wang, A. T. Luu, C.-S. Foo, H. Zhu, Y. Tay, and V. Chandrasekhar, “Holistic multi-modal memory network for movie question answering,” IEEE Transactions on Image Processing, vol. 29, pp. 489–499, 2019.

23. D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.

24. Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5533–5541.

25. C.-Y. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krahenbuhl, and R. Girshick, “Long-term feature banks for detailed video understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 284–293.

26. Y. Tang, X. Zhang, L. Ma, J. Wang, S. Chen, and Y.-G. Jiang, “Non-local NetVLAD encoding for video classification,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

27. F. Li, C. Gan, X. Liu, Y. Bian, X. Long, Y. Li, Z. Li, J. Zhou, and S. Wen, “Temporal modeling approaches for large-scale YouTube-8M video understanding,” CVPR Workshop, 2017.

28. B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 803–818.

29. T. M. Le, V. Le, S. Venkatesh, and T. Tran, “Neural reasoning, fast and slow, for video question answering,” International Joint Conference on Neural Networks, 2020.

30. W. Jin, Z. Zhao, M. Gu, J. Yu, J. Xiao, and Y. Zhuang, “Multi-interaction network with object relation for video question answering,” in Proceedings of the 27th ACM International Conference on Multimedia. ACM, 2019, pp. 1193–1201.

31. G. Singh, L. Sigal, and J. J. Little, “Spatio-temporal relational reasoning for video question answering,” in BMVC, 2019.

32. J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” in Advances in Neural Information Processing Systems, 2018, pp. 1564–1574.

33. Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1821–1830.

34. R. Lienhart, “Abstracting home video automatically,” in Proceedings of the Seventh ACM International Conference on Multimedia (Part 2). ACM, 1999, pp. 37–40.

35. P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1029–1038.

36. L. Baraldi, C. Grana, and R. Cucchiara, “Hierarchical boundary-aware neural encoder for video captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1657–1666.

37. F. Mao, X. Wu, H. Xue, and R. Zhang, “Hierarchical video frame sequence representation with deep convolutional graph network,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

38. Y. Ye, Z. Zhao, Y. Li, L. Chen, J. Xiao, and Y. Zhuang, “Video question answering via attribute-augmented attention network learning,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017, pp. 829–832.

39. J. Liang, L. Jiang, L. Cao, L.-J. Li, and A. G. Hauptmann, “Focal visual-text attention for visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6135–6143.

40. Z. Zhao, X. Jiang, D. Cai, J. Xiao, X. He, and S. Pu, “Multi-turn video question answering via multi-stream hierarchical attention context network,” in IJCAI, 2018, pp. 3690–3696.

41. Z. Zhao, Q. Yang, D. Cai, X. He, and Y. Zhuang, “Video question answering via hierarchical spatio-temporal attention networks,” in IJCAI, 2017, pp. 3518–3524.

42. K.-M. Kim, S.-H. Choi, J.-H. Kim, and B.-T. Zhang, “Multimodal dual attention memory for video story question answering,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 673–688.

43. X. Li, L. Gao, X. Wang, W. Liu, X. Xu, H. T. Shen, and J. Song, “Learnable aggregating net with diversity learning for video question answering,” in Proceedings of the 27th ACM International Conference on Multimedia. ACM, 2019, pp. 1166–1174.

44. T. Yang, Z.-J. Zha, H. Xie, M. Wang, and H. Zhang, “Question-aware tube-switch network for video question answering,” in Proceedings of the 27th ACM International Conference on Multimedia. ACM, 2019, pp. 1184–1192.

45. D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang, “Video question answering via gradually refined attention over appearance and motion,” in Proceedings of the 25th ACM International Conference on Multimedia. ACM, 2017, pp. 1645–1653.

46. K. Do, T. Tran, and S. Venkatesh, “Learning deep matrix representations,” arXiv preprint arXiv:1703.01454, 2018.

47. Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, and H. Takemura, “BERT representations for video question answering,” in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1556–1565.

48. S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.

49. K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555.

50. J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.

51. D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” International Conference on Learning Representations, 2016.

52. D. Mahapatra, S. Winkler, and S.-C. Yen, “Motion saliency outweighs other low-level features while watching videos,” in Human Vision and Electronic Imaging XIII, vol. 6806. International Society for Optics and Photonics, 2008, p. 68060P.

53. J. Sang and C. Xu, “Character-based movie summarization,” in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 855–858.

54. Y. Yu, J. Kim, and G. Kim, “A joint sequence fusion model for video question answering and retrieval,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 471–487.

55. M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, “Bidirectional attention flow for machine comprehension,” ICLR, 2017.

56. M. I. H. Chowdhury, K. Nguyen, S. Sridharan, and C. Fookes, “Hierarchical relational attention for video question answering,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 599–603.