
Page 1:

Bilinear Attention Networks
2018 VQA Challenge runner-up (1st single model)

Biointelligence Lab., Seoul National University

Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang


Page 2:

Visual Question Answering

• Learning joint representation of multiple modalities

• Linguistic and visual information

Credit: visualqa.org

Page 3:

VQA 2.0 Dataset

• 204K MS COCO images

• 760K questions (10 answers for each question from unique AMT workers)

• train, val, test-dev (remote validation), test-standard (publications)

• and test-challenge (competitions), test-reserve (check overfitting)

• VQA Score is the average of the accuracies over all 10-choose-9 subsets of the 10 human answers →

Goyal et al., 2017

Split   Images   Questions   Answers
Train   80K      443K        4.4M
Val     40K      214K        2.1M
Test    80K      447K        unknown

# annotations   VQA Score
0               0.0
1               0.3
2               0.6
3               0.9
>3              1.0

https://github.com/GT-Vision-Lab/VQA/issues/1 https://github.com/hengyuan-hu/bottom-up-attention-vqa/pull/18
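A minimal sketch of how the 10-choose-9 averaging reproduces the table above; the function and variable names are illustrative, and the official evaluation additionally normalizes answer strings before matching.

```python
from itertools import combinations

def vqa_score(predicted, human_answers):
    """Average accuracy over all 10-choose-9 subsets of the 10 human answers.

    For each 9-answer subset, accuracy = min(#matches / 3, 1); the mean over
    the 10 subsets reproduces the table above (0 matches -> 0.0, 1 -> 0.3,
    2 -> 0.6, 3 -> 0.9, >3 -> 1.0).
    """
    scores = []
    for subset in combinations(human_answers, 9):
        matches = sum(a == predicted for a in subset)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# Example: 2 of the 10 annotators gave the predicted answer.
print(vqa_score("cat", ["cat"] * 2 + ["dog"] * 8))  # 0.6
```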

Page 4:

Objective

• Introducing bilinear attention

- Interactions between words and visual concepts are meaningful

- Proposing an efficient method (with the same time complexity) on top of low-rank bilinear pooling

• Residual learning of attention

- Residual learning with attention mechanism for incremental inference

• Integration of counting module (Zhang et al., 2018)

Page 5:

Preliminary

• Question embedding (fine-tuning)

- Use all outputs of the GRU (every time step)

- X ∈ R^{N×ρ}, where N is the hidden dim. and ρ = |{x_i}| is the number of tokens

• Image embedding (fixed bottom-up-attention)

- Select 10-100 detected objects (rectangles) using a pre-trained Faster R-CNN to extract rich features for each object (1600 classes, 400 attributes)

- Y ∈ R^{M×φ}, where M is the feature dim. and φ = |{y_j}| is the number of objects

https://github.com/peteanderson80/bottom-up-attention
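A minimal sketch of the two input tensors described above, assuming illustrative sizes (N, M, ρ, φ) and random placeholders for the GloVe vectors and the fixed Faster R-CNN features:

```python
import torch
import torch.nn as nn

N, M = 1024, 2048      # question hidden dim and object feature dim (illustrative)
rho, phi = 14, 36      # number of question tokens and detected objects (illustrative)

word_emb = torch.randn(1, rho, 300)         # stand-in for GloVe embeddings of the tokens
gru = nn.GRU(input_size=300, hidden_size=N, batch_first=True)
out, _ = gru(word_emb)                      # all hidden states, one per time step
X = out.squeeze(0).t()                      # X in R^{N x rho}

Y = torch.randn(M, phi)                     # stand-in for fixed bottom-up-attention features, R^{M x phi}
```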

Page 6:

Low-rank Bilinear Pooling

• Bilinear model and its approximation (Wolf et al., 2007, Pirsiavash et al., 2009)

• Low-rank bilinear pooling (Kim et al., 2017): for a vector output, instead of using three-dimensional tensors U and V, replace the vector of ones with a pooling matrix P (so only three two-dimensional tensors are needed)

[Figure 1 diagram: Step 1 (Bilinear Attention Maps) combines X^T U, the projection vector p, and V^T Y into a ρ×φ attention map; Step 2 (Bilinear Attention Networks) applies residual learning over the GRU hidden states and object-detection features for the question "What is the mustache made of ?", producing att_1 and att_2, followed by an MLP classifier.]

• After getting bilinear attention maps, we can stack multiple BANs.

Figure 1: Overview of a two-layer BAN. Two multi-channel inputs, φ object detection features and ρ-length GRU hidden vectors, are used to get bilinear attention maps and joint representations to be used by a classifier. For the definition of the BAN, see the text in Section 3.

Unlike previous works [6, 15], where multiple attention maps are used by concatenating the attended features, the proposed residual learning method for BAN exploits residual summations instead of concatenation, which lets it learn up to an eight-glimpse BAN parameter-efficiently and effectively. For an overview of the two-glimpse BAN, please refer to Figure 1.

Our main contributions are:
• We propose the bilinear attention networks (BAN) to learn and use bilinear attention distributions, on top of the low-rank bilinear pooling technique.
• We propose a variant of multimodal residual networks (MRN) to efficiently utilize the multiple bilinear attention maps generated by our model. Unlike previous works, our method successfully utilizes up to 8 attention maps.
• Finally, we validate our proposed method on a large and highly competitive dataset, VQA 2.0 [8]. Our model achieves a new state of the art while maintaining a simple model structure. Moreover, we evaluate the visual grounding of the bilinear attention map on Flickr30k Entities [23], outperforming previous methods, along with a 25.37% improvement of inference speed by taking advantage of multi-channel input processing.

2 Low-rank bilinear pooling

We first review low-rank bilinear pooling and its application to attention networks [15], which uses a single-channel input (question vector) to combine the other multi-channel input (image features) into a single-channel intermediate representation (attended feature).

Low-rank bilinear model. Pirsiavash et al. [22] proposed a low-rank bilinear model to reduce the rank of the bilinear weight matrix W_i to give regularity. For this, W_i is replaced with the product of two smaller matrices U_i V_i^T, where U_i ∈ R^{N×d} and V_i ∈ R^{M×d}. As a result, the rank of W_i is at most d ≤ min(N, M). For the scalar output f_i (bias terms are omitted without loss of generality):

    f_i = x^T W_i y = x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y)    (1)

where 1 ∈ R^d is a vector of ones and ∘ denotes the Hadamard product (element-wise multiplication).

Low-rank bilinear pooling. For a vector output f, a pooling matrix P is introduced:

    f = P^T (U^T x ∘ V^T y)    (2)

where P ∈ R^{d×c}, U ∈ R^{N×d}, and V ∈ R^{M×d}. Introducing P for a vector output f ∈ R^c allows U and V to be two-dimensional tensors, significantly reducing the number of parameters.

Unitary attention networks. Attention provides an efficient mechanism to reduce the input channels by selectively utilizing the given information. Assuming a multi-channel input Y consisting of φ = |{y_i}| column vectors, we want to get a single channel ŷ from Y using the weights {α_i}:

    ŷ = Σ_i α_i y_i    (3)
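A minimal PyTorch sketch of Eq. (2), with arbitrary sizes; the actual MLB/BAN implementations also add nonlinearities and dropout, which are omitted here:

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """Eq. (2): f = P^T (U^T x o V^T y), with U: N x d, V: M x d, P: d x c."""
    def __init__(self, N, M, d, c):
        super().__init__()
        self.U = nn.Linear(N, d, bias=False)   # computes U^T x
        self.V = nn.Linear(M, d, bias=False)   # computes V^T y
        self.P = nn.Linear(d, c, bias=False)   # computes P^T (.)

    def forward(self, x, y):
        # The Hadamard product of the two d-dim projections replaces a full
        # N x M bilinear weight matrix per output unit.
        return self.P(self.U(x) * self.V(y))

f = LowRankBilinearPooling(N=1024, M=2048, d=512, c=256)(
    torch.randn(1024), torch.randn(2048))      # f in R^256
```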


Page 7:

Unitary Attention

• This pooling is used to get attention weights with a question embedding (single-channel) vector and visual feature vectors (multi-channel) as the two inputs.

• We call it unitary attention since the question embedding vector queries the feature vectors unidirectionally.

[Diagram: unitary attention network. Q and V go through Linear + Tanh embeddings; the question embedding is replicated across the visual feature columns, combined, and passed through further Linear/Conv + Tanh layers and a Softmax to produce the attention weights A.]

Kim et al., 2017
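A minimal sketch of unitary attention under the low-rank bilinear formulation shown in the diagram; layer sizes, the Tanh placement, and the single-glimpse setting are simplifying assumptions:

```python
import torch
import torch.nn as nn

class UnitaryAttention(nn.Module):
    """A single question vector q queries the phi visual feature columns of Y
    and pools them into one attended vector (Eq. (3))."""
    def __init__(self, N, M, d):
        super().__init__()
        self.U = nn.Linear(N, d, bias=False)
        self.V = nn.Linear(M, d, bias=False)
        self.p = nn.Linear(d, 1, bias=False)   # projection to a scalar logit per object

    def forward(self, q, Y):
        # q: (N,), Y: (M, phi)
        logits = self.p(torch.tanh(self.U(q)) * torch.tanh(self.V(Y.t())))  # (phi, 1)
        alpha = torch.softmax(logits.squeeze(-1), dim=0)                    # attention weights
        return Y @ alpha                                                    # attended feature, (M,)
```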

Page 8:

Bilinear Attention Maps

• U and V are linear embeddings

• p is a learnable projection vector


Using the Hadamard product and matrix multiplication, the bilinear attention map A ∈ R^{ρ×φ} is defined as:

    A := softmax( ((1 · p^T) ∘ X^T U) V^T Y )    (15)

where 1 ∈ R^ρ is a vector of ones and p ∈ R^K is the learnable projection vector; the softmax is applied over all ρ×φ logits. For each logit before the softmax:

    A_{i,j} = p^T ( (U^T X_i) ∘ (V^T Y_j) )    (16)

where X_i and Y_j denote the i-th column (word) of X and the j-th column (object) of Y, respectively.

[Diagram: shapes in the attention-map computation. X^T (ρ×N) times U (N×K) gives a ρ×K matrix, which is element-wise multiplied with the replicated p and multiplied by V^T Y ((K×M)·(M×φ) = K×φ), yielding the ρ×φ attention map.]
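A minimal sketch of Eqs. (15)-(16); sizes are illustrative, and the nonlinearities and dropout of the released implementation are omitted:

```python
import torch
import torch.nn as nn

class BilinearAttentionMap(nn.Module):
    """A = softmax(((1 . p^T) o (X^T U)) (V^T Y)), normalized over all
    rho x phi word-object pairs (cf. the time-complexity slide)."""
    def __init__(self, N, M, K):
        super().__init__()
        self.U = nn.Linear(N, K, bias=False)
        self.V = nn.Linear(M, K, bias=False)
        self.p = nn.Parameter(torch.ones(K))   # learnable projection vector

    def forward(self, X, Y):
        # X: (N, rho), Y: (M, phi)
        XU = self.U(X.t())                     # (rho, K) = X^T U
        VY = self.V(Y.t()).t()                 # (K, phi) = V^T Y
        logits = (XU * self.p) @ VY            # broadcasting p replaces (1 . p^T)
        A = torch.softmax(logits.flatten(), dim=0).view_as(logits)
        return A                               # (rho, phi)
```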

Page 9:

Bilinear Attention Maps

• Exactly the same approach as low-rank bilinear pooling

Each attention logit is a low-rank bilinear pooling of one word feature X_i and one object feature Y_j (cf. Eq. 2 and Eq. 16):

    A_{i,j} = p^T ( (U^T X_i) ∘ (V^T Y_j) )

[Diagram: the ρ×K matrix X^T U, element-wise multiplied with the broadcast p, times the K×φ matrix V^T Y, equals the ρ×φ logit matrix.]

Page 10:

Bilinear Attention Maps

• Multiple bilinear attention maps are acquired by different projection vectors p_g as:

    A_g := softmax( ((1 · p_g^T) ∘ X^T U) V^T Y )    (17)

where U and V are shared across the attention maps and the subscript g denotes the index of the glimpse.

[Diagram: (ρ×K) · (K×φ) = ρ×φ, one attention map per glimpse.]
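A short sketch of Eq. (17): the glimpses share U and V, so all glimpse logits can be computed in one batched product; sizes are illustrative and X^T U, V^T Y stand in for the embedded inputs:

```python
import torch

G, K, rho, phi = 4, 512, 14, 36              # illustrative sizes
XU = torch.randn(rho, K)                     # X^T U  (shared across glimpses)
VY = torch.randn(K, phi)                     # V^T Y  (shared across glimpses)
P = torch.randn(G, K)                        # one projection vector p_g per glimpse

logits = (XU.unsqueeze(0) * P.unsqueeze(1)) @ VY                # (G, rho, phi)
A = torch.softmax(logits.view(G, -1), dim=1).view(G, rho, phi)  # one map per glimpse
```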

Page 11:

Bilinear Attention Networks

• Each element of the multimodal joint feature is computed with the following equation (k is the index over K; broadcasting in PyTorch lets you avoid a for-loop here, see the einsum sketch below):

    f'_k = (X^T U')_k^T A (Y^T V')_k    (10)

where (X^T U')_k ∈ R^ρ and (Y^T V')_k ∈ R^φ are the k-th columns of the embedded inputs, and f'_k denotes the k-th element of the intermediate representation.

[Shapes: for each k, (1×ρ) · (ρ×φ) · (φ×1) gives a scalar; the k-th columns of X^T U' and Y^T V' are selected on each side.]

※ Broadcasting: tensor operations are automatically repeated at the API level; it is supported by NumPy, TensorFlow, and PyTorch.
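A short sketch of Eq. (10) for all k at once; einsum (or equivalent broadcasting) removes the explicit loop over k, as the slide notes. The tensors are random stand-ins for X^T U', the attention map A, and Y^T V':

```python
import torch

K, rho, phi = 512, 14, 36
XU = torch.randn(rho, K)                     # X^T U'
YV = torch.randn(phi, K)                     # Y^T V'
A = torch.softmax(torch.randn(rho, phi).flatten(), 0).view(rho, phi)

f_prime = torch.einsum('ik,ij,jk->k', XU, A, YV)     # f'_k = (X^T U')_k^T A (Y^T V')_k

# Equivalent explicit loop over k, for reference:
f_loop = torch.stack([XU[:, k] @ A @ YV[:, k] for k in range(K)])
assert torch.allclose(f_prime, f_loop, atol=1e-4)
```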

Page 12:

Bilinear Attention Networks

• One can show that this is equivalent to a bilinear attention model where each feature is pooled by low-rank bilinear approximation

Equation (10) can be rewritten as:

    f'_k = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} (X_i^T U'_k)(V'_k^T Y_j)    (11)
         = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} X_i^T (U'_k V'_k^T) Y_j    (12)

where X_i and Y_j denote the i-th and j-th columns of X and Y, U'_k and V'_k denote the k-th columns of U' and V', and A_{i,j} is the element in the i-th row and j-th column of A. For each pair of channels, the rank-1 bilinear representation of the two feature vectors is modeled by X_i^T (U'_k V'_k^T) Y_j in Eq. (12), eventually rank-K bilinear pooling for f' ∈ R^K: low-rank bilinear pooling.

Page 13:

Bilinear Attention Networks

• One can show that this is equivalent to a bilinear attention model where each feature is pooled by low-rank bilinear approximation

• Low-rank bilinear feature learning inside bilinear attention

The joint representation f ∈ R^C is then defined as:

    f = P'^T f'    (13)

where P' ∈ R^{K×C}. For convenience, we define the bilinear attention network as a function of the two multi-channel inputs:

    f = BAN(X, Y)    (14)

since the bilinear attention map is itself the output of a function of the same two multi-channel inputs (low-rank bilinear pooling inside the attention, Eq. 16).

Page 14:

Bilinear Attention Networks

• One can show that this is equivalent to a bilinear attention model where each feature is pooled by low-rank bilinear approximation

• Low-rank bilinear feature learning inside bilinear attention

• Similarly to MLB (Kim et al., ICLR 2017), activation functions can be applied

Page 15:

Bilinear Attention Networks

[Diagram: BAN architecture. X and Y are embedded with Linear + ReLU layers (with transposes as needed) and combined into the Softmax attention map; the attended features pass through further Linear + ReLU layers and a final Linear classifier.]

Page 16:

Time Complexity

• Assuming that M ≥ N > K > φ ≥ ρ, the time complexity of bilinear attention networks is O(KMφ) where K denotes hidden size, since BAN consists of matrix chain multiplication

• Empirically, BAN takes 284s/epoch while unitary attention control takes 190s/epoch

• Largely due to the increased input size of the softmax for the attention distribution, from φ to φ × ρ

Page 17:

Residual Learning of Attention

• Residual learning exploits multiple attention maps (f_0 is X and {α_i} is fixed to ones):

    f_{i+1} = BAN_i(f_i, Y; A_i) · 1^T + α_i f_i    (18)

where f_0 = X, 1 ∈ R^ρ, and α_i is a learnable scalar initialized to one. In this way, the size of f_i keeps matching the size of X as successive attention maps are processed. To get the logits for the classifier, we sum over the channel dimension of f_G, where G is the number of glimpses.

[Diagram annotations: BAN_i(f_i, Y; A_i) is the bilinear attention network with bilinear attention map A_i; the "· 1^T" term repeats the joint feature {# of tokens} times; "+ α_i f_i" is the shortcut (residual) connection.]
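A minimal sketch of the residual stacking in Eq. (18), with α_i fixed to one as on the slide and K = N assumed so the repeated joint feature matches the shape of X (see the Page 18 diagram, "repeat 1→ρ"); `ban_layers[i]` is a hypothetical callable returning the K-dim joint feature for attention map A_i:

```python
import torch

def ban_residual_stack(X, Y, ban_layers, att_maps):
    """f_{i+1} = BAN_i(f_i, Y; A_i) . 1^T + f_i, with f_0 = X."""
    f = X                                            # (N, rho)
    rho = X.shape[1]
    for ban_i, A_i in zip(ban_layers, att_maps):
        joint = ban_i(f, Y, A_i)                     # (K,) joint feature, K == N assumed
        f = joint.unsqueeze(1).expand(-1, rho) + f   # repeat 1 -> rho, then shortcut
    return f.sum(dim=1)                              # sum over the channel dim for the classifier
```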

Page 18:

[Diagram: expanded overview. Step 1 (Bilinear Attention Maps): X^T U, the projection vector p, and V^T Y produce the ρ×φ attention maps att_1 and att_2 via Softmax. Step 2 (Bilinear Attention Networks): U'^T X and Y^T V' (with K = N) are pooled by the attention map; the resulting K-dim joint feature is repeated 1→ρ times and added to X (residual learning), giving X'; att_2 is applied to X' in the same way, followed by sum pooling and the MLP classifier. The inputs are all GRU hidden states and object-detection features for the question "What is the mustache made of ?". After getting bilinear attention maps, we can stack multiple BANs.]

Page 19:

Multiple Attention Maps

Model                            VQA 2.0 validation score   Δ (vs. row above)
Bottom-Up (Teney et al., 2017)   63.37
BAN-1                            65.36                      +1.99
BAN-2                            65.61                      +0.25
BAN-4                            65.81                      +0.20
BAN-8                            66.00                      +0.19
BAN-12                           66.04                      +0.04

• Single-model results on the validation split

Page 20:

Residual Learning

Model              VQA 2.0 validation score   Δ (vs. row above)
BAN-4 (Residual)   65.81 ±0.09
BAN-4 (Sum)        64.78 ±0.08                -1.03
BAN-4 (Concat)     64.71 ±0.21                -0.07

Sum:    Σ_i BAN_i(X, Y; A_i)
Concat: concatenation of BAN_i(X, Y; A_i) over i

Page 21:

Comparison with Co-attention

Model                VQA 2.0 validation score   Δ (vs. row above)
Bilinear Attention   65.36 ±0.14
Co-Attention         64.79 ±0.06                -0.57
Attention            64.59 ±0.04                -0.20

※ The number of parameters is controlled (all comparison models have 32M parameters). The Bilinear Attention row is BAN-1.

Page 22:

Comparison with Co-attention

※ The number of parameters is controlled (all comparison models have 32M parameters).

Training curves / Validation curves

[Figure] (a) Entropy of each attention map (Att_1–Att_4) over training epochs. (b) Validation score versus the number of used glimpses, for 2-, 4-, and 8-glimpse models and for BAN-1/2/4/8/12. (c) Training and validation curves (VQA score) over epochs for bilinear, co-, and unitary attention. (d) Validation score versus the number of parameters (0M–60M) for Uni-Att, Co-Att, BAN-1, BAN-4, and BAN-1+MFB.


Page 23

Visualization

Page 24

Integration of Counting module with BAN

• The counting module (Zhang et al., 2018) takes an attention (multinoulli) distribution and the detected boxes as input

• Our method: a maxout reduces the bilinear attention distribution to a unitary attention distribution, which makes the counter easy to adapt in a multitask setting

[Diagram] The ρ×φ bilinear attention logits are reduced by a maxout over the question dimension to a φ×1 unitary distribution. A softmax over the logits drives the bilinear attention, while the maxout logits pass through a sigmoid inside the counting module. The i-th output of the counter is linearly embedded and added via residual learning of attention across the inference steps of multi-glimpse attention.
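
A minimal sketch of the maxout reduction described above. The Counter itself is the module of Zhang et al. (2018) and is treated as a black box; the function name, the top-10 selection, and the call signature are assumptions based on the appendix below.

```python
def counting_logits(A_logits, boxes, counter, n_objects=10):
    """Reduce rho x phi bilinear attention logits to a unitary distribution
    over objects and feed it to the counting module.

    A_logits: (rho, phi) pre-softmax bilinear attention logits.
    boxes:    (4, phi) left-top / right-bottom coordinates of detected objects.
    counter:  counting module of Zhang et al. (2018), treated as a black box.
    """
    alpha = A_logits.max(dim=0).values         # maxout over question tokens -> (phi,)
    top = alpha.topk(n_objects).indices        # the counter expects a fixed number of boxes
    # The sigmoid over these logits is applied inside the counting module.
    return counter(boxes[:, top], alpha[top])  # -> (n_objects + 1,)
```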

Bilinear Attention Networks — Appendix

A Variants of BAN

A.1 Enhancing GloVe word embedding

We augment a computed 300-dimensional word embedding to each 300-dimensional GloVe word embedding. The computation is as follows: 1) we choose two arbitrary words wi and wj from each question found in the VQA and Visual Genome datasets, or from each caption in the MS COCO dataset; 2) we increase the value of Ai,j by one, where A ∈ RV′×V′ is an association matrix initialized with zeros. Notice that i and j can be indices outside the vocabulary V, and the size of the vocabulary in this computation is denoted by V′; 3) to penalize highly frequent words, each row of A is divided by the number of sentences (questions or captions) that contain the corresponding word; 4) each row is normalized by the sum of all its elements; 5) we calculate W′ = A · W, where W ∈ RV′×E is the GloVe word-embedding matrix and E is the size of the word embedding, i.e., 300. Therefore, W′ ∈ RV′×E stands for the mixed word embeddings of semantically close words; 6) finally, we select the V rows of W′ corresponding to the vocabulary in our model and augment these rows to the previous word embeddings, which makes 600-dimensional word embeddings in total. The input size of the GRU is increased to 600 to match these word embeddings. These word embeddings are fine-tuned.
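
A compact NumPy sketch of steps 1 to 5, assuming sentences are already tokenized and indexed into V′. The dense V′×V′ matrix and the simple double loop over word pairs are simplifications (the original procedure may differ in details such as ordered versus unordered pairs; a sparse matrix would be used in practice).

```python
import numpy as np


def mixed_glove_embeddings(sentences, W):
    """Build the association matrix A over the extended vocabulary V' and
    return W' = A · W, the mixed embeddings of semantically close words.

    sentences: iterable of token-id lists (questions and captions indexed into V').
    W:         (|V'|, 300) GloVe embedding matrix.
    """
    V = W.shape[0]
    A = np.zeros((V, V), dtype=np.float64)           # step 2: association counts
    n_sentences_with = np.zeros(V, dtype=np.float64)
    for tokens in sentences:
        for i in set(tokens):
            n_sentences_with[i] += 1.0
        for i in tokens:                             # step 1: word pairs within a sentence
            for j in tokens:
                if i != j:
                    A[i, j] += 1.0
    A /= np.maximum(n_sentences_with[:, None], 1.0)        # step 3: penalize frequent words
    A /= np.maximum(A.sum(axis=1, keepdims=True), 1e-12)   # step 4: row-normalize
    return A @ W                                     # step 5: W' = A · W


# Step 6 happens in the model: the V rows of W' matching the model vocabulary
# are concatenated to the original 300-d GloVe rows, giving 600-d GRU inputs.
```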

As a result, this variant significantly improves the performance to 66.03 (±0.12), compared with 65.72 (±0.11) obtained by augmenting the same 300-dimensional GloVe word embeddings again (so the number of parameters is controlled). In this experiment, we use a four-glimpse BAN and evaluate on the validation split. The standard deviation is calculated over three randomly initialized models and the means are reported. The result on the test-dev split can be found in Table 3 as BAN+Glove.

A.2 Integrating counting module

The counting module [41] is proposed to improve performance on counting tasks. This module is a neural network component that produces a dense representation from the spatial information of the detected objects, i.e., the left-top and right-bottom positions of the φ proposed objects (rectangles), denoted by S ∈ R4×φ. The interface of the counting module is defined as:

c = Counter(S, α)    (12)

where c ∈ Rφ+1 and α ∈ Rφ is the vector of logits of the corresponding objects, passed to the sigmoid function inside the counting module. We found that defining α by maxj=1,...,φ(A·,j), i.e., the maximum value in each column vector of A in Equation 9, was better than using the summation. Since the counting module does not support a variable number of objects, we select the top-10 objects, instead of all φ objects, based on the values of α.

The BAN integrated with the counting module is defined as:

fi+1 = (BANi(fi, Y; Ai) + gi(ci)) · 1ᵀ + fi    (13)

where the function gi(·) is a linear embedding followed by a ReLU activation. Note that adding a dropout layer before this linear embedding severely hurts performance.
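
Putting Equations 12 and 13 together, here is a sketch of one glimpse with the counter attached. The top-10 selection, the input size of gi, and the broadcast follow the description above, but module names and the Counter call signature are placeholders rather than the released implementation.

```python
class BANCounterStep(nn.Module):
    """f_{i+1} = (BAN_i(f_i, Y; A_i) + g_i(c_i)) · 1^T + f_i  (Equation 13 sketch)."""

    def __init__(self, k_dim, n_objects=10):
        super().__init__()
        self.n_objects = n_objects
        # g_i: linear embedding followed by ReLU; the appendix notes that a
        # dropout layer placed before this embedding severely hurts performance.
        self.g = nn.Sequential(nn.Linear(n_objects + 1, k_dim), nn.ReLU())

    def forward(self, f_i, ban_out, A_logits, boxes, counter):
        # f_i: (K, rho), ban_out: (K,) = BAN_i(f_i, Y; A_i),
        # A_logits: (rho, phi), boxes: (4, phi), counter: Zhang et al. (2018) module.
        alpha = A_logits.max(dim=0).values         # alpha_j = max over question tokens
        top = alpha.topk(self.n_objects).indices   # top-10 objects by alpha
        c = counter(boxes[:, top], alpha[top])     # c in R^{n_objects + 1} (Eq. 12)
        fused = ban_out + self.g(c)                # (K,)
        return fused.unsqueeze(1) + f_i            # broadcast over tokens (· 1^T), residual
```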

As a result, this variant significantly improves counting performance from 54.92 (±0.30) to 58.21 (±0.49), while overall performance improves from 65.81 (±0.09) to 66.01 (±0.14) in a controlled experiment using a vanilla four-glimpse BAN. The definition of the subset of counting questions comes from previous work [30]. The result on the test-dev split can be found in Table 3 as BAN+Glove+Counter; note that this model also applies the previous word-embedding variant.


Page 25

Comparison with the State of the Art

Test-dev, VQA 2.0 Score (+%)
Prior                              25.70
Language-Only                      44.22  (+18.52%)
MCB (ResNet)                       61.96  (+17.74%)
Bottom-Up (FRCNN)                  65.32  (+3.36%)
MFH (ResNet)                       65.80  (+0.48%)
MFH (FRCNN)                        68.76  (+2.96%)
BAN (Ours; FRCNN)                  69.52  (+0.76%)
BAN-Glove (Ours; FRCNN)            69.66  (+0.14%)
BAN-Glove-Counter (Ours; FRCNN)    70.04  (+0.38%)

• Single-model results on the test-dev split

(Slide annotations on the table above: image feature, attention model, counting feature; MCB was the 2016 winner, Bottom-Up the 2017 winner, and MFH the 2017 runner-up.)

Test-dev, Number questions
Zhang et al. (2018)  51.62
Ours                 54.04

Page 26

Flickr30k Entities

• Visual grounding task — mapping entity phrases to regions in an image

[/EN#40120/people A girl] in [/EN#40122/clothing a yellow tennis suit] , [/EN#40125/other green visor] and [/EN#40128/clothing white tennis shoes] holding [/EN#40124/other a tennis racket] in a position where she is going to hit [/EN#40121/other the tennis ball] .

0.48<0.5

Page 27

Flickr30k Entities

• Visual grounding task — mapping entity phrases to regions in an image

[/EN#38656/people A male conductor] wearing [/EN#38657/clothing all black] leading [/EN#38653/people a orchestra] and [/EN#38658/people choir] on [/EN#38659/scene a brown stage] playing and singing [/EN#38664/other a musical number] .

Page 28

Flickr30k Entities Recall@1,5,10

Method                  R@1    R@5    R@10
Zhang et al. (2016)     27.0   49.9   57.7
SCRC (2016)             27.8   -      62.9
DSPE (2016)             43.89  64.46  68.66
GroundeR (2016)         47.81  -      -
MCB (2016)              48.69  -      -
CCA (2017)              50.89  71.09  75.73
Yeh et al. (2017)       53.97  -      -
Hinami & Satoh (2017)   65.21  -      -
BAN (ours)              66.69  84.22  86.35

Page 29

Flickr30k Entities Recall@1,5,10

Type                    people  clothing  bodyparts  animals  vehicles  instruments  scene  other
#Instances              5,656   2,306     523        518      400       162          1,619  3,374
GroundeR (2016)         61.00   38.12     10.33      62.55    68.75     36.42        58.18  29.08
CCA (2017)              64.73   46.88     17.21      65.83    68.75     37.65        51.39  31.77
Yeh et al. (2017)       68.71   46.83     19.50      70.07    73.75     39.50        60.38  32.45
Hinami & Satoh (2017)   78.17   61.99     35.25      74.41    76.16     56.69        68.07  47.42
BAN (ours)              79.90   74.95     47.23      81.85    76.92     43.00        68.69  51.33

Page 30

Challenge Runner-Ups, Joint Runner-Up Team 1: SNU-BI

Challenge Accuracy: 71.69

Jin-Hwa Kim (Seoul National University), Jaehyun Jun (Seoul National University), Byoung-Tak Zhang (Seoul National University & Surromind Robotics)

Conclusions

• Bilinear attention networks gracefully extend unitary attention networks,

• using low-rank bilinear pooling inside the bilinear attention.

• Furthermore, residual learning of attention efficiently exploits multiple attention maps.

2018 VQA Challenge runners-up (shared 2nd place); 1st single model (70.35 on test-standard)

Page 31

Thank you! Any questions?

We would like to thank Kyoung-Woon On, Bohyung Han, and Hyeonwoo Noh for helpful comments and discussion. This work was supported by the 2017 Google Ph.D. Fellowship in Machine Learning, the Ph.D. Completion Scholarship from the College of Humanities, Seoul National University, and the Korea government (IITP-2017-0-01772-VTT, IITP-R0126-16-1072-SW.StarLab, 2018-0-00622-RMI, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF). Part of the computing resources used in this study was generously shared by Standigm Inc.

arXiv:1805.07932

Code & pretrained model in PyTorch: github.com/jnhwkim/ban-vqa