Bilinear Attention Networks (8 min) - Visual Question Answering
TRANSCRIPT
Bilinear Attention Networks
2018 VQA Challenge runner-up (1st single model)
Biointelligence Lab., Seoul National University
Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang
Visual Question Answering
• Learning joint representation of multiple modalities
• Linguistic and visual information
Credit: visualqa.org
VQA 2.0 Dataset
• 204K MS COCO images
• 760K questions (10 answers for each question from unique AMT workers)
• train, val, test-dev (remote validation), test-standard (publications)
• and test-challenge (competitions), test-reserve (check overfitting)
• VQA Score is the avg. of 10 choose 9 accuracies →
Goyal et al., 2017
Split   Images   Questions   Answers
Train   80K      443K        4.4M
Val     40K      214K        2.1M
Test    80K      447K        unknown

# annotations   VQA Score
0               0.0
1               0.3
2               0.6
3               0.9
>3              1.0
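The score rule above (the average of the ten leave-one-out accuracies, each capped at min(n/3, 1)) can be sketched as follows; `vqa_score` is a hypothetical helper name, not an official API:

```python
import numpy as np

def vqa_score(n_matching, n_answers=10):
    """VQA accuracy for an answer matching `n_matching` of the `n_answers`
    human annotations, averaged over all leave-one-out subsets of 9
    (the 'avg. of 10 choose 9 accuracies')."""
    scores = []
    for left_out in range(n_answers):
        # in each 9-answer subset, one annotation is left out; the match
        # count drops by one if the left-out annotation was a match
        n_sub = n_matching - (1 if left_out < n_matching else 0)
        scores.append(min(n_sub / 3.0, 1.0))
    return float(np.mean(scores))
```

Mentally checking against the table: with 3 matching annotations, 7 subsets still contain all 3 matches (score 1.0) and 3 subsets contain 2 (score 2/3), averaging to 0.9.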
https://github.com/GT-Vision-Lab/VQA/issues/1 https://github.com/hengyuan-hu/bottom-up-attention-vqa/pull/18
Objective
• Introducing bilinear attention
- Interactions between words and visual concepts are meaningful
- Proposing an efficient method (with the same time complexity) on top of low-rank bilinear pooling
• Residual learning of attention
- Residual learning with attention mechanism for incremental inference
• Integration of counting module (Zhang et al., 2018)
Preliminary
• Question embedding (fine-tuning)
- Use all outputs of the GRU (every time step)
- X ∈ RN×ρ where N is hidden dim., and ρ=|{xi}| is # of tokens
• Image embedding (fixed bottom-up-attention)
- Select 10-100 detected objects (rectangles) using a pre-trained Faster R-CNN to extract rich features for each object (1600 classes, 400 attributes)
- Y ∈ RM×φ where M is feature dim., and φ=|{yj}| is # of objects
https://github.com/peteanderson80/bottom-up-attention
Low-rank Bilinear Pooling
• Bilinear model and its approximation (Wolf et al., 2007, Pirsiavash et al., 2009)
• Low-rank bilinear pooling (Kim et al., 2017): for a vector output, instead of using three-dimensional tensors U and V, replace the vector of ones with a pooling matrix P (i.e., use three two-dimensional tensors).
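The pooling above, f = P^T (U^T x ∘ V^T y), can be sketched in NumPy with toy dimensions (all names and sizes here are illustrative, not from the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d, c = 8, 6, 4, 3          # toy dims: x ∈ R^N, y ∈ R^M, rank d, output c

U = rng.standard_normal((N, d))  # low-rank factor for x
V = rng.standard_normal((M, d))  # low-rank factor for y
P = rng.standard_normal((d, c))  # pooling matrix replacing the vector of ones
x = rng.standard_normal(N)
y = rng.standard_normal(M)

# low-rank bilinear pooling: f = P^T (U^T x ∘ V^T y)
f = P.T @ ((U.T @ x) * (V.T @ y))

# scalar case: replacing P with a vector of ones recovers x^T (U V^T) y
f_scalar = np.ones(d) @ ((U.T @ x) * (V.T @ y))
assert np.isclose(f_scalar, x @ (U @ V.T) @ y)
```

The final assertion checks the low-rank identity: summing the Hadamard product over the rank dimension is exactly the bilinear form with weight matrix U V^T.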
Overview
Step 1. Bilinear Attention Maps
[Diagram: a ρ×K embedded question and a K×φ embedded image are combined, via the projection vector p, into a ρ×φ bilinear attention map.]
Residual Learning
Step 2. Bilinear Attention Networks
[Diagram: all GRU hidden states and object-detection features feed two stacked attention steps (att_1, att_2) with residual connections, followed by an MLP classifier. Example question: "What is the mustache made of?"]
• After getting bilinear attention maps, we can stack multiple BANs.
Figure 1: Overview of a two-layer BAN. Two multi-channel inputs, φ object-detection features and ρ-length GRU hidden vectors, are used to get bilinear attention maps and joint representations to be used by a classifier. For the definition of the BAN, see the text in Section 3.
Unlike previous works [6, 15], where multiple attention maps are used by concatenating the attended features, the proposed residual learning method for BAN exploits residual summations instead of concatenation, which allows learning up to an eight-glimpse BAN in a parameter-efficient and performance-effective way. For an overview of the two-glimpse BAN, please refer to Figure 1.
Our main contributions are:
• We propose the bilinear attention networks (BAN) to learn and use bilinear attention distributions, on top of the low-rank bilinear pooling technique.
• We propose a variant of multimodal residual networks (MRN) to efficiently utilize the multiple bilinear attention maps generated by our model. Unlike previous works, our method successfully utilizes up to 8 attention maps.
• Finally, we validate our proposed method on a large and highly competitive dataset, VQA 2.0 [8]. Our model achieves a new state of the art while maintaining a simple model structure. Moreover, we evaluate the visual grounding of the bilinear attention map on Flickr30k Entities [23], outperforming previous methods, along with a 25.37% improvement in inference speed by taking advantage of the processing of multi-channel inputs.
2 Low-rank bilinear pooling
We first review low-rank bilinear pooling and its application to attention networks [15], which uses a single-channel input (question vector) to combine the other multi-channel input (image features) into a single-channel intermediate representation (attended feature).
Low-rank bilinear model. Pirsiavash et al. [22] proposed a low-rank bilinear model to reduce the rank of the bilinear weight matrix W_i to give regularity. For this, W_i is replaced with the multiplication of two smaller matrices U_i V_i^T, where U_i ∈ R^{N×d} and V_i ∈ R^{M×d}. As a result, this replacement makes the rank of W_i at most d ≤ min(N, M). For the scalar output f_i (bias terms are omitted without loss of generality):

f_i = x^T W_i y = x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y)    (1)

where 1 ∈ R^d is a vector of ones and ∘ denotes the Hadamard product (element-wise multiplication).
Low-rank bilinear pooling. For a vector output f, a pooling matrix P is introduced:

f = P^T (U^T x ∘ V^T y)    (2)

where P ∈ R^{d×c}, U ∈ R^{N×d}, and V ∈ R^{M×d}. Introducing P for a vector output f ∈ R^c allows U and V to be two-dimensional tensors, significantly reducing the number of parameters.
Unitary attention networks. Attention provides an efficient mechanism to reduce the input channels by selectively utilizing given information. Assuming a multi-channel input Y consisting of φ = |{y_i}| column vectors, we want to get a single channel ŷ from Y using the weights {α_i}:

ŷ = Σ_i α_i y_i    (3)
Unitary Attention
• This pooling is used to get attention weights, with a question embedding vector (single-channel) and visual feature vectors (multi-channel) as the two inputs.
• We call it unitary attention since the question embedding vector queries the feature vectors unidirectionally.
[Diagram: unitary attention network — the question embedding Q is replicated and combined with visual features V through linear/conv layers with tanh nonlinearities; a softmax produces the attention weights A, which pool V before a final linear layer.]
Kim et al., 2017
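A minimal NumPy sketch of unitary attention, loosely following the diagram above; the exact layer arrangement (tanh nonlinearities, a single projection vector p, toy dimensions) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d, phi = 8, 6, 4, 7            # toy dims: question dim, visual dim, rank, #objects

q = rng.standard_normal(N)           # single-channel question embedding
Y = rng.standard_normal((M, phi))    # multi-channel visual features (columns)
U = rng.standard_normal((N, d))
V = rng.standard_normal((M, d))
p = rng.standard_normal(d)

# low-rank pooling of q against every y_j, then softmax over the objects
logits = np.array([p @ (np.tanh(U.T @ q) * np.tanh(V.T @ Y[:, j]))
                   for j in range(phi)])
alpha = np.exp(logits) / np.exp(logits).sum()

y_att = Y @ alpha                    # attended single-channel feature (Eq. 3)
```

The replicated question is compared against each visual column, so attention flows in one direction only, matching the "unitary" naming.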
Bilinear Attention Maps
• U and V are linear embeddings
• p is a learnable projection vector
Bilinear Attention Networks
To reduce both input channels simultaneously, we introduce the attention map A ∈ R^{|{x_i}|×|{y_j}|} as follows:

f'_k = (X^T U')_k^T A (Y^T V')_k    (10)

where (X^T U')_k ∈ R^{|{x_i}|}, (Y^T V')_k ∈ R^{|{y_j}|}, and f'_k denotes the k-th element of the intermediate representation. The subscript k for matrices indicates the index of the column. Notice that Equation 10 is similar to Equation 1, since this is the bilinear model for the two groups of input channels with A as its weight matrix. Equation 10 can be rewritten as:
f'_k = Σ_{i=1}^{|{x_i}|} Σ_{j=1}^{|{y_j}|} A_{i,j} (X_i^T U'_k)(V'_k^T Y_j)    (11)
     = Σ_{i=1}^{|{x_i}|} Σ_{j=1}^{|{y_j}|} A_{i,j} X_i^T (U'_k V'_k^T) Y_j    (12)
where X_i and Y_j denote the i-th channel (column) of input X and the j-th channel (column) of input Y, respectively, U'_k and V'_k denote the k-th columns of the U' and V' matrices, respectively, and A_{i,j} denotes the element in the i-th row and j-th column of A. The V'_k^T Y_j in Equation 11 is transposed because it is a scalar. Notice that, for each pair of channels, the rank-1 bilinear representation of two feature vectors is modeled in X_i^T (U'_k V'_k^T) Y_j of Equation 12 (eventually rank-K bilinear pooling for f' ∈ R^K).
Then the joint representation f ∈ R^C is defined as:

f = P'^T f'    (13)

where P' ∈ R^{K×C}. For convenience, we define the bilinear attention networks as a function of two multi-channel inputs:

f = BAN(X, Y)    (14)

since the bilinear attention map is also the output of a function of the two multi-channel inputs, as we will see in the following section.
3.1. Bilinear Attention Map
Now, we want to get the attention map similarly to Equation 7. Using the Hadamard product and matrix-matrix multiplication, the attention map A is defined as:

A := softmax( ((1 · p^T) ∘ X^T U) V^T Y )    (15)

where 1 ∈ R^{|{x_i}|}, p ∈ R^d, and remember that A ∈ R^{|{x_i}|×|{y_j}|}. The softmax function is applied to the matrix A element-wise. Each logit A_{i,j} before the softmax is:

A_{i,j} = p^T ( (U^T X_i) ∘ (V^T Y_j) ).    (16)
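With toy dimensions (all illustrative), the matrix form of Equation 15 can be checked against the per-logit form of Equation 16 in NumPy; normalizing over all ρ×φ entries is one reading of the element-wise softmax, an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d, rho, phi = 8, 6, 4, 5, 7    # toy dims (ρ question tokens, φ objects)

X = rng.standard_normal((N, rho))    # question embeddings (columns)
Y = rng.standard_normal((M, phi))    # visual features (columns)
U = rng.standard_normal((N, d))
V = rng.standard_normal((M, d))
p = rng.standard_normal(d)

# logits = ((1·p^T) ∘ X^T U) V^T Y, a ρ×φ matrix
logits = (np.outer(np.ones(rho), p) * (X.T @ U)) @ (V.T @ Y)
A = np.exp(logits) / np.exp(logits).sum()     # softmax over all ρ×φ entries

# per-logit sanity check: logits[i, j] == p^T((U^T X_i) ∘ (V^T Y_j))
i, j = 2, 3
assert np.isclose(logits[i, j], p @ ((U.T @ X[:, i]) * (V.T @ Y[:, j])))
```

Row i of `(1·p^T) ∘ X^T U` is p ∘ (U^T X_i), so its product with column j of V^T Y contracts the rank dimension, giving exactly the logit in the per-element equation.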
Multiple bilinear attention maps can be obtained by extension as follows:

A_g := softmax( ((1 · p_g^T) ∘ X^T U) V^T Y )    (17)

where U and V are shared across the attention maps. The subscript g denotes the index of glimpses.
3.2. Residual Learning of Attention
Inspired by multimodal residual networks (MRN) from Kim et al. (2016), we propose a variation of MRN to integrate the joint representations from the multiple bilinear attention maps. The (i+1)-th output is defined as:

f_{i+1} = BAN_i(f_i, Y; A_i) · 1^T + α_i f_i    (18)

where f_0 = X, 1 ∈ R^{|{x_i}|}, and α_i is a scalar parameter to learn, initialized to one. In this way, the size of f_i keeps matching the size of X as successive attention maps are processed. To get the logits for a classifier, we sum over the channel dimension of f_G, where G is the number of glimpses.
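The residual stacking above can be sketched in NumPy, assuming N = K so that BAN_i(f_i, Y; A_i)·1^T matches the size of X; the attention maps here are random stand-ins for Equation 17, and all dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, rho, phi, G = 8, 6, 5, 7, 2    # toy dims; assume N = K so shapes match

X = rng.standard_normal((K, rho))    # question channels
Y = rng.standard_normal((M, phi))    # visual channels

def ban(F, Y, A, Up, Vp):
    """f'_k = (F^T U')_k^T A (Y^T V')_k for every k (Eq. 10), vectorized."""
    FU, YV = F.T @ Up, Y.T @ Vp                  # ρ×K and φ×K
    return np.einsum('ik,ij,jk->k', FU, A, YV)   # K-vector

f = X
alphas = np.ones(G)                              # α_g initialized to one
for g in range(G):
    Up = rng.standard_normal((K, K))             # per-glimpse U'
    Vp = rng.standard_normal((M, K))             # per-glimpse V'
    A = rng.dirichlet(np.ones(rho * phi)).reshape(rho, phi)  # stand-in map
    # Eq. 18: f_{g+1} = BAN_g(f_g, Y; A_g)·1^T + α_g f_g
    f = np.outer(ban(f, Y, A, Up, Vp), np.ones(rho)) + alphas[g] * f

logits_input = f.sum(axis=1)   # sum over the channel dimension for the classifier
```

The outer product with the vector of ones broadcasts the K-dimensional joint feature back to the K×ρ size of X, which is what lets glimpses chain.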
3.3. Time Complexity
When we assume that the numbers of input channels are smaller than the feature sizes, M ≥ N > K > |{y_j}| ≥ |{x_i}|, the time complexity of bilinear attention networks is the same as the case with one multi-channel input, O(KM|{y_j}|), for a single-glimpse model, since the bilinear attention networks consist of matrix chain multiplications and exploit the low-rank factorization of low-rank bilinear pooling.

In our experiment using an NVIDIA Titan Xp, the training time per epoch of the one-glimpse BAN is approximately 284 seconds, compared to 190 seconds for the control (Teney et al., 2017). The increase is largely due to the larger softmax input for the attention distribution, which grows from |{y_j}| to |{x_i}| × |{y_j}|. However, the validation score improves significantly (+1.28%). The experimental results are shown in Table 1 as Bottom-Up (Ours) for the control and BAN-1 for our model.
4. Related Works
4.1. Attention for Multimodal Learning Task or VQA
Xu & Saenko (2016) proposed the spatial memory network model, where the attention map is produced by estimating the correlation of every image patch with the individual words in the question. More specifically, the attention operation
[Diagram: the embedded question X^T U (ρ×K) and embedded image V^T Y (K×φ), combined by element-wise multiplication with the projection and matrix multiplication, yield the ρ×φ attention map.]
Bilinear Attention Maps
• Exactly the same approach as low-rank bilinear pooling
Bilinear Attention Maps
• Multiple bilinear attention maps are acquired by different projection vectors p_g as:

A_g := softmax( ((1 · p_g^T) ∘ X^T U) V^T Y )    (17)
Bilinear Attention Networks
• Each multimodal joint feature is computed with the following equation (k is the index over K; broadcasting in PyTorch lets you avoid a for-loop):

f'_k = (X^T U')_k^T A (Y^T V')_k    (10)
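The no-for-loop remark can be illustrated in NumPy (einsum standing in for the PyTorch broadcasting the slide mentions, same idea): the loop over k in the equation above collapses to one contraction. Dimensions are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
K, rho, phi = 4, 5, 7
XU = rng.standard_normal((rho, K))   # X^T U' : one column per k
YV = rng.standard_normal((phi, K))   # Y^T V'
A = rng.standard_normal((rho, phi))  # bilinear attention map

# loop version: f'_k = (X^T U')_k^T A (Y^T V')_k
f_loop = np.array([XU[:, k] @ A @ YV[:, k] for k in range(K)])

# vectorized version: no Python loop over k
f_vec = np.einsum('ik,ij,jk->k', XU, A, YV)
assert np.allclose(f_loop, f_vec)
```

Both versions contract the ρ and φ axes against A and keep only the rank index k, producing the K-dimensional joint feature f'.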
※ Broadcasting: tensor operations are automatically repeated along size-1 dimensions at the API level, as supported by NumPy, TensorFlow, and PyTorch.
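A tiny NumPy illustration of the broadcasting mentioned in the note (the shapes are arbitrary examples):

```python
import numpy as np

a = np.arange(3).reshape(3, 1)   # shape (3, 1)
b = np.arange(4).reshape(1, 4)   # shape (1, 4)

# broadcasting expands both operands to (3, 4) without explicit tiling,
# so the element-wise product is the outer product of the two vectors
outer = a * b
assert np.array_equal(outer, np.outer(np.arange(3), np.arange(4)))
```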
[Diagram: for each k, the k-th columns of X^T U' (ρ×1) and Y^T V' (φ×1) are combined with the ρ×φ attention map to fill the k-th element of f'.]
Bilinear Attention Networks
• One can show that this is equivalent to a bilinear attention model where each feature is pooled by a low-rank bilinear approximation.
Bilinear Attention Networks
To reduce both input channels simultaneously, we introduce an attention map A ∈ R^{|{x_i}| × |{y_j}|} as follows:

f′_k = (Xᵀ U′)_kᵀ A (Yᵀ V′)_k   (10)

where (Xᵀ U′)_k ∈ R^{|{x_i}|}, (Yᵀ V′)_k ∈ R^{|{y_j}|}, and f′_k denotes the k-th element of the intermediate representation. The subscript k for matrices indicates the index of a column. Notice that Equation 10 is similar to Equation 1, since it is the bilinear model for the two groups of input channels with A as its weight matrix. Equation 10 can be rewritten as:

f′_k = Σ_{i=1}^{|{x_i}|} Σ_{j=1}^{|{y_j}|} A_{i,j} (X_iᵀ U′_k)(V′_kᵀ Y_j)   (11)
     = Σ_{i=1}^{|{x_i}|} Σ_{j=1}^{|{y_j}|} A_{i,j} X_iᵀ (U′_k V′_kᵀ) Y_j   (12)

where X_i and Y_j denote the i-th channel (column) of input X and the j-th channel (column) of input Y, respectively; U′_k and V′_k denote the k-th columns of the U′ and V′ matrices, respectively; and A_{i,j} denotes the element in the i-th row and j-th column of A. The V′_kᵀ Y_j in Equation 11 is transposed because it is a scalar. Notice that, for each pair of channels, the rank-1 bilinear representation of two feature vectors is modeled by X_iᵀ (U′_k V′_kᵀ) Y_j in Equation 12 (eventually rank-K bilinear pooling for f′ ∈ R^K).

Then the joint representation f ∈ R^C is defined as:

f = P′ᵀ f′   (13)

where P′ ∈ R^{K×C}. For convenience, we define the bilinear attention networks as a function of two multi-channel inputs:

f = BAN(X, Y)   (14)

since the bilinear attention map is also the output of a function of the two multi-channel inputs, as we will see in the following section.
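As a numerical sanity check, the equivalence of Equation 10 and its expansion in Equation 12 can be verified with a minimal NumPy sketch; all matrices and sizes below are random and illustrative, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, rho, phi, K = 6, 7, 3, 4, 5   # feature sizes, channel counts, rank

X = rng.standard_normal((N, rho))   # N x |{x_i}|
Y = rng.standard_normal((M, phi))   # M x |{y_j}|
U = rng.standard_normal((N, K))     # stands in for U'
V = rng.standard_normal((M, K))     # stands in for V'
A = rng.standard_normal((rho, phi)) # attention map (unnormalized here)

# Equation 10: f'_k = (X^T U')_k^T  A  (Y^T V')_k  -- one matrix chain per k
f10 = np.array([(X.T @ U)[:, k] @ A @ (Y.T @ V)[:, k] for k in range(K)])

# Equation 12: f'_k = sum_ij A_ij X_i^T (U'_k V'_k^T) Y_j  -- rank-1 terms
f12 = np.array([
    sum(A[i, j] * (X[:, i] @ np.outer(U[:, k], V[:, k]) @ Y[:, j])
        for i in range(rho) for j in range(phi))
    for k in range(K)
])

assert np.allclose(f10, f12)
```

The chain form of Equation 10 is what gives the favorable time complexity discussed in Section 3.3; the double sum of Equation 12 is only for exposition.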
3.1. Bilinear Attention Map

Now we want to obtain the attention map, similarly to Equation 7. Using the Hadamard product and matrix multiplication, the attention map A is defined as:

A := softmax( ((1 · pᵀ) ∘ Xᵀ U) Vᵀ Y )   (15)

where 1 ∈ R^{|{x_i}|}, p ∈ R^d, and recall that A ∈ R^{|{x_i}| × |{y_j}|}. The softmax function is applied to the matrix A element-wise. Each logit A_{i,j} before the softmax function is:

A_{i,j} = pᵀ ((Uᵀ X_i) ∘ (Vᵀ Y_j))   (16)

This can be extended to multiple bilinear attention maps as follows:

A_g := softmax( ((1 · p_gᵀ) ∘ Xᵀ U) Vᵀ Y )   (17)

where U and V are shared across the attention maps, and the subscript g denotes the index of glimpses.
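The logits of Equation 15 amount to one broadcasted elementwise product followed by a matrix product. The illustrative NumPy sketch below also checks Equation 16 element by element; the sizes and the whole-matrix softmax normalization are assumptions of the sketch:

```python
import numpy as np

def softmax_all(z):
    """Softmax over all entries of a matrix (one distribution over the grid)."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
N, M, d, rho, phi = 6, 7, 5, 3, 4

X = rng.standard_normal((N, rho))
Y = rng.standard_normal((M, phi))
U = rng.standard_normal((N, d))
V = rng.standard_normal((M, d))
p = rng.standard_normal(d)

# Equation 15: logits = ((1 . p^T) o X^T U) V^T Y
logits = ((np.ones((rho, 1)) * p) * (X.T @ U)) @ (V.T @ Y)
A = softmax_all(logits)

# Equation 16: each logit is p^T ((U^T X_i) o (V^T Y_j))
for i in range(rho):
    for j in range(phi):
        assert np.isclose(logits[i, j],
                          p @ ((U.T @ X[:, i]) * (V.T @ Y[:, j])))
```

Expanding the product term by term shows why the two forms agree: entry (i, j) of the chain is Σ_t p_t (U_t · X_i)(V_t · Y_j).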
3.2. Residual Learning of Attention

Inspired by multimodal residual networks (MRN; Kim et al., 2016), we propose a variation of MRN to integrate the joint representations from the multiple bilinear attention maps. The (i+1)-th output is defined as:

f_{i+1} = BAN_i(f_i, Y; A_i) · 1ᵀ + α_i f_i   (18)

where f_0 = X, 1 ∈ R^{|{x_i}|}, and α_i is a scalar parameter to learn, initialized to one. In this way, the size of f_i keeps matching the size of X as successive attention maps are processed. To get the logits for a classifier, we sum over the channel dimension of f_G, where G is the number of glimpses.
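The residual stacking of Equation 18 can be sketched as follows. Everything here is illustrative: `ban_step` is a stand-in for the real BAN_i pooling (Equations 10 to 13), kept trivial so the update rule itself is visible:

```python
import numpy as np

rng = np.random.default_rng(2)
N, rho, G = 6, 3, 4                  # feature size, |{x_i}|, number of glimpses

def ban_step(f, Y, A):
    """Stand-in for BAN_i(f, Y; A_i): returns a joint vector in R^N.
    The real model pools via Equations 10-13; a cheap pooling keeps this short."""
    return np.tanh(f.mean(axis=1))   # shape (N,)

Y = rng.standard_normal((7, 4))
f = rng.standard_normal((N, rho))    # f_0 = X
ones = np.ones(rho)
alpha = np.ones(G)                   # alpha_i initialized to one

for i in range(G):
    A_i = None                       # attention map for glimpse i (omitted here)
    # Equation 18: broadcast the joint vector back to the shape of X, plus shortcut
    f = np.outer(ban_step(f, Y, A_i), ones) + alpha[i] * f

logits_in = f.sum(axis=1)            # sum over the channel dimension of f_G
```

The outer product with the ones vector is what keeps f_i the same shape as X, so the next glimpse can attend over it again.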
3.3. Time Complexity

Assuming that the numbers of input channels are smaller than the feature sizes, M ≥ N > K > |{y_j}| ≥ |{x_i}|, the time complexity of bilinear attention networks is the same as in the single multi-channel-input case, O(KM|{y_j}|), for a single-glimpse model, since bilinear attention networks consist of matrix chain multiplications and exploit the low-rank factorization of low-rank bilinear pooling.

In our experiment using an NVIDIA Titan Xp, the time per epoch of one-glimpse BAN is approximately 284 seconds, versus 190 seconds for the control (Teney et al., 2017). The increase is largely due to the larger softmax input for the attention distribution, from |{y_j}| to |{x_i}| × |{y_j}|. However, the validation score is significantly improved (+1.28%). The experimental results are shown in Table 1 as Bottom-Up (Ours) for the control and BAN-1 for our model.
4. Related Works
4.1. Attention for Multimodal Learning Task or VQA
Xu & Saenko (2016) proposed the spatial memory network model, where the attention map is produced by estimating the correlation of every image patch with individual words in the question. More specifically, the attention operation
low-rank bilinear pooling
Bilinear Attention Networks
• One can show that this is equivalent to a bilinear attention model where each feature is pooled by low-rank bilinear approximation
• Low-rank bilinear feature learning inside bilinear attention
low-rank bilinear pooling
Bilinear Attention Networks
• One can show that this is equivalent to a bilinear attention model where each feature is pooled by low-rank bilinear approximation
• Low-rank bilinear feature learning inside bilinear attention
• Similarly to MLB (Kim et al., ICLR 2017), activation functions can be applied
Bilinear Attention Networks
[Architecture diagram: X and Y each pass through Linear+ReLU projections; their product (with transposes) forms the bilinear attention logits, normalized by Softmax; the attended features pass through further Linear+ReLU layers and a final Linear classifier.]
Time Complexity
• Assuming that M ≥ N > K > φ ≥ ρ, the time complexity of bilinear attention networks is O(KMφ), where K denotes the hidden size, since BAN consists of matrix chain multiplications
• Empirically, BAN takes 284s/epoch while unitary attention control takes 190s/epoch
• This is largely due to the increased input size of the Softmax function, from φ to ρ × φ
Residual Learning of Attention
• Residual learning exploits multiple attention maps (f_0 is X and {α_i} is fixed to ones):
[Diagram labels: shortcut; bilinear attention map; bilinear attention networks (repeated {# of tokens} times); att_1]
Overview
[Diagram: overall BAN pipeline with a ρ×φ bilinear attention map]
Step 1. Bilinear Attention Maps
[Diagram: the attention logits are built from XᵀU, the projection vector p, and VᵀY (Eq. 15); the pooled features use U′ᵀX and YᵀV′ with K = N, and U′′ᵀX′ and YᵀV′′ for the second glimpse]
Residual Learning
Step 2. Bilinear Attention Networks
• After getting bilinear attention maps, we can stack multiple BANs.
[Diagram: all GRU hidden states of the question and object-detection features feed stacked BAN blocks (Att_1, Att_2, …); each BAN output is repeated 1→ρ and added back to X as a residual shortcut; the final representation is sum-pooled and fed to an MLP classifier.]
Example question: "What is the mustache made of ?"
Multiple Attention Maps
Model                            Validation VQA 2.0 Score   Δ%
Bottom-Up (Teney et al., 2017)   63.37
BAN-1                            65.36                      +1.99
BAN-2                            65.61                      +0.25
BAN-4                            65.81                      +0.20
BAN-8                            66.00                      +0.19
BAN-12                           66.04                      +0.04
• Single model on validation score
Residual Learning
Model              Validation VQA 2.0 Score   Δ
BAN-4 (Residual)   65.81 ±0.09
BAN-4 (Sum)        64.78 ±0.08                -1.03
BAN-4 (Concat)     64.71 ±0.21                -0.07

Sum:    Σ_i BAN_i(X, Y; A_i)
Concat: ∥_i BAN_i(X, Y; A_i)
Comparison with Co-attention
Model                Validation VQA 2.0 Score   Δ
Bilinear Attention   65.36 ±0.14
Co-Attention         64.79 ±0.06                -0.57
Attention            64.59 ±0.04                -0.20
※ The number of parameters is controlled (all comparison models have 32M parameters).
Comparison with Co-attention
※ The number of parameters is controlled (all comparison models have 32M parameters).
Training curves / Validation curves

[Figures: attention entropy of Att_1 to Att_4 over training epochs for 2-, 4-, and 8-glimpse models; validation score vs. the number of used glimpses for BAN-1/2/4/8/12; training and validation curves for bi-att, co-att, and uni-att; validation score vs. the number of parameters (0M to 60M) for Uni-Att, Co-Att, BAN-1, BAN-4, and BAN-1+MFB.]
Visualization
Integration of Counting module with BAN
• The Counter (Zhang et al., 2018) takes a multinoulli distribution over detected boxes together with the box coordinates
• Our method: maxout from the bilinear attention distribution to a unitary attention distribution, which is easy to adapt to multitask learning
[Diagram: the ρ×φ bilinear attention logits are maxed out over the ρ word dimension to a length-φ unitary distribution; Softmax feeds the bilinear attention, while Maxout + Sigmoid feed the counting module. Labels: linear embedding of the i-th output of the counter module; residual learning of attention; inference steps for multi-glimpse attention.]
Bilinear Attention Networks — Appendix
A Variants of BAN
A.1 Enhancing glove word embedding
We augment each 300-dimensional GloVe word embedding with a computed 300-dimensional word embedding. The computation is as follows: 1) we choose two arbitrary words w_i and w_j from each question found in the VQA and Visual Genome datasets, or from each caption in the MS COCO dataset; 2) we increase the value of A_{i,j} by one, where A ∈ R^{V′×V′} is an association matrix initialized with zeros. Notice that i and j can be indices outside the vocabulary V; the size of the vocabulary in this computation is denoted by V′; 3) to penalize highly frequent words, each row of A is divided by the number of sentences (questions or captions) that contain the corresponding word; 4) each row is normalized by the sum of all its elements; 5) we calculate W′ = A · W, where W ∈ R^{V′×E} is the GloVe word embedding matrix and E is the size of the word embedding, i.e., 300. Therefore, W′ ∈ R^{V′×E} stands for the mixed word embeddings of semantically close words; 6) finally, we select the V rows of W′ corresponding to the vocabulary in our model and append them to the previous word embeddings, which makes 600-dimensional word embeddings in total. The input size of the GRU is increased to 600 to match these word embeddings. These word embeddings are fine-tuned.

As a result, this variant significantly improves the performance to 66.03 (±0.12), compared with 65.72 (±0.11) obtained by augmenting the same 300-dimensional GloVe word embeddings (so the number of parameters is controlled). In this experiment, we use a four-glimpse BAN and evaluate on the validation split. The standard deviation is calculated over three randomly initialized models, and the means are reported. The result on the test-dev split can be found in Table 3 as BAN+Glove.
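The six steps above can be sketched as follows. The toy corpus, sizes, and variable names are illustrative stand-ins, not the released code:

```python
import numpy as np

# Toy corpus and vocabulary (real code uses VQA/Visual Genome/COCO sentences)
sentences = [["what", "color", "is", "the", "cat"],
             ["the", "cat", "sat"]]
vocab = sorted({w for s in sentences for w in s})
idx = {w: k for k, w in enumerate(vocab)}
Vp, E = len(vocab), 4                   # V', embedding size (300 in the paper)
W = np.random.rand(Vp, E)               # stands in for the GloVe matrix

# Steps 1-2: co-occurrence counts of word pairs within each sentence
A = np.zeros((Vp, Vp))
for s in sentences:
    for wi in s:
        for wj in s:
            if wi != wj:
                A[idx[wi], idx[wj]] += 1

# Step 3: divide each row by the number of sentences containing that word
n_sent = {w: sum(w in s for s in sentences) for w in vocab}
for w in vocab:
    A[idx[w]] /= n_sent[w]

# Step 4: row-normalize; Step 5: mix embeddings of co-occurring words
A /= A.sum(axis=1, keepdims=True)
W_mixed = A @ W                         # W' = A . W

# Step 6: concatenate with the originals -> 2E dims (600 in the paper)
W_aug = np.concatenate([W, W_mixed], axis=1)
```

Each row of `W_aug` pairs a word's own embedding with a weighted average of the embeddings of words it co-occurs with.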
A.2 Integrating counting module
The counting module [41] is proposed to improve performance on counting tasks. It is a neural network component that produces a dense representation from the spatial information of the detected objects, i.e., the top-left and bottom-right positions of the φ proposed objects (rectangles), denoted by S ∈ R^{4×φ}. The interface of the counting module is defined as:

c = Counter(S, α)   (12)

where c ∈ R^{φ+1} and α ∈ R^φ is the logits of the corresponding objects for the sigmoid function inside the counting module. We found that defining α as the column-wise maximum of A in Equation 9 (i.e., the maximum value in each column vector of A) works better than summation. Since the counting module does not support a variable number of objects, we select the top-10 objects for the input, instead of all φ objects, based on the values of α.
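A small sketch of the maxout and top-10 selection described above; the data here are random, whereas the real module consumes actual box coordinates and attention maps:

```python
import numpy as np

rng = np.random.default_rng(3)
rho, phi, top = 3, 15, 10               # words, detected boxes, boxes kept

A = rng.random((rho, phi))              # bilinear attention map, rho x phi
S = rng.random((4, phi))                # box corner coordinates, 4 x phi

# alpha_j = max over words: collapse the bilinear map to a unitary distribution
alpha = A.max(axis=0)

# the counting module takes a fixed number of boxes: keep the top-10 by alpha
keep = np.argsort(alpha)[::-1][:top]
alpha_top, S_top = alpha[keep], S[:, keep]
```

Taking the column-wise maximum keeps, for each box, its strongest association with any question word, which is the hand-off from bilinear to unitary attention.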
The BAN integrated with the counting module is defined as:

f_{i+1} = (BAN_i(f_i, Y; A_i) + g_i(c_i)) · 1ᵀ + f_i   (13)

where the function g(·) is a linear embedding followed by a ReLU activation. Note that a dropout layer before this linear embedding severely hurts performance.

As a result, this variant significantly improves the counting performance from 54.92 (±0.30) to 58.21 (±0.49), while overall performance improves from 65.81 (±0.09) to 66.01 (±0.14) in a controlled experiment using a vanilla four-glimpse BAN. The definition of the subset of counting questions comes from previous work [30]. The result on the test-dev split can be found in Table 3 as BAN+Glove+Counter, to which the previous embedding variant is also applied.
Comparison with State-of-the-arts
Model                             Test-dev VQA 2.0 Score   Δ%
Prior                             25.70
Language-Only                     44.22                    +18.52
MCB (ResNet)                      61.96                    +17.74
Bottom-Up (FRCNN)                 65.32                    +3.36
MFH (ResNet)                      65.80                    +0.48
MFH (FRCNN)                       68.76                    +2.96
BAN (Ours; FRCNN)                 69.52                    +0.76
BAN-Glove (Ours; FRCNN)           69.66                    +0.14
BAN-Glove-Counter (Ours; FRCNN)   70.04                    +0.38
• Single model on test-dev score
[Slide annotations: image feature; attention model; 2016 winner; 2017 winner; 2017 runner-up; counting feature]
Counting questions (test-dev, Numbers):
Zhang et al. (2018)   51.62
Ours                  54.04
Flickr30k Entities
• Visual grounding task — mapping entity phrases to regions in an image
[/EN#40120/people A girl] in [/EN#40122/clothing a yellow tennis suit] , [/EN#40125/other green visor] and [/EN#40128/clothing white tennis shoes] holding [/EN#40124/other a tennis racket] in a position where she is going to hit [/EN#40121/other the tennis ball] .
[Figure annotation: one predicted box overlaps its ground truth by only 0.48, below the 0.5 threshold]
Flickr30k Entities
• Visual grounding task — mapping entity phrases to regions in an image
[/EN#38656/people A male conductor] wearing [/EN#38657/clothing all black] leading [/EN#38653/people a orchestra] and [/EN#38658/people choir] on [/EN#38659/scene a brown stage] playing and singing [/EN#38664/other a musical number] .
Flickr30k Entities Recall@1,5,10
Model                   R@1     R@5     R@10
Zhang et al. (2016)     27.0    49.9    57.7
SCRC (2016)             27.8    -       62.9
DSPE (2016)             43.89   64.46   68.66
GroundeR (2016)         47.81   -       -
MCB (2016)              48.69   -       -
CCA (2017)              50.89   71.09   75.73
Yeh et al. (2017)       53.97   -       -
Hinami & Satoh (2017)   65.21   -       -
BAN (ours)              66.69   84.22   86.35
Flickr30k Entities Recall@1,5,10
Type                    people   clothing   bodyparts   animals   vehicles   instruments   scene   other
#Instances              5,656    2,306      523         518       400        162           1,619   3,374
GroundeR (2016)         61.00    38.12      10.33       62.55     68.75      36.42         58.18   29.08
CCA (2017)              64.73    46.88      17.21       65.83     68.75      37.65         51.39   31.77
Yeh et al. (2017)       68.71    46.83      19.50       70.07     73.75      39.50         60.38   32.45
Hinami & Satoh (2017)   78.17    61.99      35.25       74.41     76.16      56.69         68.07   47.42
BAN (ours)              79.90    74.95      47.23       81.85     76.92      43.00         68.69   51.33
2018 VQA Challenge: Joint Runner-Up (Team SNU-BI)
Challenge accuracy: 71.69
Jin-Hwa Kim (Seoul National University), Jaehyun Jun (Seoul National University), Byoung-Tak Zhang (Seoul National University & Surromind Robotics)
Conclusions
• Bilinear attention networks gracefully extend unitary attention networks,
• using low-rank bilinear pooling inside bilinear attention.
• Furthermore, residual learning of attention efficiently uses multiple attention maps.
2018 VQA Challenge runners-up (shared 2nd place); 1st single model (70.35 on test-standard)
Thank you! Any questions?
We would like to thank Kyoung-Woon On, Bohyung Han, and Hyeonwoo Noh for helpful comments and discussion. This work was supported by 2017 Google Ph.D. Fellowship in Machine Learning and Ph.D. Completion Scholarship from
College of Humanities, Seoul National University, and the Korea government (IITP-2017-0-01772-VTT, IITP-R0126-16-1072-SW.StarLab, 2018-0-00622-RMI, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF).
The part of computing resources used in this study was generously shared by Standigm Inc.
arXiv:1805.07932
Code & pretrained model in PyTorch:
github.com/jnhwkim/ban-vqa