
Geometry-Entangled Visual Semantic Transformer for Image Captioning

Ling Cheng, Wei Wei, Feida Zhu, Yong Liu, Member, IEEE, and Chunyan Miao, Senior Member, IEEE

Abstract—Recent advances in image captioning have featured either Visual-Semantic Fusion or Geometry-Aid attention refinement. However, fusion-based models are still criticized for lacking geometry information for inter- and intra-attention refinement, while models based on Geometry-Aid attention still suffer from the modality gap between visual and semantic information. In this paper, we introduce a novel Geometry-Entangled Visual Semantic Transformer (GEVST) network that realizes the complementary advantages of Visual-Semantic Fusion and Geometry-Aid attention refinement. Concretely, a Dense-Cap model first proposes dense captions with corresponding geometry information. Then, to empower GEVST to bridge the modality gap between visual and semantic information, we build four parallel transformer encoders, VV (Pure Visual), VS (Semantic fused to Visual), SV (Visual fused to Semantic), and SS (Pure Semantic), for final caption generation. Both visual and semantic geometry features are used in the Fusion module as well as in the Self-Attention module for better attention measurement. To validate our model, we conduct extensive experiments on the MS-COCO dataset; the experimental results show that our GEVST model obtains promising performance gains.

Index Terms—Image captioning, Dense captioning, Multi-modal interaction, Content-Geometry fusion, Content-Geometry attention.

I. INTRODUCTION

Recent advances in deep learning have resulted in great progress in both the computer vision and natural language processing communities. These achievements make it possible to connect vision and language, and facilitate multi-modal learning tasks such as image-text matching, visual question answering, visual grounding, and image captioning. Image captioning aims to automatically describe an image's content using a natural language sentence. The task is challenging since it requires one to recognize key objects in an image and to understand their relationships with each other. Inspired by machine translation, the encoder-decoder architecture has become the most successful approach to the image captioning task. The encoder encodes the visual (object region or grid visual features) or semantic (attribute words) information into feature vectors. Given the input features, the decoder iteratively generates the output caption words. Based on this architecture, various improvements have been applied to the encoder, the decoder, and the training method to further improve performance.

Previously, three kinds of methods have been used to improve such models. The first is based on the attention mechanism.

This work was supported in part by the National Natural Science Foundation of China under Grant 61602197.

To establish fine-grained connections between caption words and their related image regions, attention can be seamlessly inserted into the framework, and many works focus on improving the attention measurement to enhance the interaction between visual content and the natural sentence and thus boost image captioning. The second treats image captioning as a multi-modal problem: by gradually fusing visual and semantic information, or by exploiting semantic and visual information simultaneously, such models can identify the equivalent visual signals, especially when predicting highly abstract words. The last aims to address the exposure bias introduced by training generated captions with the cross-entropy loss; reinforcement learning (RL)-based algorithms [1], [2], [3] are designed to directly optimize the non-differentiable evaluation metrics (e.g., BLEU, METEOR, ROUGE, CIDEr, and SPICE [4], [5], [6], [7], [8]).

Although these reinforcement-learning-based algorithms are compatible with different models, there exist two main limitations in both attention-focused and multi-modal-focused mechanisms: 1) Even with a visual sentinel mechanism, typical attention-focused mechanisms are still incapable of inter-modal interactions, so it is hard for them to identify the equivalent visual signals, especially when predicting highly abstract words. 2) Multi-modal-focused models typically fuse the visual branch (object region or grid visual features) and the semantic branch (attribute words) with concatenation or summation; these methods are usually shallow and may fail to fully capture the complex relationships among information from two different modalities. Besides, even when an attention mechanism is used in each branch, the primitive semantic features have no geometry information as guidance, so noise from the other modality deteriorates the attention measurement.

To address the limitation of attention-focused models, we extend the Transformer model for machine translation to a Visual-Semantic Transformer model for image captioning. Different from previous Transformer-based captioning models, there are four parallel branches in the encoder (VV: Pure Visual, VS: Semantic fused to Visual, SV: Visual fused to Semantic, SS: Pure Semantic). Each branch stands for a different ratio of visual information to semantic information. By dynamically adjusting each branch's contribution at different time steps, the model can not only maintain sophisticated attention measurement in each branch, but can also generate words with abstract semantic meaning. To address the limitation of multi-modal-focused models, rather than using a MIL model to extract attribute words, we use a Dense-Caption model to extract dense captions, each of which describes a sub-image and comes with a bounding box. By properly feeding both content and geometry information into a fusion module, the model can not only encode this complex multi-modal relationship to a deeper degree, but can also generate inter-geometry features between the two modalities.


Then, inter- and intra-geometry information contributes to a better attention measurement in each self-attention encoder branch. We therefore call the model the Geometry-Entangled Visual-Semantic Transformer (GEVST).

To summarize, the main contributions of this study are three-fold:

• Inter- and intra-geometry and content relations in both fusion and self-attention are first proposed in the GEVST model. Our model is not only capable of fusing the two modalities based on both content and geometry information; by feeding both inter- and intra-geometry features into the self-attention module, it also obtains a better attention measurement, which improves image captioning performance.

• To dynamically provide a more fine-grained ratio and higher-order interaction between visual and semantic information, we perform cross-attention with four encoding branches; the outputs of these branches, each with a different visual-semantic ratio, are then summed together after modulation.

• Extensive experiments on the benchmark MSCOCO [9], [10] image captioning dataset are conducted to quantitatively and qualitatively prove the effectiveness of the proposed models. The experimental results show that the GEVST significantly outperforms previous state-of-the-art approaches with a single model.

The rest of the paper is organized as follows. In Section II, we review the related work on image captioning, especially attention-focused and multi-modal-focused mechanisms. In Section III, we introduce the proposed method: the visual-semantic fusion module, which takes both content and geometry information into consideration, the Geometry-Entangled self-attention module, and the architecture of the four parallel branches in our GEVST model. In Section IV, we report extensive experiments on the benchmark MSCOCO image captioning dataset to evaluate the proposed approaches. Finally, we conclude this work in the last section.

II. RELATED WORK

In this section, we briefly review the most relevant research on image captioning, especially attention-focused and multi-modal-focused mechanisms.

A. Image Captioning

Previously, research on image captioning could mainly be divided into three categories: template-based approaches [11], [12], retrieval-based approaches [13], [14], [15], and generation-based approaches [16], [17], [18], [19], [20]. Template-based approaches address the task with a two-stage strategy: they first detect the objects, their attributes, and object relationships in an image, and then fill the extracted information into a pre-designed sentence template. Both Conditional Random Fields (CRF) and Hidden Markov Models (HMM) [21], [22] have been used to predict labels based on the detected objects, attributes, and prepositions, and then generate caption sentences with a template.

Obviously, the captions generated by template-based approaches depend heavily on the quality of the templates and usually follow fixed syntactical structures. As a result, the diversity of the generated captions is severely restricted, and they are often rigid and lack naturalness. One way to ease the diversity problem is to directly retrieve target sentences for the query image from a large-scale caption database according to their cross-modal similarities in a multi-modal embedding space. The core idea [13] is to match image-caption pairs based on the alignment of visual segments and caption segments. For prediction, cross-modal matching over the whole caption database is performed to generate the caption for an image. Further improvements [14], [15] focus on different metrics or loss functions to learn the cross-modal matching model. However, the resulting captions are fixed and may fail to describe the object combinations in a new image. To increase caption diversity, the caption database has to be as large as possible, but a large database raises a serious retrieval-efficiency problem; thus, the diversity problem has not been completely resolved. Different from template-based and retrieval-based models, in recent years, with the success of sequence generation in machine translation [23], [24], generation-based models have emerged as a promising solution to image captioning. In this kind of architecture, the model is an end-to-end system with an encoder-decoder framework, so it can generate novel captions with more flexible syntactical structures. Vinyals et al. proposed an encoder (CNN-based) and decoder (LSTM-based) architecture as its backbone. After that, this kind of encoder-decoder architecture became the mainstream for generation-based approaches. Similar architectures were also proposed by Donahue et al. [25] and Karpathy et al. [20]. Due to their flexibility and excellent performance, most subsequent improvements are based on generation-based models.

B. Attention Focused Model

Within the encoder-decoder framework, one of the most important improvements for generation-based models is the attention mechanism. Xu et al. [16] proposed a soft/hard attention mechanism to automatically focus on salient regions when generating the corresponding words. The attention model is a pluggable module that can be seamlessly inserted into previous approaches to remarkably improve caption quality. The attention model has been further improved in [26], [27], [18], [28], [29], [30], [31], [32], [33]. Lu et al. presented an adaptive attention encoder-decoder model for automatically deciding when to rely on visual or language signals. Chen et al. proposed a spatial- and channel-wise attention model to attend to visual features. Anderson et al. introduced a bottom-up module that uses a pre-trained object detector to extract region-based image features, and a top-down module that utilizes soft attention to dynamically attend to these object regions. Gu et al. adopted multi-stage learning for coarse-to-fine image captioning by stacking three layers of LSTMs, where each layer of LSTM differs from the others. Zhou et al. designed a novel salience-enhanced re-captioning framework with two-phase learning for optimizing image caption generation.


Huang et al. proposed an attention-on-attention module that enhances visual attention by further measuring the relevance between the attention result and the query. Wang et al. proposed a hierarchical attention network to combine text, grids, and regions with a relation module that exploits the inherent relationship among diverse features. Herdade et al. utilized bounding boxes of regions to model location relationships between regions in a relative manner. Cornia et al. presented a mesh-like transformer structure to exploit both low-level and high-level contributions. All of these works confirm that improving the attention measurement is an effective way to enhance the interaction between visual content and the natural sentence and thus boost image captioning. However, typical attention mechanisms struggle to identify the equivalent visual signals, especially when predicting highly abstract words. This phenomenon is known as the semantic gap between vision and language.

C. Multi-Modal Focused Model

This problem of attention-focused models can be overcome by providing semantic attributes that are homologous to language [34], [17], [35], [36], [37], [38], [39]. Hao et al. use multiple instance learning to train a detector for words (nouns, verbs, and adjectives) that commonly occur in captions. Based on these detected words, You et al. leveraged high-level semantic information directly with semantic attention. It has been shown that the combination of the two complementary attention paradigms can alleviate the harmful impact of the semantic gap. Therefore, Li et al. proposed a two-layered LSTM in which visual and semantic attention are conducted separately at each layer. Yu et al. designed a multi-modal transformer model that simultaneously captures intra- and inter-modal interactions in a unified attention block. Huang et al. also introduced two modules to couple attribute detection with image captioning and to prompt attributes by predicting appropriate attributes at each time step. Another way to bridge the modality gap is to employ graph convolutional neural networks (GCN). Yao et al. built a GCN model to explore spatial and semantic relationships; then, by late fusion, they combined two LSTM language models that are independently trained on different modalities. Similarly, Xu et al. used a multi-modal graph convolution network to modulate scene graphs into visual representations. However, the interactions between the different modalities are usually shallow and may fail to fully capture the complex relationships between the two kinds of information. Besides, even when an attention mechanism is used in each branch, the semantic features have no geometry information as guidance.

D. Caption Related Task

Dense captioning is another task aimed at caption generation. A dense-captioning model produces intensive captions [40], [41], [42] over an image: besides giving a caption that describes a certain sub-image, it also gives the geometry information of this caption. Since the output is a complete sentence rather than attribute words, it can provide more advanced semantic and linguistic information.

Like image captioning, Visual Question Answering (VQA) also aims at handling the modality-crossing problem. In VQA, although the attention mechanism helps the model focus on the visual content relevant to the question, it is still insufficient for modeling complex reasoning. Rather than a simple attention mechanism, multi-modal fusion plays the key role in better performance, and many visual-semantic fusion methods [43], [44], [45], [46] have proved to be more powerful than simple concatenation or element-wise addition.

III. PROPOSED METHOD

In this section, we first briefly describe the preliminary knowledge of the Transformer model [47]. After that, we elaborate the data preparation. Then, we introduce the Geometry-Entangled Visual Semantic Transformer (GEVST) framework shown in Fig. 1, which can be conceptually divided into a fusion module, an encoder, and a decoder, where both the encoder and the decoder are made of stacks of attentive layers. The fusion module fuses visual and semantic information based on content and geometry information. The encoder is in charge of computing a deep representation of the input information in a self-attention manner. The decoder generates the output caption word by word.

A. The Transformer Model

The Transformer model was first proposed for machine translation and has been successfully applied to many natural language processing tasks. All intra- and inter-modality interactions between word and image-level features are modeled via scaled dot-product attention, without using recurrence. Attention operates on three sets of vectors, namely a set of queries Q, keys K, and values V, and takes a weighted sum of the value vectors according to a similarity distribution between the query and key vectors. In the case of scaled dot-product attention, the operator is formally defined as

F = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V,   (1)

where F is a matrix of n_q attended query vectors, Q is a matrix of n_q query vectors, K and V both contain n_k keys and values, all with the same dimension, and d is a scaling factor. Instead of performing a single attention function on the queries, multi-head attention allows the model to attend to diverse information from different representation sub-spaces. Multi-head attention contains h parallel 'heads', with each head corresponding to an independent scaled dot-product attention function. The attended features F of the multi-head attention function are given as follows:

F = \text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \cdots, H_h)W^{O},
H_i = \text{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}),   (2)

where W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d_h} are the independent head projection matrices and W^{O} \in \mathbb{R}^{d \times d} denotes the linear transformation. Note that the bias terms in linear layers are omitted for the sake of concise expression, and the subsequent descriptions follow the same principle.
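For concreteness, the following is a minimal PyTorch sketch of Eqs. (1)-(2). The class and function names are ours for illustration (not the authors' released code), and the default sizes d_model = 512 and h = 8 are taken from the implementation details reported in Section IV.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Eq. (1): softmax(Q K^T / sqrt(d)) V
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    return F.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    # Eq. (2): h parallel heads, each an independent scaled dot-product attention.
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_h = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, q, k, v):
        b, n_q, _ = q.shape
        n_k = k.size(1)
        # Project, then split into h heads: (batch, heads, length, d_h).
        q = self.w_q(q).view(b, n_q, self.h, self.d_h).transpose(1, 2)
        k = self.w_k(k).view(b, n_k, self.h, self.d_h).transpose(1, 2)
        v = self.w_v(v).view(b, n_k, self.h, self.d_h).transpose(1, 2)
        heads = scaled_dot_product_attention(q, k, v)
        heads = heads.transpose(1, 2).reshape(b, n_q, self.h * self.d_h)
        return self.w_o(heads)  # Concat(H_1, ..., H_h) W^O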


Fig. 1: Overview of our proposed GEVST approach. First, the visual encoder (object-detection model) and the semantic encoder (dense-caption model) generate object region features and dense captions with the corresponding geometry information. After the fusion module, we obtain the related geometry features between the two modalities (VS-related and SV-related). The decoder generates the output caption based on the outputs of the four encoder branches (VV, VS, SV, SS).

The Transformer is a deep end-to-end architecture that stacks attention blocks to form an encoder-decoder strategy. Both the encoder and the decoder consist of N attention blocks, and each attention block contains multi-head attention modules. The multi-head attention module learns attended features that consider the pairwise interactions between two input features. In the encoder, each attention block is self-attentional, such that the queries, keys, and values refer to the same input features. Each decoder block contains a self-attention layer and a guided-attention layer: it first models the self-attention of the given input features and then takes the output features of the last encoder attention block to guide the attention learning.

B. Data Preparation

As mentioned in Section I, our model is based on both a visual representation and a semantic representation. To model the given image at the visual level, each element \vec{v}_i of the visual feature set V is defined as the mean-pooled convolutional feature of the i-th salient region, which is a d_v-dimensional fixed-size feature vector (empirically set to 2048). These visual features are extracted from the output of the final convolution layer of a Faster-RCNN model [18], which is widely used in previous work [29], [30], [31], [32]. Thus, for the visual information, we have V = \{\vec{v}_i\}_{i=1}^{N_v} (\vec{v}_i \in \mathbb{R}^{d_v}) to denote the N_v visual feature vectors of dimension d_v. To obtain a semantic representation of the given image, we employ a dense-caption model to obtain a list of dense captions that are most likely to appear in the image. Following [40], for each image we get a list of dense captions together with their confidence scores and bounding boxes. Each dense caption is then encoded by a standard 3-layer transformer encoder, and the confidence scores are used to assign a weight to the corresponding phrases. Finally, we obtain a set of d_s-dimensional fixed-size semantic feature vectors, the i-th of which is denoted \vec{s}_i. Thus, we have S = \{\vec{s}_i\}_{i=1}^{N_s} (\vec{s}_i \in \mathbb{R}^{d_s}) to denote the N_s semantic dense-caption vectors of dimension d_s. The inherent geometric structure among the input visual objects and semantic objects is beneficial for reasoning, so the bounding box coordinates of both the visual and semantic features are fed into the model.
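As an illustration of this pipeline, the sketch below encodes each dense caption with a 3-layer Transformer encoder, mean-pools the token outputs, and weights each phrase by its Dense-Cap confidence score. Tokenisation, the exact form of the weighting, and the feature dimension (set directly to 512 here, matching the projected size used later) are our assumptions, not the authors' specification.

import torch
import torch.nn as nn

class DenseCaptionEncoder(nn.Module):
    # Illustrative sketch of the semantic branch preparation: a 3-layer
    # Transformer encoder, mean pooling over tokens, confidence weighting.
    def __init__(self, vocab_size, d_s=512, n_layers=3, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_s)
        layer = nn.TransformerEncoderLayer(d_model=d_s, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, caption_tokens, confidence):
        # caption_tokens: (N_s, L) token ids of the N_s dense captions.
        # confidence: (N_s,) Dense-Cap confidence scores.
        x = self.encoder(self.embed(caption_tokens))   # (N_s, L, d_s)
        s = x.mean(dim=1)                              # mean of encoder outputs
        return confidence.unsqueeze(-1) * s            # weighted phrase features

# The visual side is a set of N_v Faster-RCNN region features (2048-d), and
# the bounding boxes of both modalities are kept as geometry inputs.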

C. Visual-Semantic Fusion

Geometry-Content Fusion Cell. A given visual feature has different relationships with different dense captions, and these relations can be divided into two types: the content relation and the geometry relation. Thus, to fuse features from different modalities, we propose a Stack Geometry-Content Fusion module. For example, when fusing semantic information into visual information (VS-Branch), both the visual and semantic bounding-box vectors are first projected into high-dimensional representations V^{geo} and S^{geo}, as shown in Fig. 2. Each fusion cell contains a geometry attention layer and a content attention layer to encode the two kinds of relation simultaneously. To select relevant semantic features for the i-th visual content feature V_i^{con}, both the content and the geometry attention layers output an attention map; the two maps are then added together to generate the weighted sum of semantic content features S_i.

a_{i,j}^{con} = W^{con}\tanh(W_s^{con}S_j^{con} + W_v^{con}V_i^{con}),   (3)

a_{i,j}^{geo} = W^{geo}\tanh(W_s^{geo}S_j^{geo} + W_v^{geo}V_i^{geo}),   (4)

(\alpha_{i,1}^{con}, \cdots, \alpha_{i,N_s}^{con}) = \text{Softmax}(a_{i,1}^{con}, \cdots, a_{i,N_s}^{con}),   (5)

(\alpha_{i,1}^{geo}, \cdots, \alpha_{i,N_s}^{geo}) = \text{Softmax}(a_{i,1}^{geo}, \cdots, a_{i,N_s}^{geo}),   (6)

S_i^{con} = \sum_{j=1}^{N_s}(\alpha_{i,j}^{con} + \alpha_{i,j}^{geo})S_j^{con}.   (7)


Fig. 2: Geometry-Content Fusion Cell (above) and Stack Fusion module (below) for the VS-Branch, where the semantic information is fused into the visual information. For each visual feature \{\vec{v}_i\}_{i=1}^{N_v}, the attention weights of the Geometry-Attention and the Content-Attention are summed to generate the final attended semantic feature. Besides, the weighted summation of the semantic geometry features is used to represent the inter-geometry information for the given visual feature.

Then, both the i-th visual content feature and the attended semantic content feature are expanded to high-dimensional representations V_i and S_i. After element-wise multiplication and a sum-pooling function with stride equal to the expansion rate E_r, we get the i-th content-based fused feature:

V_i = W_v^{Exp}V_i^{con},   (8)

S_i = W_s^{Exp}S_i^{con},   (9)

F_i = \text{SumPooling}(S_i \otimes V_i),   (10)

where W_v^{Exp} \in \mathbb{R}^{(E_r \cdot d_o) \times d_o} and W_s^{Exp} \in \mathbb{R}^{(E_r \cdot d_o) \times d_o} are the expansion matrices for the visual and semantic branches.

Notice that the geometry attention weight map provides the geometry relation between the two modalities; thus, the weighted sum of the semantic geometry features, S_i^{geo}, can be seen as the inter-geometry feature V_i^{inter} of the given visual feature, describing the geometry relation between the two modalities:

V_i^{inter} = S_i^{geo} = \sum_{j=1}^{N_s}\alpha_{i,j}^{geo}S_j^{geo}.   (11)

For clarity and consistency in the following sections, the visual geometry features V_i^{geo} used above are denoted as V_i^{intra}, which stands for the geometry features of the original modality. Thus, a visual object has a content feature V_i^{con}, an intra-geometry feature V_i^{intra}, and an inter-geometry feature V_i^{inter}; the same holds for every semantic object.
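The fusion cell of Eqs. (3)-(11) can be sketched as below for the VS-Branch. Layer names, the attention dimension, and the unbatched tensor shapes are illustrative assumptions; only the computation pattern (additive content/geometry attention, summed attention maps, expansion with element-wise product and sum pooling, and the inter-geometry output) follows the equations above.

import torch
import torch.nn as nn

class GeometryContentFusionCell(nn.Module):
    # Sketch of Eqs. (3)-(11) for the VS-Branch (semantic fused into visual).
    def __init__(self, d_o=512, d_g=64, d_att=512, e_r=5):
        super().__init__()
        self.e_r = e_r
        # Additive (tanh) attention for content and geometry, Eqs. (3)-(4).
        self.w_con_s = nn.Linear(d_o, d_att, bias=False)
        self.w_con_v = nn.Linear(d_o, d_att, bias=False)
        self.w_con = nn.Linear(d_att, 1, bias=False)
        self.w_geo_s = nn.Linear(d_g, d_att, bias=False)
        self.w_geo_v = nn.Linear(d_g, d_att, bias=False)
        self.w_geo = nn.Linear(d_att, 1, bias=False)
        # Expansion matrices of Eqs. (8)-(9).
        self.exp_v = nn.Linear(d_o, e_r * d_o, bias=False)
        self.exp_s = nn.Linear(d_o, e_r * d_o, bias=False)

    def forward(self, v_con, v_geo, s_con, s_geo):
        # v_con: (N_v, d_o), v_geo: (N_v, d_g), s_con: (N_s, d_o), s_geo: (N_s, d_g)
        a_con = self.w_con(torch.tanh(self.w_con_s(s_con).unsqueeze(0)
                                      + self.w_con_v(v_con).unsqueeze(1))).squeeze(-1)
        a_geo = self.w_geo(torch.tanh(self.w_geo_s(s_geo).unsqueeze(0)
                                      + self.w_geo_v(v_geo).unsqueeze(1))).squeeze(-1)
        alpha_con = a_con.softmax(dim=-1)          # Eq. (5), shape (N_v, N_s)
        alpha_geo = a_geo.softmax(dim=-1)          # Eq. (6)
        s_att = (alpha_con + alpha_geo) @ s_con    # Eq. (7), attended semantic content
        # Eqs. (8)-(10): expand, multiply element-wise, sum-pool with stride E_r.
        prod = self.exp_s(s_att) * self.exp_v(v_con)
        fused = prod.view(v_con.size(0), -1, self.e_r).sum(-1)
        v_inter = alpha_geo @ s_geo                # Eq. (11), inter-geometry feature
        return fused, v_inter

In the stacked version described next, the fused output would be skip-connected to the input visual features before being passed, together with the original semantic features, to the next cell.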

Stack Geometry-Content Fusion. To strengthen the fusion effect, a stacked structure is proposed, as shown in Fig. 2. Taking the VS-Branch as an example, to prevent gradients from vanishing or exploding, the fusion result is skip-connected to the input visual features; the updated visual features and the original semantic features are then fed to the next cell for further fusion. Similarly, for the SV-Branch, the fusion result is skip-connected to the former semantic features, and the updated semantic features and the original visual features are fed to the next fusion cell.

D. Geometry-Entangled Self-Attention

Apart from the content-based relationships among the input features, geometric structures are also beneficial for reasoning about their interactions in image captioning. Former works only use the intra-geometry relation, whereas the important inter-geometry relation between the two modalities is missed. Therefore, we propose a Geometry-Entangled Self-Attention operator.


Fig. 3: A schematic diagram of the Geometry-Entangled Self-Attention layer based on content and geometry interactions (content, intra-geometry, and inter-geometry queries and keys, the three attention maps weighted by score attention, the content values, and a feed-forward sub-layer).

In our proposal, along with the content information, two kinds of extra geometry information are fed into the self-attention layer, namely the intra- and inter-geometry relations. Thus, we have three attention maps for the final self-attention measurement.

As shown in Fig. 3, in each encoder layer three sets of keys and queries are used for the content, intra-geometry, and inter-geometry self-attention maps. The content self-attention map is the same as in standard self-attention,

ATT_1 = \text{Softmax}\left(\frac{Q_C K_C^{T}}{\sqrt{d}}\right),   (12)

Q_C = X_C^{n-1}W_C^{Q},   (13)

K_C = X_C^{n-1}W_C^{K},   (14)

where X_C^{n-1} is the output of the (n-1)-th encoder layer, and X_C^{0} equals the input content features at the first layer. W_C^{Q} and W_C^{K} are the projection matrices for queries and keys, d is the common dimension of all input features, and ATT_1 \in \mathbb{R}^{N_v \times N_v} is the content attention map. For the intra- and inter-geometry self-attention maps, V^{intra} and V^{inter} are used to generate the intra- and inter-geometry queries and keys:

ATT_2 = \text{Softmax}\left(\frac{Q_{G\text{-}intra}K_{G\text{-}intra}^{T}}{\sqrt{d}}\right),   (15)

ATT_3 = \text{Softmax}\left(\frac{Q_{G\text{-}inter}K_{G\text{-}inter}^{T}}{\sqrt{d}}\right),   (16)

Q_{G\text{-}intra} = X_G^{intra}W_{G\text{-}intra}^{Q},   (17)

Q_{G\text{-}inter} = X_G^{inter}W_{G\text{-}inter}^{Q},   (18)

K_{G\text{-}intra} = X_G^{intra}W_{G\text{-}intra}^{K},   (19)

K_{G\text{-}inter} = X_G^{inter}W_{G\text{-}inter}^{K}.   (20)

Fig. 4: Our encoding part is composed of four branches (SS branch, SV branch, VS branch, and VV branch), each a stack of N encoder layers. These encoders perform the input interaction with different visual-semantic ratios. The decoder is in charge of generating textual tokens based on the weighted contribution of each branch.

Through the intra- and inter-geometry query matrices W_{G\text{-}intra(inter)}^{Q} and key matrices W_{G\text{-}intra(inter)}^{K}, the intra-geometry input X_G^{intra} and the inter-geometry input X_G^{inter} generate the intra-geometry attention map ATT_2 \in \mathbb{R}^{N_v \times N_v} and the inter-geometry attention map ATT_3 \in \mathbb{R}^{N_v \times N_v}.

Lastly, to dynamically focus on different self-attention maps at different layers, the mean of X_C^{n-1} is fed into a softmax layer to generate three scores c_1, c_2, c_3, one for each self-attention map. The final attended values are then calculated as

X_C^{n} = \left(\sum_{i=1}^{3} ATT_i \times c_i\right)X_C^{n-1}W_C^{V},   (21)

[c_1, c_2, c_3] = \text{Softmax}\left(\text{mean}(X_C^{n-1}W^{o})\right),   (22)

where W_C^{V} and W^{o} are the projection matrices for the value update and for the attention-map score generation, respectively.
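A single-head, unbatched sketch of Eqs. (12)-(22) is given below; the actual module uses h = 8 heads and a feed-forward sub-layer, and the geometry feature dimension d_g is an assumption on our part.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryEntangledSelfAttention(nn.Module):
    # Sketch of Eqs. (12)-(22): three attention maps (content, intra-geometry,
    # inter-geometry) combined by the layer-wise scores c_1, c_2, c_3.
    def __init__(self, d=512, d_g=64):
        super().__init__()
        self.wq_c = nn.Linear(d, d, bias=False)
        self.wk_c = nn.Linear(d, d, bias=False)
        self.wv_c = nn.Linear(d, d, bias=False)
        self.wq_gi = nn.Linear(d_g, d, bias=False)  # intra-geometry queries
        self.wk_gi = nn.Linear(d_g, d, bias=False)  # intra-geometry keys
        self.wq_ge = nn.Linear(d_g, d, bias=False)  # inter-geometry queries
        self.wk_ge = nn.Linear(d_g, d, bias=False)  # inter-geometry keys
        self.w_score = nn.Linear(d, 3, bias=False)  # W^o of Eq. (22)

    @staticmethod
    def attn_map(q, k):
        return F.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)

    def forward(self, x_c, x_intra, x_inter):
        # x_c: (N_v, d) content features of one image; x_intra, x_inter: (N_v, d_g).
        att1 = self.attn_map(self.wq_c(x_c), self.wk_c(x_c))            # Eq. (12)
        att2 = self.attn_map(self.wq_gi(x_intra), self.wk_gi(x_intra))  # Eq. (15)
        att3 = self.attn_map(self.wq_ge(x_inter), self.wk_ge(x_inter))  # Eq. (16)
        c = F.softmax(self.w_score(x_c).mean(dim=0), dim=-1)            # Eq. (22)
        att = c[0] * att1 + c[1] * att2 + c[2] * att3
        return att @ self.wv_c(x_c)                                     # Eq. (21)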

E. Multi-Branches Encoder-Decoder

Multi-Branches Encoder. To bridge the modality gap, prior linguistic input is used to exploit semantic and visual information simultaneously. However, the interaction between semantic and visual information is too complex to be encoded by just a weighted summation. Thus, based on our fusion module, four encoders are built in parallel, as shown in Fig. 4: the VV (Pure Visual) branch, the VS (Semantic fused to Visual) branch, the SV (Visual fused to Semantic) branch, and the SS (Pure Semantic) branch. In this way, the model can not only mimic higher-order visual-semantic interactions, but can also produce a more fine-grained visual-semantic ratio. The input-output relation of each branch is defined as

X_{ss} = \text{Branch}_{ss}(S, X_B^{s}, G_{s,v}^{inter}, G_{s,s}^{intra}),   (23)

X_{sv} = \text{Branch}_{sv}(S, X_B^{s}, G_{s,v}^{inter}, G_{s,s}^{intra}),   (24)

X_{vs} = \text{Branch}_{vs}(V, X_B^{v}, G_{v,s}^{inter}, G_{v,v}^{intra}),   (25)


X_{vv} = \text{Branch}_{vv}(V, X_B^{v}, G_{v,s}^{inter}, G_{v,v}^{intra}),   (26)

where V and S are the visual feature vectors and the semantic dense-caption vectors, X_B^{v} and X_B^{s} are the bounding-box vectors of the corresponding visual and semantic features, and G_{s,v(v,s)}^{inter} and G_{v,v(s,s)}^{intra} are the inter-geometry and intra-geometry relation matrices of the corresponding branches. The four encoding branches therefore produce a multilevel output \chi = (X_{ss}, X_{sv}, X_{vs}, X_{vv}), obtained from the outputs of each branch.
Multi-Input Decoder. Conditioned on both the previously generated words and the encoder outputs, the decoder aims to generate the next token of the output caption. As in [30], we also have multiple parallel encoder outputs. Given an input sequence of vectors Y, the attention operator connects Y to all elements of \chi through cross-attention. The decoder structure is basically the same as in previous work; the only difference is that the multiple inputs in [30] come from different layers of the same encoder, whereas our inputs come from the last layers of different encoders. Formally, our attention operator is similarly defined as

\text{Att}(\chi, Y) = \sum_{i}\alpha_i \odot C(X_i, Y), \quad i \in \{ss, sv, vs, vv\},   (27)

\alpha_i = \theta(W_i[Y, C(X_i, Y)] + b_i),   (28)

C(X_i, Y) = \text{Attention}(W_Q Y, W_K X_i, W_V X_i),   (29)

where C(\cdot, \cdot) stands for the encoder-decoder cross-attention, computed using queries from the decoder and keys and values from the encoder, and [\cdot, \cdot] stands for concatenation. \alpha_i is a matrix of weights with the same size as the cross-attention result; the weights in \alpha_i modulate both the individual contribution of each branch and the relative importance between branches. W_i is a 2d \times d weight matrix and b_i is a learnable bias vector.
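A compact sketch of Eqs. (27)-(29) follows. Here the gating non-linearity \theta is taken to be a sigmoid, the attention is single-head and unbatched, and all names are illustrative; only the gating pattern (concatenate Y with the cross-attention result, project with W_i, modulate element-wise, and sum over branches) follows the equations.

import torch
import torch.nn as nn

class MultiBranchCrossAttention(nn.Module):
    # Sketch of Eqs. (27)-(29): cross-attend the decoder sequence Y to each of
    # the four encoder branches and modulate each result with a learned gate.
    def __init__(self, d=512, branches=("ss", "sv", "vs", "vv")):
        super().__init__()
        self.branches = branches
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        # One gating projection W_i (2d -> d) with bias b_i per branch, Eq. (28).
        self.gates = nn.ModuleDict({b: nn.Linear(2 * d, d) for b in branches})

    def cross(self, x_i, y):
        # Eq. (29): queries from the decoder, keys/values from branch i.
        scores = self.wq(y) @ self.wk(x_i).transpose(-2, -1) / y.size(-1) ** 0.5
        return scores.softmax(dim=-1) @ self.wv(x_i)

    def forward(self, chi, y):
        # chi: dict {"ss": X_ss, "sv": X_sv, "vs": X_vs, "vv": X_vv}; y: (n, d).
        out = torch.zeros_like(y)
        for b in self.branches:
            c = self.cross(chi[b], y)
            alpha = torch.sigmoid(self.gates[b](torch.cat([y, c], dim=-1)))  # Eq. (28)
            out = out + alpha * c                                            # Eq. (27)
        return out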

In the testing stage, the caption is generated word by word in a sequential manner. When generating the t-th word, the input features are represented as Y_{\le t} = [y_1, y_2, \cdots, y_{t-1}, 0, \cdots, 0] \in \mathbb{R}^{n \times d_y}. The input caption features, along with the image features, are fed through the model to obtain the word with the largest probability over the whole vocabulary. The predicted word is then appended to the inputs to recursively generate the new input Y_{\le t+1}. To improve the diversity of the generated captions, we also adopt a beam search strategy during the testing stage.
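The word-by-word generation loop can be sketched as below for the greedy case (the paper uses beam search with beam size 5, omitted here); the model interface, token ids, and maximum length are placeholders of ours.

import torch

def greedy_decode(model, image_feats, bos_id, eos_id, max_len=20):
    # Illustrative greedy decoding loop; `model` is assumed to return a
    # (1, t, vocab) logit tensor for the words generated so far.
    words = [bos_id]
    for _ in range(max_len):
        logits = model(image_feats, torch.tensor(words).unsqueeze(0))
        next_word = logits[0, -1].argmax().item()  # word with the largest probability
        words.append(next_word)
        if next_word == eos_id:
            break
    return words[1:]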

IV. EXPERIMENTS

In this section, we conduct experiments and evaluate the proposed GEVST models on the MSCOCO 2015 image captioning dataset [9], which is the most commonly used captioning dataset. Additionally, the Visual Genome dataset [48] is used to extract the bottom-up-attention visual features [18] and to train the Dense-Caption model [40].

A. Datasets

MSCOCO [9]. MSCOCO contains 164,062 images with 995,684 captions, which are divided into three parts: the training part (82,783 images), the validation part (40,504 images), and the test part (40,775 images, which are withheld on the MSCOCO server for online comparison). To evaluate the quality of the generated captions, we follow the "Karpathy split" evaluation strategy: 5,000 images are chosen for offline validation and another 5,000 images are selected for offline testing. We adopt the widely used evaluation metrics BLEU [7], METEOR [5], ROUGE-L [6], and CIDEr [7]; additionally, we use SPICE [4], which is closer to human evaluation criteria. For all indicators, higher is better.
Visual Genome [48]. Visual Genome is a large-scale dataset for evaluating the interactions between objects in images. For Dense-Caption training, similar to [40], we use 77,398 images for training and 5,000 images each for validation and test. We use the same evaluation metric of mean Average Precision (mAP) as [40], which measures localization and description accuracy jointly. For localization, IoU thresholds of .3, .4, .5, .6, and .7 are used; for language similarity, METEOR score thresholds of 0, .05, .1, .15, .2, and .25 are used.

B. Implementation Details

For the captions, we perform the following pre-processing. All caption sentences are converted to lower case and tokenized into words on white space. Rare words that occur fewer than 5 times are discarded, resulting in a vocabulary of 9,487 words. Out-of-vocabulary words are represented by the <UNK> token. The corresponding word embeddings are trained together with the model.
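A sketch of this pre-processing is shown below; the special tokens other than <UNK> (padding and sentence boundaries) are our own assumption.

from collections import Counter

def build_vocab(captions, min_count=5):
    # Lower-case, whitespace-tokenise, and drop words occurring fewer than
    # `min_count` times; out-of-vocabulary words map to <UNK>.
    counts = Counter(w for c in captions for w in c.lower().split())
    vocab = ["<PAD>", "<BOS>", "<EOS>", "<UNK>"] + \
            sorted(w for w, n in counts.items() if n >= min_count)
    word2id = {w: i for i, w in enumerate(vocab)}

    def encode(caption):
        return [word2id.get(w, word2id["<UNK>"]) for w in caption.lower().split()]

    return word2id, encode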

For the visual features, we use the pre-trained bottom-up-attention model to detect objects and extract visual features for the detected objects. Each original region feature is a 2,048-dimensional vector, which is transformed into an input visual feature of dimension d_o = 512. For the semantic features, the input dense captions are encoded by a basic transformer encoder with 3 encoding layers; for each dense caption, the corresponding feature is the mean of the encoder outputs. Each original dense-caption feature is a 1,024-dimensional vector, which is transformed into an input semantic feature of dimension d_o = 512.

The expansion rate E_r in the fusion module is 5. The number of Geometry-Entangled Fusion blocks m ranges over {1, 2, 3}. The latent dimensionality d_model in the Geometry-Entangled Self-Attention module is 512, the number of heads h is 8, and the latent dimensionality per head is d_h = d/h = 64. The number of layers L in the encoder and decoder ranges over {2, 3, 4, 5}.

The whole image captioning architecture is implemented in PyTorch and optimized with Adam [49]. In the training stage, we follow the training schedule in [47] to optimize the whole architecture with the cross-entropy loss. The warmup period is set to 4 epochs and the batch size is 40. For training with the self-critical strategy, we first select as initialization the model trained with the cross-entropy loss for 18 epochs; after that, the whole architecture is further optimized with the CIDEr reward for another 30 epochs.
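The training schedule of [47] is the warm-up learning-rate rule sketched below; treating the 4 warm-up epochs as a fixed step count is our simplification, since the exact number of steps depends on the dataset size and the batch size of 40.

def transformer_lr(step, d_model=512, warmup_steps=10000):
    # Learning-rate rule of the original Transformer [47]:
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
    # warmup_steps is a placeholder for the number of steps in 4 warm-up epochs.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)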



Fig. 5: Visualization of the results with different cell numbers. For each original image, GT is the ground truth and DC1 to DC4 are four dense captions used for visualization; each sub-image is shown under its dense captions. The number above each bar is the attention weight of the corresponding dense caption (DC1 to DC4), and m stands for the number of fusion cells.

TABLE I: Comparison results on MSCOCO with the Karpathy split. The highest score is marked in bold. Our model is noted as GEVST.

Model                 B-1   B-4   M     R     C      S
SCST (CVPR 2017)      -     34.2  26.7  57.7  114.0  -
Up-Down (CVPR 2018)   79.8  36.3  27.7  56.9  120.1  21.4
GCN-LSTM (ECCV 2018)  80.5  38.2  28.5  58.5  128.3  22.0
HAN [50] (AAAI 2019)  80.9  37.6  27.8  58.1  121.7  21.5
SGAE (CVPR 2019)      80.8  38.4  28.4  58.6  127.8  22.1
ORT (NIPS 2019)       80.5  38.6  28.7  58.4  127.8  22.1
AoA (ICCV 2019)       80.2  38.9  29.2  58.8  129.8  22.4
HIP (ICCV 2019)       -     39.1  28.9  59.2  130.6  22.3
SRT [51] (AAAI 2020)  80.3  38.5  28.7  58.4  129.1  22.4
M2 (CVPR 2020)        80.8  39.1  29.2  58.6  131.2  22.6
GEVST (ours)          81.1  39.3  29.1  58.7  131.6  22.8

At the inference stage, we adopt the beam search strategy and set the beam size to 5. Five evaluation metrics, BLEU@N, METEOR, ROUGE-L, CIDEr, and SPICE, are used simultaneously to evaluate our model.

C. Comparison with the State-of-the-Art

Offline Evaluation. As shown in Table I, we report the performance of our model and of the compared models on the offline test split. All scores are reported for single models. The compared models include: SCST [1], which employs reinforcement learning after XE training; Up-Down [18], which employs a two-LSTM-layer model with bottom-up features extracted by Faster-RCNN; GCN-LSTM [37], which builds graphs over the detected objects in an image based on their spatial and semantic connections; HAN [31], which constructs a feature pyramid leveraging patch features and a parallel Multivariate Residual Network to integrate features at different levels; SGAE [38], which introduces auto-encoding scene graphs into its model; and ORT [32], which uses bounding boxes to fortify the position-relation representation.

Further compared models are: AoA [29], which modifies the traditional attention model by adding an additional weight gate; HIP [50], which uses a tree-structured Long Short-Term Memory network to interpret the hierarchical structure and enhance features at three levels; SRT [51], which applies an image-text matching model to retrieve recalled words for each image together with two methods to utilize these recalled words; and M2 [30], which proposes a multi-layer encoder for image regions and a multi-layer decoder, with extra slots used to learn prior knowledge. For fair comparison, all the models are first trained under the XE loss and then optimized for the CIDEr-D score.
Online Evaluation. We also report the performance of our method on the online COCO test server, as shown in Table II. In this case, we still use the single models described previously, trained on the "Karpathy" training split. The evaluation is done on the COCO test split, for which ground-truth annotations are not publicly available, and our model is compared with the state-of-the-art approaches on the leaderboard. As can be seen, our model achieves the highest scores for most metrics. For fair comparison, we only list models that have online scores for a single model.

D. Qualitative Analysis

To better understand the effectiveness of the proposed approach and the explanations above, we visualize the dense-caption attention weights for different numbers of fusion cells in Fig. 5; dense-caption attention weights for different sub-images are shown in Fig. 6.
Number of Fusion Cells. Instead of a single cell, a stacked structure can fortify the effectiveness of the fusion, which improves the model's performance. As Fig. 5 shows, compared to m = 1, the model with m = 2 gives more reasonable attention weights. However, increasing the number of fusion cells does not always mean better performance: when we set m to 3, the attention weights become similar to each other, and the fusion model fails to single out the most important dense caption for the given sub-image.


Fig. 6: Visualization of the results for different sub-images. Under each original image, GT1 and GT2 are the ground-truth captions and GEVST is the caption generated by our model. For a specific sub-image, DC1 to DC4 are the four dense captions with the highest attention weights.

TABLE II: Leaderboard of the published state-of-the-art single-model methods on the online MS-COCO test server, where c5 and c40 denote using 5 and 40 references for testing, respectively. The highest score is denoted in bold. Our model is noted as GEVST.

Model                 | B-1 (c5/c40) | B-2 (c5/c40) | B-3 (c5/c40) | B-4 (c5/c40) | M (c5/c40)  | R (c5/c40)  | C (c5/c40)
SCST (CVPR 2017)      | 78.1 / 93.7  | 61.9 / 86.0  | 47.0 / 75.9  | 35.2 / 64.5  | 27.0 / 35.5 | 56.3 / 70.7 | 114.7 / 116.7
Stack-Cap (AAAI 2018) | 77.8 / 93.2  | 61.6 / 86.1  | 46.8 / 76.0  | 34.9 / 64.6  | 27.0 / 35.6 | 56.2 / 70.6 | 114.8 / 118.3
Up-Down (CVPR 2018)   | 80.2 / 95.2  | 64.1 / 88.8  | 49.1 / 79.4  | 36.9 / 68.5  | 27.6 / 36.7 | 57.1 / 72.4 | 117.9 / 120.5
CVAP (MM 2018)        | 80.1 / 94.9  | 64.7 / 88.8  | 50.0 / 79.7  | 37.9 / 69.0  | 28.1 / 37.0 | 58.2 / 73.1 | 121.6 / 123.8
HAN (AAAI 2019)       | 80.4 / 94.5  | 63.8 / 87.7  | 48.8 / 78.0  | 36.5 / 66.8  | 27.4 / 36.1 | 57.3 / 71.9 | 115.2 / 118.2
SGAE (CVPR 2019)      | 80.6 / 95.0  | 65.0 / 88.9  | 50.1 / 79.6  | 37.8 / 68.7  | 28.1 / 37.0 | 58.2 / 73.1 | 122.7 / 125.5
VSUA [52] (MM 2019)   | 79.9 / 94.7  | 64.3 / 88.6  | 49.5 / 79.3  | 37.4 / 68.3  | 28.2 / 37.1 | 57.9 / 72.8 | 123.1 / 125.5
GEVST (ours)          | 80.8 / 95.1  | 65.1 / 89.2  | 50.4 / 80.5  | 38.2 / 70.1  | 28.7 / 37.9 | 58.2 / 73.3 | 125.1 / 127.8

Effectiveness of Visual-Semantic Fusion. Fig. 6 shows several examples of fusion attention. From these examples, we find that our fusion model has three superior abilities. First, compared with the original image alone, the fusion model provides more diverse information: for the cat in the second row, for instance, the model supplies richer descriptions, whereas previous works fail to reach such a fine-grained level of description. Second, the model discriminates the importance of objects, since fewer related dense captions are provided for less important objects; comparing the cat and the building sub-images in the fourth row, we obtain more dense captions about the cat and fewer for the background buildings. Lastly, the model describes object relations more accurately: rather than reasoning from visual information alone, it obtains richer and more accurate relationships directly from the dense captions.

E. Ablation Studies

We run a number of ablation experiments on the MS-COCO image captioning dataset to explore the effectiveness of the GEVST model under different hyperparameters. The results are shown in the following tables and discussed in detail below.
Geometry-Content Fusion. Table III summarizes the ablation experiments on different fusion schemes for the Stack Geometry-Entangled Fusion.


TABLE III: Ablation studies for the fusion scheme: scores of different fusion schemes m-B, where m ∈ {1, 2, 3} is the number of fusion cells and B ∈ {C, G, CG} indicates which kind of base is used by the fusion cells (C: content only, G: geometry only, CG: both content and geometry).

Scheme | X-Entropy (30 epochs) B-4 / M / R / C | CIDEr Optimization B-4 / M / R / C
1-CG   | 36.6 / 28.3 / 57.2 / 118.0            | 37.3 / 28.8 / 58.4 / 128.6
2-CG   | 37.0 / 28.4 / 57.3 / 118.3            | 39.3 / 29.1 / 58.7 / 131.6
3-CG   | 36.8 / 28.2 / 57.3 / 117.7            | 38.6 / 29.0 / 58.3 / 129.7
2-C    | 36.9 / 28.3 / 57.2 / 118.1            | 38.7 / 29.0 / 57.3 / 127.1
2-G    | 36.3 / 28.1 / 56.9 / 116.2            | 37.6 / 28.7 / 56.6 / 124.3
2-CG   | 37.0 / 28.4 / 57.3 / 118.3            | 39.3 / 29.1 / 58.7 / 131.6

TABLE IV: Ablation studies for the number of Self-Attention layers: scores of models with different numbers of attention layers l ∈ {2, 3, 4, 5} (the encoder and decoder have the same number of layers).

l | X-Entropy (30 epochs) B-4 / M / R / C | CIDEr Optimization B-4 / M / R / C
2 | 37.0 / 28.2 / 57.0 / 118.2            | 38.5 / 28.4 / 57.5 / 128.1
3 | 37.0 / 28.4 / 57.3 / 118.3            | 39.3 / 29.1 / 58.7 / 131.6
4 | 36.8 / 28.2 / 57.3 / 117.2            | 38.6 / 28.7 / 58.0 / 129.6
5 | 36.1 / 27.9 / 56.6 / 114.1            | 38.3 / 28.2 / 57.0 / 128.0

The first parameter we vary is the number of fusion cells m, which we set from 1 to 3. The model's performance improves as m increases to 2: a single fusion cell can hardly produce a precise alignment attention between the two modalities, and an extra cell helps the model achieve a more precise alignment. However, with a third fusion cell the queries have already been fused twice by the previous cells, so the attention scores become too general to distinguish differences among the input features. The second parameter we investigate is the fusion base, which decides which kind of information guides the alignment between the two modalities' features (a toy sketch of the three bases is given at the end of this discussion). The model performs best when content and geometry information are used simultaneously. This is because the relationships of some objects are determined by geometry (e.g., "planes above the bridge"), whereas those of others are determined by content (e.g., "girl eats cake"); relying on a single kind of relation cannot handle all of these cases.
Number of Attention Layers. Table IV shows the performance of the GEVST model with different numbers of attention layers. As l increases, the performance gradually improves and then saturates. A deeper model can capture more complex relationships among objects and thus provides a more accurate understanding of the image content; at the same time, it has a larger representation capacity and a larger risk of overfitting the training set, so the performance degrades when l is set too large.
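As an illustration of the three fusion bases compared in Table III, the sketch below combines a content score and a geometry score before the softmax. The feature names and the simple additive combination are assumptions made for clarity, not a verbatim copy of our model.

```python
import torch

def fusion_attention(qc, kc, qg, kg, base: str = "CG"):
    """Toy attention scores for the C / G / CG fusion bases of Table III.

    qc, kc: content features  [B, Nq, d], [B, Nk, d]
    qg, kg: geometry features [B, Nq, dg], [B, Nk, dg]
    The additive combination of the two scores is an assumption for illustration.
    """
    content_score = qc @ kc.transpose(-1, -2) / qc.size(-1) ** 0.5   # [B, Nq, Nk]
    geometry_score = qg @ kg.transpose(-1, -2) / qg.size(-1) ** 0.5  # [B, Nq, Nk]
    if base == "C":        # content only
        logits = content_score
    elif base == "G":      # geometry only
        logits = geometry_score
    else:                  # "CG": both content and geometry guide the alignment
        logits = content_score + geometry_score
    return torch.softmax(logits, dim=-1)
```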

TABLE V: Ablation studies for Geometry-Entangled Self-Attention: scores of models with different Self-Attention modules. The basic module (Con) uses content information only; +X means that module X is added to the model of the previous row, with X ∈ {Intra Geometry (Intra), Inter Geometry (Inter)}.

Module | X-Entropy (30 epochs) B-4 / M / R / C | CIDEr Optimization B-4 / M / R / C
Con    | 36.1 / 28.1 / 56.8 / 113.2            | 37.9 / 28.4 / 56.4 / 128.6
+Intra | 37.1 / 28.4 / 57.2 / 117.8            | 38.5 / 28.7 / 57.7 / 129.7
+Inter | 37.0 / 28.4 / 57.3 / 118.3            | 39.3 / 29.1 / 58.7 / 131.6

TABLE VI: Ablation studies for the encoder branches: scores of models with different encoders. VV is the visual encoder, VS the visual-semantic encoder (semantic features fused into visual features), SV the semantic-visual encoder (visual features fused into semantic features), and SS the semantic encoder. +X means that encoder branch X is added to the model of the previous row.

Encoder | X-Entropy (30 epochs) B-4 / M / R / C | CIDEr Optimization B-4 / M / R / C
SS      | 31.3 / 22.3 / 51.7 / 101.5            | 35.2 / 25.9 / 54.2 / 116.9
SV      | 33.2 / 24.1 / 52.9 / 104.9            | 35.9 / 26.2 / 54.8 / 118.9
VS      | 36.9 / 28.1 / 57.0 / 117.1            | 38.4 / 28.6 / 57.4 / 128.3
VV      | 36.1 / 27.7 / 56.6 / 114.1            | 37.2 / 27.6 / 56.9 / 125.3
+VS     | 37.2 / 28.3 / 57.2 / 117.9            | 38.9 / 28.6 / 57.7 / 128.4
+SV     | 37.2 / 28.4 / 57.2 / 118.5            | 39.2 / 29.1 / 58.0 / 129.4
+SS     | 37.0 / 28.4 / 57.3 / 118.3            | 39.3 / 29.1 / 58.7 / 131.6

Geometry-Entangled Self-Attention. Next, we compare GEVST models with different Self-Attention modules in Table V. From the results we observe that: 1) adding more geometry information to the Self-Attention module steadily improves performance, highlighting the benefit of geometry information; and 2) the best model is trained with both intra- and inter-geometry information, showing that inter-geometry information is effective for refining the attention measurement. Under the cross-entropy loss the performance of +Intra is very close to +Inter, but when trained with reinforcement learning +Inter performs best. This can be explained by the fact that the reinforcement-learning-based self-critical loss explores the hypothesis space more diversely, avoiding overfitting and thus better exploiting the potential of more complex models.
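The intra-/inter-geometry variants of Table V can be thought of as adding geometry-derived bias terms to the usual scaled dot-product logits. The following sketch makes that idea concrete; the bias tensors are treated as given inputs, and their exact construction is deliberately left unspecified, as an assumption of this illustration only.

```python
import torch

def geometry_entangled_attention(q, k, v, intra_bias=None, inter_bias=None):
    """Scaled dot-product attention with optional geometry biases (cf. Table V).

    q, k, v: [B, N, d] content features.
    intra_bias / inter_bias: [B, N, N] geometry relation terms; how they are
    produced (e.g., from the box coordinates of regions and dense captions) is
    omitted here and should be read as an assumption of this sketch.
    """
    logits = q @ k.transpose(-1, -2) / q.size(-1) ** 0.5  # content term ("Con" row)
    if intra_bias is not None:                            # "+Intra" row
        logits = logits + intra_bias
    if inter_bias is not None:                            # "+Inter" row
        logits = logits + inter_bias
    attn = torch.softmax(logits, dim=-1)
    return attn @ v
```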


Multi-Branch Encoder. Table VI reports the performance of the GEVST model with different encoders. Comparing the single-branch models (rows 1 to 4), we see that: 1) models with fused input features outperform those using single-modality features, confirming that providing semantic information narrows the gap between the two modalities; and 2) visual-based models outperform semantic-based ones, because dense captions are homologous to the captions we want to generate, so errors in these dense captions are passed directly to the decoder and hurt performance. Moreover, comparing the models in rows 4 to 7, the performance improves steadily as more encoders are added, which confirms the effectiveness of every branch. When training with the cross-entropy loss, the best model is the one with three encoders (VV, VS, SV). We attribute this to the fact that larger models need more time to converge: since the comparison is limited to 30 epochs, we infer that the model with four encoders (VV, VS, SV, SS) could achieve better scores with longer training, and its capability is indeed demonstrated when training with CIDEr optimization.
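A schematic view of the four-branch configuration in the last row of Table VI is given below. The `make_encoder`, `make_fusion`, and `make_decoder` factories are placeholders standing in for the components described earlier, and the simple concatenation of the branch outputs is an assumption of this sketch rather than a statement of how the released model combines them.

```python
import torch
import torch.nn as nn

class MultiBranchGEVST(nn.Module):
    """Four parallel encoders (VV, VS, SV, SS) feeding one caption decoder.

    All submodules are hypothetical placeholders; concatenating the four encoder
    memories before decoding is an illustrative assumption.
    """
    def __init__(self, make_encoder, make_fusion, make_decoder):
        super().__init__()
        self.vv = make_encoder()     # pure visual branch
        self.vs = make_encoder()     # semantic fused into visual
        self.sv = make_encoder()     # visual fused into semantic
        self.ss = make_encoder()     # pure semantic branch
        self.fuse_v = make_fusion()  # injects dense-caption features into visual ones
        self.fuse_s = make_fusion()  # injects visual features into dense-caption ones
        self.decoder = make_decoder()

    def forward(self, visual, semantic, captions):
        memories = [
            self.vv(visual),
            self.vs(self.fuse_v(visual, semantic)),
            self.sv(self.fuse_s(semantic, visual)),
            self.ss(semantic),
        ]
        memory = torch.cat(memories, dim=1)  # [B, 4*N, d] under the concat assumption
        return self.decoder(captions, memory)
```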

V. CONCLUSION

In this paper, we present a novel Geometry-Entangled Visual Semantic Transformer (GEVST) framework for image captioning. GEVST consists of four parallel transformer encoders, VV (pure visual), VS (semantic fused into visual), SV (visual fused into semantic), and SS (pure semantic), for final caption generation. To take full advantage of dense-caption semantic features, both the content and the geometry relationships between visual and semantic features are considered in the fusion model and in the self-attention encoder. In particular, the geometry fusion attention not only refines the fusion attention measurement but is also responsible for generating inter-geometry features. We evaluate the proposed GEVST models quantitatively and qualitatively on the benchmark MS-COCO image captioning dataset, and extensive ablation studies are conducted to explain the sources of GEVST's effectiveness. Experimental results show that our method significantly outperforms existing approaches; as a single model, GEVST achieves the best performance on the real-time leaderboard of the MS-COCO image captioning challenge.

ACKNOWLEDGMENT

This work was supported in part by the National Natural Science Foundation of China under Grant 61602197, and in part by the Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, Huazhong University of Science and Technology, Wuhan, China.

REFERENCES

[1] S. J. Rennie, E. Marcheret, Y. Mroueh, et al., "Self-critical sequence training for image captioning," in CVPR, 2017.
[2] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, "Optimization of image description metrics using policy gradient methods," CoRR, abs/1612.00370, vol. 2, 2016.
[3] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," in NIPS, 2015.
[4] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "Spice: Semantic propositional image caption evaluation," in ECCV, 2016.
[5] S. Banerjee and A. Lavie, "Meteor: An automatic metric for mt evaluation with improved correlation with human judgments," in ACL Workshop, 2005.
[6] C.-Y. Lin, "Rouge: A package for automatic evaluation of summaries," Text Summarization Branches Out, 2004.
[7] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: a method for automatic evaluation of machine translation," in ACL, 2002.
[8] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, "Cider: Consensus-based image description evaluation," in CVPR, 2015.
[9] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick, "Microsoft coco captions: Data collection and evaluation server," arXiv preprint arXiv:1504.00325, 2015.
[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft coco: Common objects in context," in ECCV, 2014.
[11] D. Elliott and F. Keller, "Image description using visual dependency representations," in EMNLP, 2013, pp. 1292–1302.
[12] M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos, X. Han, A. Mensch, A. Berg, T. Berg, and H. Daume III, "Midge: Generating image descriptions from computer vision detections," in EACL, 2012, pp. 747–756.
[13] A. Karpathy, A. Joulin, and L. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," arXiv preprint arXiv:1406.5679, 2014.
[14] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in ECCV. Springer, 2010.
[15] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell, "Language models for image captioning: The quirks and what works," arXiv preprint arXiv:1505.01809, 2015.
[16] K. Xu, J. Ba, R. Kiros, et al., "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015.
[17] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in CVPR, 2016.
[18] P. Anderson, X. He, C. Buehler, et al., "Bottom-up and top-down attention for image captioning and visual question answering," in CVPR, 2018.
[19] C. Liu, F. Sun, C. Wang, et al., "Mat: A multimodal attentive translator for image captioning," arXiv preprint arXiv:1702.05658, 2017.
[20] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in CVPR, 2015.
[21] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Babytalk: Understanding and generating simple image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
[22] Y. Yang, C. Teo, H. Daume III, and Y. Aloimonos, "Corpus-guided sentence generation of natural images," in EMNLP, 2011.
[23] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in CVPR, 2015.
[24] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Explain images with multimodal recurrent neural networks," arXiv preprint arXiv:1410.1090, 2014.
[25] J. Donahue, L. Anne Hendricks, S. Guadarrama, et al., "Long-term recurrent convolutional networks for visual recognition and description," in CVPR, 2015.
[26] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in CVPR, 2017.
[27] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, "Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning," in CVPR, 2017.
[28] J. Gu, J. Cai, G. Wang, and T. Chen, "Stack-captioning: Coarse-to-fine learning for image captioning," in AAAI, 2018.
[29] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, "Attention on attention for image captioning," in ICCV, 2019.
[30] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, "Meshed-memory transformer for image captioning," in CVPR, 2020.
[31] W. Wang, Z. Chen, and H. Hu, "Hierarchical attention network for image captioning," in AAAI, vol. 33, no. 01, 2019.
[32] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, "Image captioning: Transforming objects into words," 2019.
[33] L. Zhou, Y. Zhang, Y.-G. Jiang, T. Zhang, and W. Fan, "Re-caption: Saliency-enhanced image captioning through two-phase learning," IEEE Transactions on Image Processing, vol. 29, pp. 694–709, 2019.
[34] H. Fang, S. Gupta, F. Iandola, et al., "From captions to visual concepts and back," in CVPR, 2015.
[35] N. Li and Z. Chen, "Image captioning with visual-semantic lstm," in IJCAI, 2018.
[36] J. Yu, J. Li, Z. Yu, and Q. Huang, "Multimodal transformer with multi-view visual representation for image captioning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4467–4480, 2019.
[37] T. Yao, Y. Pan, Y. Li, and T. Mei, "Exploring visual relationship for image captioning," in ECCV, 2018.
[38] X. Yang, K. Tang, H. Zhang, and J. Cai, "Auto-encoding scene graphs for image captioning," in CVPR, 2019.
[39] Y. Huang, J. Chen, W. Ouyang, W. Wan, and Y. Xue, "Image captioning with end-to-end attribute detection and subsequent attributes prediction," IEEE Transactions on Image Processing, vol. 29, pp. 4013–4026, 2020.
[40] L. Yang, K. Tang, J. Yang, and L.-J. Li, "Dense captioning with joint inference and visual context," in CVPR, 2017.
[41] J. Johnson, A. Karpathy, and L. Fei-Fei, "Densecap: Fully convolutional localization networks for dense captioning," in CVPR, 2016.
[42] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles, "Dense-captioning events in videos," in ICCV, 2017.
[43] R. Cadene, H. Ben-Younes, M. Cord, and N. Thome, "Murel: Multi-modal relational reasoning for visual question answering," in CVPR, 2019.
[44] Z. Yu, J. Yu, J. Fan, and D. Tao, "Multi-modal factorized bilinear pooling with co-attention learning for visual question answering," in ICCV, 2017.
[45] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, "Mutan: Multi-modal tucker fusion for visual question answering," in ICCV, 2017.
[46] Z. Yu, J. Yu, C. Xiang, J. Fan, and D. Tao, "Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12, pp. 5947–5959, 2018.
[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," arXiv preprint arXiv:1706.03762, 2017.
[48] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., "Visual genome: Connecting language and vision using crowdsourced dense image annotations," International Journal of Computer Vision, vol. 123, no. 1, pp. 32–73, 2017.
[49] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014.
[50] T. Yao, Y. Pan, Y. Li, and T. Mei, "Hierarchy parsing for image captioning," in ICCV, 2019.
[51] L. Wang, Z. Bai, Y. Zhang, and H. Lu, "Show, recall, and tell: Image captioning with recall mechanism," in AAAI, 2020.
[52] L. Guo, J. Liu, J. Tang, J. Li, W. Luo, and H. Lu, "Aligning linguistic words and visual semantic units for image captioning," in ACM MM, 2019.

LING CHENG received the master's degree from Fudan University, Shanghai, China, in 2019. He is currently a Ph.D. student at the School of Computing and Information Systems, Singapore Management University, Singapore. His research interests include computer vision and data mining.

WEI WEI received the Ph.D. degree from Huazhong University of Science and Technology, Wuhan, China, in 2012. He is currently an associate professor with the School of Computer Science and Technology, Huazhong University of Science and Technology. He was a research fellow with Nanyang Technological University, Singapore, and Singapore Management University, Singapore. His current research interests include computer vision, natural language processing, information retrieval, data mining, and social computing.

FEIDA ZHU received the B.S. degree in computer science from Fudan University, Shanghai, China, and the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign, Champaign, IL, USA. He is an Associate Professor with the School of Computing and Information Systems, Singapore Management University, Singapore. His current research interests include large-scale graph pattern mining and data mining, with applications to the Web, management information systems, and business intelligence.

YONG LIU is a Research Scientist at the Alibaba-NTU Singapore Joint Research Institute and the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), Nanyang Technological University, Singapore. Prior to that, he was a Data Scientist at NTUC Enterprise, Singapore, from November 2017 to July 2018, and a Research Scientist at the Data Analytics Department, Institute for Infocomm Research (I2R), A*STAR, Singapore, from November 2015 to October 2017. He received the Ph.D. degree from the School of Computer Science and Engineering at Nanyang Technological University in 2016 and the B.S. degree from the Department of Electronic Science and Technology at the University of Science and Technology of China in 2008. His research areas include various topics in machine learning and data mining. His research papers appear in leading international conferences and journals. He has been invited as a PC member of major conferences such as KDD, SIGIR, IJCAI, AAAI, CIKM, and ICDM, and as a reviewer for IEEE/ACM transactions. He is a member of ACM and AAAI.

CHUNYAN MIAO is the Chair of the School of Computer Science and Engineering at Nanyang Technological University (NTU), Singapore. Dr. Miao is a Full Professor in the Division of Information Systems and Director of the Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), School of Computer Engineering, Nanyang Technological University (NTU), Singapore. She received her Ph.D. degree from NTU and was a Postdoctoral Fellow/Instructor in the School of Computing, Simon Fraser University (SFU), Canada. She visited Harvard and MIT, USA, as a Tan Chin Tuan Fellow, collaborating on a large NSF-funded research program in social networks and virtual worlds. She has been an Adjunct Associate Professor/Associate Professor/Founding Faculty member with the Center for Digital Media, which is jointly managed by The University of British Columbia (UBC) and SFU. Her current research focuses on human-centered computational/artificial intelligence and interactive media for the young and the elderly. Since 2003, she has successfully led several national research projects with total funding of about 10 million dollars from both government agencies and industry, including NRF, MOE, A*STAR, Microsoft Research, and HP USA. She is the Editor-in-Chief of the International Journal of Information Technology published by the Singapore Computer Society.