
UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations*

Hao Wu 1,3,4,6,†,‡, Jiayuan Mao 5,6,†,‡, Yufeng Zhang 2,6,‡, Yuning Jiang 6, Lei Li 6, Weiwei Sun 1,3,4, Wei-Ying Ma 6

1 School of Computer Science, 2 School of Economics, Fudan University; 3 Systems and Shanghai Key Laboratory of Data Science, Fudan University; 4 Shanghai Institute of Intelligent Electronics & Systems; 5 ITCS, Institute for Interdisciplinary Information Sciences, Tsinghua University; 6 Bytedance AI Lab

{wuhao5688, zhangyf, wwsun}@fudan.edu.cn, [email protected],

{jiangyuning, lileilab, maweiying}@bytedance.com

Abstract

We propose Unified Visual-Semantic Embeddings (UniVSE) for learning a joint space of visual and textual concepts. The space unifies the concepts at different levels, including objects, attributes, relations, and full scenes. A contrastive learning approach is proposed for the fine-grained alignment from only image-caption pairs. Moreover, we present an effective approach for enforcing the coverage of semantic components that appear in the sentence. We demonstrate the robustness of Unified VSE in defending text-domain adversarial attacks on cross-modal retrieval tasks. Such robustness also empowers the use of visual cues to resolve word dependencies in novel sentences.

1 Introduction

We study the problem of establishing accurate and generalizable alignments between visual concepts and textual semantics efficiently, based upon rich but few, paired but noisy, or even biased visual-textual inputs (e.g., image-caption pairs). Consider the image-caption pair A shown in Fig. 1: "A white clock on the wall is above a wooden table". The alignments are formed at multiple levels: this short sentence can be decomposed into a rich set of semantic components (Abend et al., 2017): objects (clock, table and wall) and relations (clock above table, and clock on wall). These components are linked with different parts of the scene.

This motivates our work to introduce Unified Visual-Semantic Embeddings (Unified VSE for short).

* This paper is to be presented in the non-archival track of the NAACL 2019 Spatial Language Understanding (SpLU) & Grounded Communication for Robotics (RoboNLP) workshops only. The full version of this paper can be accessed at https://arxiv.org/abs/1904.05521v1, which has been published in the proceedings of CVPR 2019.

† indicates equal contribution. ‡ Work was done when HW, JM and YZ were intern researchers at the Bytedance AI Lab.


Figure 1: Two exemplar image-caption pairs. Humans are able to establish accurate and generalizable alignments between vision and language at different levels: objects, relations and full sentences. Pair A and B form a pair of contrastive examples for the concepts clock and basin.

Shown in Fig. 2, Unified VSE bridges visual and textual representations in a joint embedding space that unifies the embeddings for objects (noun phrases vs. visual objects), attributes (prenominal phrases vs. visual attributes), relations (verbs or prepositional phrases vs. visual relations) and scenes (sentence vs. image).

There are two major challenges in establishing such a factorized alignment. First, the link between the textual description of an object and the corresponding image region is ambiguous: a visual scene consists of multiple objects, and thus it is unclear to the learner which object should be aligned with the description. Second, it could be problematic to directly learn a neural network that combines the various semantic components in a caption and forms an encoding for the full sentence, with the training objective of maximizing the cross-modal retrieval performance on the training set (e.g., in (You et al., 2018; Ma et al., 2015; Shi et al., 2018)). As reported by (Shi et al., 2018), because of the inevitable bias in the dataset (e.g., two objects may co-occur with each other in most cases; see the table and the wall in Fig. 1 as an example), the learned sentence encoders usually pay attention to only part of the sentence.



Figure 2: We build a visual-semantic embedding space, which unifies the embeddings for objects, attributes, relations and full scenes.

As a result, they are vulnerable to text-domain adversarial attacks: adversarial captions constructed from original captions by adding small perturbations (e.g., by changing wall to shelf) can easily fool the model (Shi et al., 2018; Shekhar et al., 2017).

We resolve the aforementioned challenges through a natural combination of two ideas: cross-situational learning and the enforcement of semantic coverage that regularizes the encoder. Cross-situational learning, or learning from contrastive examples (Fazly et al., 2010), uses contrastive examples in the dataset to resolve the referential ambiguity of objects: looking at both Pair A and B in Fig. 1, we know that clock should refer to an object that occurs only in scene A but not in B. Meanwhile, to alleviate dataset biases such as object co-occurrence, we present an effective approach that enforces semantic coverage: the meaning of a caption is a composition of all semantic components in the sentence (Abend et al., 2017). Accordingly, the embedding of a caption should cover all semantic components, and changing any of them should affect the global caption embedding.

Conceptually and empirically, Unified VSE makes the following three contributions.

First, the explicit factorization of the visual-semantic embedding space enables us to build a fine-grained correspondence between visual and textual data, which further benefits a set of downstream visual-textual tasks. We achieve this through a contrastive example mining technique that uniformly applies to different semantic components, in contrast to the sentence- or image-level contrastive samples used by existing visual-semantic learning methods (You et al., 2018; Ma et al., 2015; Faghri et al., 2018).

Second, we propose a caption encoder that ensures coverage of all semantic components appearing in the sentence. We show that this regularization helps our model learn a robust semantic representation for captions and effectively defend against adversarial attacks in the text domain.

Furthermore, we show how our learned embeddings can provide visual cues to assist the parsing of novel sentences, including determining content-word dependencies and labelling semantic roles for certain verbs. In effect, our model builds reliable connections between vision and language from the given semantic cues and, in return, bootstraps the acquisition of language.

2 Related Work

Visual semantic embedding. Visual semantic embedding (Frome et al., 2013) is a common technique for learning a joint representation of vision and language. The embedding space empowers a set of cross-modal tasks such as image captioning (Vinyals et al., 2015; Xu et al., 2015; Donahue et al., 2015) and visual question answering (Antol et al., 2015; Xu and Saenko, 2016).

A fundamental technique proposed in (Frome et al., 2013) for aligning the two modalities is to use pairwise ranking to learn a distance metric from similar and dissimilar cross-modal pairs (Wang et al., 2016; Ren et al., 2016; Kiros et al., 2014a; Eisenschtat and Wolf, 2017; Liu et al., 2017; Kiros et al., 2014b). As a representative, VSE++ (Faghri et al., 2018) uses the online hard negative mining (OHEM) strategy (Shrivastava et al., 2016) for data sampling and shows a performance gain. VSE-C (Shi et al., 2018), based on VSE++, enhances the robustness of the learned visual-semantic embeddings by incorporating rule-generated textual adversarial samples as hard negatives during training. In this paper, we present a contrastive learning approach based on semantic components.

There are multiple VSE approaches that also use linguistically-aware techniques for sentence encoding and learning. Hierarchical multimodal LSTM (HM-LSTM) (Niu et al., 2017) and (Xiao et al., 2017), as two examples, both leverage the constituency parsing tree. Multimodal-CNN (m-CNN) (Ma et al., 2015) and CSE (You et al., 2018) apply convolutional neural networks to the caption and extract a hierarchical representation of sentences. Our model differs from them in two aspects. First, Unified VSE is built upon a factorized semantic space instead of syntactic knowledge.


Second, we employ a contrastive example mining approach that uniformly applies to different semantic components. It substantially improves the learned embeddings, while the related works use only sentence-level contrastive examples.

The learning of object-level alignment in Unified VSE is also related to (Karpathy and Fei-Fei, 2015; Karpathy et al., 2014; Ren et al., 2017), where the authors incorporate pre-trained object detectors for the semantic alignment. (Engilberge et al., 2018) propose a selective pooling technique for the aggregation of object features. Compared with them, Unified VSE presents a more general approach that embeds concepts of different levels, while still requiring no extra supervision.

Structured representation for vision and language. We connect visual and textual representations in a structured embedding space. The design of its structure is partially motivated by the papers on relational visual representations (scene graphs) (Lu et al., 2016; Johnson et al., 2015, 2018), where a scene is represented by a set of objects and their relations. Compared with them, our model does not rely on labelled graphs during training.

Researchers have designed various types of representations (Banarescu et al., 2013; Montague, 1970) as well as different models (Liang et al., 2013; Zettlemoyer and Collins, 2005) for translating natural language sentences into structured representations. In this paper, we show how incorporating such semantic parsing into visual-semantic embedding facilitates the learning of the embedding space. Moreover, we show how the learned VSE can, in return, help the parser resolve parsing ambiguities using visual cues.

3 Unified Visual-Semantic Embeddings

We now describe the overall architecture and training paradigm of the proposed Unified Visual-Semantic Embeddings. Given an image-caption pair, we first parse the caption into a structured meaning representation composed of a set of semantic components: object nouns, prenominal modifiers, and relational dependencies. We encode the different types of semantic components with type-specific encoders. A caption encoder combines the embeddings of the semantic components into a caption semantic embedding. Jointly, we encode images with a convolutional neural network (CNN) into the same, unified VSE space. The distance between the image embedding and the sentential embedding measures the semantic similarity between the image and the caption.

We employ a multi-task learning approach for the joint learning of the embeddings for semantic components (as the "basis" of the VSE space) as well as the caption encoder (as the combiner of semantic components).

3.1 Visual-Semantic Embedding: A Revisit

We begin this section with an introduction to the two-stream VSE approach. It jointly learns the embedding spaces of two modalities, vision and language, and aligns them using parallel image-text pairs (e.g., images and captions from the MS-COCO dataset (Lin et al., 2014)).

Let $v \in \mathbb{R}^d$ be the representation of the image and $u \in \mathbb{R}^d$ be the representation of a caption matching this image, both encoded by neural modules. To achieve the alignment, a bidirectional margin-based ranking loss has been widely applied (Faghri et al., 2018; You et al., 2018; Huang et al., 2017). Formally, for an image (caption) embedding $v$ ($u$), denote the embedding of its matched caption (image) as $u^+$ ($v^+$). A negative (unmatched) caption (image) is sampled, whose embedding is denoted as $u^-$ ($v^-$). We define the bidirectional ranking loss $\ell_{\text{sent}}$ between captions and images as:

$$\ell_{\text{sent}} = \sum_{u} F_{v^-}\!\left(\big|\delta + s(u, v^-) - s(u, v^+)\big|_+\right) + \sum_{v} F_{u^-}\!\left(\big|\delta + s(u^-, v) - s(u^+, v)\big|_+\right) \quad (1)$$

where $\delta$ is a predefined margin, $|x|_+ = \max(x, 0)$ is the traditional ranking loss, and $F_{x}(\cdot) = \max_{x}(\cdot)$ denotes the hard negative mining strategy (Faghri et al., 2018; Shrivastava et al., 2016). $s(\cdot, \cdot)$ is a similarity function between two embeddings and is usually implemented as cosine similarity (Faghri et al., 2018; Shi et al., 2018; You et al., 2018).
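For concreteness, the following is a minimal PyTorch sketch of Eq. (1) with in-batch hard negative mining; the batch-wise construction of negatives and all variable names are our assumptions rather than the authors' released implementation.

```python
import torch

def bidirectional_ranking_loss(im, cap, margin=0.2):
    """A sketch of Eq. (1): bidirectional margin-based ranking loss with
    hard negative mining, assuming the i-th image and i-th caption in the
    batch form a matched pair and all other in-batch pairs are negatives.

    im, cap: (B, d) L2-normalized image / caption embeddings.
    """
    scores = im @ cap.t()                     # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)           # s(u^+, v^+) for each pair

    # |delta + s(., negative) - s(., positive)|_+ in both directions.
    cost_cap = (margin + scores - pos).clamp(min=0)     # image -> captions
    cost_im = (margin + scores - pos.t()).clamp(min=0)  # caption -> images

    # Ignore the matched pairs on the diagonal.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)

    # F(.) = max(.): keep only the hardest negative for each query.
    return cost_cap.max(dim=1)[0].sum() + cost_im.max(dim=0)[0].sum()
```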

3.2 Semantic Encodings

The encoding of a caption is made up of three steps. As an example, consider the caption "A white clock on the wall is above a wooden table". 1) We extract a structured meaning representation as a collection of three types of semantic components: objects (clock, wall, table), attribute-object dependencies (white clock, wooden table) and relational dependencies (clock above table, clock on wall). 2) We encode each component, as well as the full sentence, with type-specific encoders into the unified VSE space.


3) We represent the embedding of the caption by combining the semantic components.

Semantic parsing. We implement a semantic parser of image captions based on (Schuster et al., 2015); it is available at https://github.com/vacancy/SceneGraphParser. Given the input sentence, the parser first performs a syntactic dependency parsing. A set of rules is then applied to the dependency tree to extract the object entities appearing in the sentence, the adjectives that modify the object nouns, the subjects/objects of the verbs, and prepositional phrases. For simplicity, we consider only single-word nouns for objects and single-word adjectives for object attributes.

Encoding objects and attributes. We use a unified object encoder $\phi$ for nouns and adjective-noun pairs. For each word $w$ in the vocabulary, we initialize a basic semantic embedding $w^{(\text{basic})} \in \mathbb{R}^{d_{\text{basic}}}$ and a modifier semantic embedding $w^{(\text{modif})} \in \mathbb{R}^{d_{\text{modif}}}$.

For a single noun word $w_n$ (e.g., clock), we define its embedding $w_n$ as $w^{(\text{basic})}_n \oplus w^{(\text{modif})}_n$, where $\oplus$ denotes the concatenation of vectors. For an (adjective, noun) pair $(w_a, w_n)$ (e.g., (white, clock)), its embedding $w_{a,n}$ is defined as $w^{(\text{basic})}_n \oplus w^{(\text{modif})}_a$, where $w^{(\text{modif})}_a$ encodes the attribute information. In our implementation, the basic semantic embedding is initialized from GloVe (Pennington et al., 2014). The modifier semantic embeddings (both $w^{(\text{modif})}_n$ and $w^{(\text{modif})}_a$) are randomly initialized and jointly learned; $w^{(\text{modif})}_n$ can be regarded as an intrinsic modifier for each noun.

To fuse the embeddings of basic and modifier semantics, we employ a gated fusion function:
$$\phi(w_n) = \mathrm{Norm}\big(\sigma(W_1 w_n + b_1) \odot \tanh(W_2 w_n + b_2)\big),$$
$$\phi(w_{a,n}) = \mathrm{Norm}\big(\sigma(W_1 w_{a,n} + b_1) \odot \tanh(W_2 w_{a,n} + b_2)\big).$$
Throughout the text, $\sigma$ denotes the sigmoid function $\sigma(x) = 1/(1+\exp(-x))$, and $\mathrm{Norm}$ denotes L2 normalization, i.e., $\mathrm{Norm}(w) = w/\|w\|_2$. One may interpret $\phi$ as a GRU cell (Chung et al., 2014) taking no historical state.

Encoding relations and the full sentence. Since relations and sentences are composed based on objects, we encode them with a neural combiner $\psi$, which takes the embeddings of word-level semantics encoded by $\phi$ as input. In practice, we implement $\psi$ as a uni-directional GRU (Chung et al., 2014) and take the L2-normalized last state as the output.
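As an illustration, here is one possible PyTorch reading of the object encoder $\phi$ and the combiner $\psi$ described above; the embedding dimensions and module layout are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectEncoder(nn.Module):
    """phi: gated fusion over a concatenated (basic ⊕ modifier) embedding."""
    def __init__(self, d_basic=300, d_modif=100, d=1024):
        super().__init__()
        self.gate = nn.Linear(d_basic + d_modif, d)   # sigma(W1 w + b1)
        self.feat = nn.Linear(d_basic + d_modif, d)   # tanh(W2 w + b2)

    def forward(self, w):                             # w: (..., d_basic + d_modif)
        fused = torch.sigmoid(self.gate(w)) * torch.tanh(self.feat(w))
        return F.normalize(fused, dim=-1)             # Norm(.) = L2 normalization

class NeuralCombiner(nn.Module):
    """psi: uni-directional GRU; the L2-normalized last state is the output."""
    def __init__(self, d=1024):
        super().__init__()
        self.gru = nn.GRU(d, d, batch_first=True)

    def forward(self, word_embeddings):               # (B, T, d), phi-encoded words
        _, h_last = self.gru(word_embeddings)         # h_last: (1, B, d)
        return F.normalize(h_last.squeeze(0), dim=-1)
```

With these modules, the relational and sentence embeddings of the next paragraphs would be obtained by stacking the $\phi$ outputs along the time dimension and feeding them to the combiner.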

To obtain a visual-semantic embedding for a relational triple $(w_s, w_r, w_o)$ (e.g., (clock, above, table)), we first extract the word embeddings of the subject, the relational word and the object using $\phi$. We then feed the encoded word embeddings in the same order into $\psi$ and take the L2-normalized last state of the GRU cell. Mathematically, $u_{\text{rel}} = \psi(w_s, w_r, w_o) = \psi(\{\phi(w_s), \phi(w_r), \phi(w_o)\})$.

The embedding of a sentence, $u_{\text{sent}}$, is computed over the word sequence $w_1, w_2, \cdots, w_k$ of the caption:
$$u_{\text{sent}} = \psi(\{\phi(w_1), \phi(w_2), \cdots, \phi(w_k)\}),$$
where for any word $x$, $\phi(w_x) = \phi(w^{(\text{basic})}_x \oplus w^{(\text{modif})}_x)$. Note that we share the weights of the encoders $\psi$ and $\phi$ across the encoding processes of all semantic levels. This allows the encoders of the various types of components to bootstrap each other's learning.

Combining all of the components. A straightforward implementation of the caption encoder is to directly use the sentence embedding $u_{\text{sent}}$, as it has already combined the semantics of the components in a contextually-weighted manner (Levy et al., 2018). However, it has been revealed in (Shi et al., 2018) that such a combination is vulnerable to adversarial attacks: because of the biases in the dataset, the combiner $\psi$ usually focuses on only a small subset of the semantic components appearing in the caption.

We alleviate such biases by enforcing coverage of the semantic components that appear in the sentence. Specifically, to form the caption embedding $u_{\text{cap}}$, the sentence embedding $u_{\text{sent}}$ is combined with an explicit bag-of-components embedding $u_{\text{comp}}$. Mathematically, we define $u_{\text{comp}}$ as an unweighted aggregation of all components in the sentence:
$$u_{\text{comp}} = \mathrm{Norm}\Big(\sum_{\text{obj}} u_{\text{obj}} + \sum_{\text{attr}} u_{\text{attr}} + \sum_{\text{rel}} u_{\text{rel}}\Big)$$
and encode the caption as $u_{\text{cap}} = \alpha u_{\text{sent}} + (1 - \alpha) u_{\text{comp}}$, where $0 \le \alpha \le 1$ is a scalar weight. The presence of $u_{\text{comp}}$ prevents any of the components from being ignored in the final caption embedding $u_{\text{cap}}$.
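A minimal sketch of this combination step, assuming the component embeddings have already been produced by the encoders above (the value α = 0.75 used here is the setting reported later in the experiments):

```python
import torch
import torch.nn.functional as F

def caption_embedding(u_sent, u_objs, u_attrs, u_rels, alpha=0.75):
    """u_cap = alpha * u_sent + (1 - alpha) * u_comp, where u_comp is the
    L2-normalized, unweighted sum of all component embeddings.

    u_sent: (d,); u_objs, u_attrs, u_rels: (N_obj, d), (N_attr, d), (N_rel, d).
    """
    components = torch.cat([u_objs, u_attrs, u_rels], dim=0)   # (N, d)
    u_comp = F.normalize(components.sum(dim=0), dim=-1)        # bag of components
    return alpha * u_sent + (1.0 - alpha) * u_comp
```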

3.3 Image Encodings

We use a CNN to encode the input RGB image into the unified VSE space. Specifically, we choose a ResNet-152 model (He et al., 2016) pretrained on ImageNet (Russakovsky et al., 2015) as the image encoder. We apply a layer of $1 \times 1$ convolution on top of the last convolutional layer (i.e., conv5_3) and obtain a convolutional feature map of shape $7 \times 7 \times d$ for each image, where $d$ denotes the dimension of the unified VSE space.


The feature map, denoted as $V \in \mathbb{R}^{7 \times 7 \times d}$, can be viewed as the embeddings of the $7 \times 7$ local regions of the image. The embedding $v$ of the whole image is defined as the aggregation of the embeddings at all regions through a global spatial pooling operator.
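The image branch can be sketched as follows; the torchvision backbone slicing and the choice of average pooling for the global aggregation are our assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet152

class ImageEncoder(nn.Module):
    """ResNet-152 features, a 1x1 convolution into the d-dim VSE space,
    and global pooling over the 7x7 grid of local regions."""
    def __init__(self, d=1024):
        super().__init__()
        backbone = resnet152(pretrained=True)
        # Drop the average-pooling and classification layers, keeping conv5.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d, kernel_size=1)    # 1x1 space projection

    def forward(self, images):                   # images: (B, 3, 224, 224)
        feat = self.cnn(images)                  # (B, 2048, 7, 7)
        V = F.normalize(self.proj(feat), dim=1)  # local region embeddings
        v = F.normalize(V.mean(dim=(2, 3)), dim=1)  # global image embedding
        return V, v
```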

3.4 Learning Paradigm

In this section, we present how to align vision and language in the unified space using contrastive learning at different semantic levels. We start from the generation of contrastive examples for the different semantic components.

Negative example sampling. It has been discussed in (Shi et al., 2018) that, to explore a large compositional space of semantics, directly sampling negative captions from a human-built dataset (e.g., MS-COCO captions) is not sufficient. In this paper, instead of manually defining rules that augment the training data as in (Shi et al., 2018), we address this problem by sampling contrastive negative examples in the explicitly factorized semantic space. The generation does not require manually labelled data and can easily be applied to any dataset. For a specific caption, we generate the following four types of contrastive negative samples (a sketch of this sampling procedure is given below).

• Nouns. We sample negative noun words from all nouns that do not appear in the caption (for the MS-COCO dataset, in any of the 5 captions associated with the same image; this also applies to the other components).
• Attribute-noun pairs. We sample negative pairs by randomly substituting the adjective with another adjective, or by substituting the noun.
• Relational triples. We sample negative triples by randomly substituting the subject, the relation, or the object. Moreover, we also sample whole relational triples from captions that describe other images in the dataset as negative triples.
• Sentences. We sample negative sentences from the whole dataset. Meanwhile, following (Frome et al., 2013; Faghri et al., 2018), we also sample negative images from the whole dataset as contrastive images.

The key motivation behind our visual-semantic alignment is that an object appears in a local region of the image, while the aggregation of all local regions should be aligned with the full semantics of a caption.
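The sampling procedure referenced above could look like the following sketch; the dictionary layout of the parsed components and the vocabulary arguments are assumptions.

```python
import random

def sample_negative_components(components, noun_vocab, adj_vocab, rel_vocab):
    """Build one contrastive negative per component type for a single caption.

    `components` is assumed to look like
    {'nouns': ['clock', 'wall'], 'attrs': [('white', 'clock')],
     'rels': [('clock', 'on', 'wall')]}; collision checks are kept minimal.
    """
    nouns = set(components['nouns'])

    # Nouns: any noun that does not appear in the caption (for MS-COCO, in
    # none of the 5 captions associated with the same image).
    neg_noun = random.choice([n for n in noun_vocab if n not in nouns])

    # Attribute-noun pairs: substitute either the adjective or the noun.
    adj, noun = random.choice(components['attrs'])
    if random.random() < 0.5:
        neg_attr = (random.choice(adj_vocab), noun)
    else:
        neg_attr = (adj, neg_noun)

    # Relational triples: substitute the subject, the relation, or the object.
    subj, rel, obj = random.choice(components['rels'])
    neg_rel = [subj, rel, obj]
    slot = random.randrange(3)
    neg_rel[slot] = random.choice(rel_vocab) if slot == 1 else neg_noun

    return neg_noun, neg_attr, tuple(neg_rel)
```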


Figure 3: An illustration of our relevance-weighted alignment mechanism. The relevance map shows the similarity of each region with the object embedding u<clock>. We weight the alignment loss with the map to reinforce the correspondence between u<clock> and its matched region.

Local region-level alignment. In detail, we propose a relevance-weighted alignment mechanism for linking textual object descriptors and local image regions. As shown in Fig. 3, consider the embedding of a positive textual object descriptor $u^+_o$, the embedding of a negative textual object descriptor $u^-_o$, and the set of image local region embeddings $V_i$, $i \in 7 \times 7$, extracted from the image. We generate a relevance map $M \in \mathbb{R}^{7 \times 7}$, with $M_i$, $i \in 7 \times 7$, representing the relevance between $u^+_o$ and $V_i$, computed as in Eq. (2). We compute the loss for nouns and (adjective, noun) pairs by Eq. (3):

$$M_i = \frac{\exp\!\big(s(u^+_o, V_i)\big)}{\sum_j \exp\!\big(s(u^+_o, V_j)\big)} \quad (2)$$

$$\ell_{\text{obj}} = \sum_{i \in 7 \times 7} \Big( M_i \cdot \big|\delta + s(u^-_o, V_i) - s(u^+_o, V_i)\big|_+ \Big) \quad (3)$$

The intuition behind this definition is that we explicitly try to align the embedding at each image region with $u^+_o$. The losses are weighted by the matching score, thus reinforcing the correspondence between $u^+_o$ and the matched region. This technique is related to multi-instance learning (Wu et al., 2015).

Global image-level alignment. For relational triples $u_{\text{rel}}$, semantic component aggregations $u_{\text{comp}}$ and sentences $u_{\text{sent}}$, the semantics usually cover multiple objects. Thus, we align them with the full image embedding $v$ via bidirectional ranking losses as in Eq. (1) (only textual negative samples are used for $\ell_{\text{rel}}$). The alignment losses are denoted as $\ell_{\text{rel}}$, $\ell_{\text{comp}}$ and $\ell_{\text{sent}}$, respectively.
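A PyTorch sketch of Eqs. (2)–(3) for a single object descriptor; tensor shapes are assumptions and the 7×7 grid is flattened to 49 regions.

```python
import torch

def object_alignment_loss(u_pos, u_neg, V, margin=0.2):
    """Relevance-weighted ranking loss of Eqs. (2)-(3).

    u_pos, u_neg: (d,) L2-normalized positive / negative object descriptors.
    V: (49, d) L2-normalized embeddings of the 7x7 local regions.
    """
    s_pos = V @ u_pos                    # s(u+_o, V_i) for every region
    s_neg = V @ u_neg                    # s(u-_o, V_i)

    M = torch.softmax(s_pos, dim=0)      # Eq. (2): relevance map over regions
    per_region = (margin + s_neg - s_pos).clamp(min=0)
    return (M * per_region).sum()        # Eq. (3): relevance-weighted sum
```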

We want to highlight that, during training, we separately align the two types of semantic representations of the caption, i.e., $u_{\text{sent}}$ and $u_{\text{comp}}$, with the image. This differs from the inference-time computation of the caption embedding.


Model | Object attack R@1/R@5/R@10 (rsum) | Attribute attack R@1/R@5/R@10 (rsum) | Relation attack R@1/R@5/R@10 (rsum) | Total sum
VSE++ | 32.3/69.6/81.4 (183.3) | 19.8/59.4/76.0 (155.2) | 26.1/66.8/78.7 (171.6) | 510.1
VSE-C | 41.1/76.0/85.6 (202.7) | 26.7/61.0/74.3 (162.0) | 35.5/71.1/81.5 (188.1) | 552.8
UniVSE (u_sent + u_comp) | 45.3/78.3/87.3 (210.9) | 35.3/71.5/83.1 (189.9) | 39.0/76.5/86.7 (202.2) | 603.0
UniVSE (u_sent) | 40.7/76.4/85.5 (202.6) | 30.0/70.5/80.6 (181.1) | 32.6/72.6/83.5 (188.7) | 572.4
UniVSE (u_sent + u_obj) | 42.9/77.2/85.6 (205.7) | 30.1/69.0/79.8 (178.9) | 34.0/71.2/83.6 (188.8) | 573.4
UniVSE (u_sent + u_attr) | 40.1/73.9/83.3 (197.3) | 37.4/72.0/81.9 (191.3) | 30.5/70.0/81.9 (182.4) | 571.0
UniVSE (u_sent + u_rel) | 45.4/77.1/85.5 (208.0) | 29.2/68.1/78.5 (175.8) | 42.8/77.5/85.6 (205.9) | 589.7

Table 1: Results on the image-to-sentence retrieval task with text-domain adversarial attacks. For each caption, we generate 5 adversarial fake captions which do not match the image. Thus, the models need to retrieve the 5 positive captions from 30,000 candidate captions.

Recall that $\alpha$ can be viewed as a factor that balances the training objective and the enforcement of semantic coverage. This allows us to flexibly adjust $\alpha$ during inference.

4 Experiments

We evaluate our model on the MS-COCO (Lin et al., 2014) dataset. It contains 82,783 training images, with each image annotated by 5 captions. We use the common 1K validation and test split from (Karpathy and Fei-Fei, 2015).

We first validate the effectiveness of enforcing the semantic coverage of caption embeddings by comparing models on cross-modal retrieval tasks with adversarial examples. We then propose a unified text-to-image retrieval task to support the contrastive learning of various semantic components. We end this section with an application of using visual cues to facilitate the semantic parsing of novel sentences. We include two baselines, VSE++ (Faghri et al., 2018) and VSE-C (Shi et al., 2018), for comparison.

4.1 Retrieval under Text-Domain Adversarial Attack

Recent works (Shi et al., 2018; Shekhar et al., 2017) have raised concerns about the robustness of the learned visual-semantic embeddings. They show that existing models are vulnerable to text-domain adversarial attacks (i.e., adversarial captions) and can be easily fooled. This is closely related to the bias of small datasets over a large, compositional semantic space (Shi et al., 2018). To prove the robustness of the learned unified VSE, we further conduct experiments on the image-to-sentence retrieval task with text-domain adversarial attacks. Following (Shi et al., 2018), we first design several types of adversarial captions by adding perturbations to existing captions (a sketch of the first attack type is given after the list).

1. Object attack: Randomly replace an object noun in the original caption with an irrelevant one, or append an irrelevant object noun.

2. Attribute attack: Randomly replace or add an irrelevant attribute modifier for one object in the original caption.

3. Relational attack: 1) Randomly replace the subject/relation/object word with an irrelevant one. 2) Randomly select an entity as a subject/object and add an irrelevant relational word and object/subject.
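As an illustration of the first attack type, a minimal sketch that perturbs a caption by swapping one object noun for an irrelevant one (the token-level replacement and the vocabulary argument are our simplifications):

```python
import random

def object_attack(caption, caption_nouns, noun_vocab):
    """Object attack: replace one object noun in the caption with an
    irrelevant one; `caption_nouns` are the nouns found by the parser."""
    target = random.choice(caption_nouns)
    irrelevant = random.choice([n for n in noun_vocab if n not in caption_nouns])
    tokens = [irrelevant if tok == target else tok for tok in caption.split()]
    return " ".join(tokens)

# e.g. object_attack("a white clock on the wall is above a wooden table",
#                    ["clock", "wall", "table"], noun_vocab)
# might yield "a white clock on the shelf is above a wooden table"
```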

The results are shown in Table 1, where different columns represent different types of attacks. VSE++ performs worst, as it is only optimized for the retrieval performance on the dataset; its sentence encoder is insensitive to small perturbations in the text. VSE-C explicitly generates adversarial captions based on human-designed rules as hard negative examples during training, which makes it relatively robust to these adversarial attacks. Unified VSE shows strong robustness across all types of adversarial attacks and outperforms all baselines.

The ability of Unified VSE to defend against adversarial texts comes almost for free: we present zero adversarial captions during training. Unified VSE builds fine-grained semantic alignments via the contrastive learning of semantic components, and it uses the explicit aggregation of the components, $u_{\text{comp}}$, to alleviate the dataset biases.

Ablation study: semantic components. We now delve into the effectiveness of the different semantic components by choosing different combinations of components for the caption embedding. As shown in Table 1, we use different subsets of the semantic components to form the bag-of-components embedding $u_{\text{comp}}$. For example, in UniVSE ($u_{\text{sent}}$ + $u_{\text{obj}}$), only object nouns are selected and aggregated as $u_{\text{comp}}$.

The results demonstrate the effectiveness of enforcing semantic coverage: even if the semantic components have fine-grained alignment with visual concepts, directly using $u_{\text{sent}}$ as the caption encoding still degrades the robustness against adversarial examples.


Figure 4: The R@1 performance of UniVSE on cross-modal retrieval tasks with different combination weights α: (a) normal cross-modal retrieval (5,000 captions); (b) image-to-sentence retrieval under adversarial attack (30,000 candidate captions; object, attribute and relation attacks). Our model can effectively defend adversarial attacks, with no sacrifice of performance on the other tasks, by choosing a reasonable α (we thus set α = 0.75 in all other experiments).

Model \ Query type | obj | attr | rel | obj (det) | sum
VSE++ | 29.95 | 26.64 | 27.54 | 50.57 | 134.70
VSE-C | 27.48 | 28.76 | 26.55 | 46.20 | 128.99
UniVSE_all | 39.49 | 33.43 | 39.13 | 58.37 | 170.42
UniVSE_obj | 39.71 | 33.37 | 34.38 | 56.84 | 164.30
UniVSE_attr | 31.31 | 37.51 | 34.73 | 52.26 | 155.81
UniVSE_rel | 37.55 | 32.70 | 39.57 | 59.12 | 168.94

Table 2: The mAP performance on the unified text-to-image retrieval task. Please refer to the text for details.

Consistent with the intuition, enforcing the coverage of a certain type of component (e.g., objects) helps the model defend against adversarial attacks of the same type (e.g., attacks on nouns). Combining all components leads to the best performance.

Choice of the combination factor α. We study the choice of α by conducting experiments on both the normal retrieval tasks and the adversarial one. Fig. 4 shows the R@1 performance under the normal/adversarial retrieval scenarios w.r.t. different choices of α. We observe that the $u_{\text{comp}}$ term contributes little to the normal cross-modal retrieval tasks but substantially to the tasks with adversarial attacks. Recall that α can be viewed as a factor that balances the training objective and the enforcement of semantic coverage. By choosing α from a reasonable region (e.g., from 0.6 to 0.8), our model can effectively defend adversarial attacks with no sacrifice of the overall performance.

4.2 Unified Text-to-Image Retrieval

We extend the word-to-scene retrieval used by (Shi et al., 2018) into a general unified text-to-image retrieval task. In this task, models receive queries of different semantic levels, including single words (e.g., "Clock."), noun phrases (e.g., "White clock."), relational phrases (e.g., "Clocks on wall") and full sentences. For all baselines, the texts of the different types are treated as full sentences. The results are presented in Table 2.

We generate positive image-text pairs by randomly choosing an image and a semantic component from the 5 captions matched with the chosen image. It is worth mentioning that the semantic components extracted from captions may not cover all visual concepts in the corresponding image, which makes the annotation noisy. To address this, we also leverage the MS-COCO detection annotations to facilitate the evaluation (see the obj (det) column): we treat the labels of detection bounding boxes as the annotation of objects in the scene.

Ablation study: contrastive learning of components. We evaluate the effectiveness of using contrastive samples for the different semantic components. As shown in Table 2, UniVSE_obj denotes the model trained with only contrastive samples of noun components; the same notation applies to the other models. A UniVSE trained with a certain type of contrastive examples (e.g., UniVSE_obj with contrastive nouns) consistently improves the retrieval performance for the same type of queries (e.g., retrieving images from a single noun). The UniVSE trained with all kinds of contrastive samples performs best overall and shows a significant gap w.r.t. the other baselines.

Visualization of the semantic alignment. We visualize the semantic-relevance map on an image w.r.t. a given query $u_q$ for a qualitative evaluation of the alignment performance for the various semantic components. The map $M_i$ is computed as the similarity between each image region $V_i$ and $u_q$, in a similar way as Eq. (2). As shown in Fig. 5, this visualization helps to verify that our model successfully aligns different semantic components with the corresponding image regions.

4.3 Semantic Parsing with Visual Cues

As a side application, we show how the learned unified VSE space can provide visual cues to help the semantic parsing of sentences. Fig. 6 shows the general idea. When parsing a sentence, ambiguity may occur; e.g., the subject of the relational word eat may be sweater or burger. It is not easy for a textual parser to decide which one is correct because of the innate syntactic ambiguity. However, we can use the image described by this sentence to assist the parsing. This is related to previous works on using image segmentation models to facilitate sentence parsing (Christie et al., 2016).



Figure 5: The relevance maps and grounded areas obtained from the retrieved images w.r.t. three queries. The temperature of the softmax for visualizing the relevance maps is τ = 0.1. Pixels in white indicate a higher matching score. Note that the third image for the query "black dog" contains two dogs, while our model successfully locates the black one (on the left). It also succeeds in finding the white dog in the first image for "white dog". Moreover, for the query "player swing bat", although there are many players in the image, our model attends only to the man swinging the bat.


Figure 6: An example showing that our model can leverage the image to assist semantic parsing when there is ambiguity in the sentence. The matching score of "girl eat burger" is much higher than that of "sweater eat burger", which helps to eliminate the ambiguity. Note that the other components in the scene graph are also correctly inferred by our model.

This motivates us to design two tasks: 1) recovering the dependencies between attributes and entities, and 2) recovering the relational triples. In detail, we first extract the entities, attributes and relational words from the raw sentence without knowing their dependencies. For each possible combination forming a certain semantic component, our model computes its embedding in the unified joint space. E.g., in Fig. 6, there are in total $3 \times (3-1) = 6$ possible dependencies for eat. We choose the combination with the highest matching score with the image to decide the subject/object dependencies of the relation eat. We use the parsed semantic components as the ground truth and report the accuracy, defined as the fraction of correctly resolved dependencies over the total number of attributes/relations.
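A sketch of this visually guided disambiguation: every candidate (subject, object) assignment for a relational word is scored against the image, and the best one is kept. The encode_triple interface stands in for the ψ∘φ encoder sketched in Section 3.2 and is an assumption.

```python
from itertools import permutations
import torch

def resolve_relation(rel_word, entity_words, encode_triple, v):
    """Pick the (subject, object) pair for `rel_word` whose relational-phrase
    embedding best matches the global image embedding v.

    encode_triple(s, r, o) -> (d,) unit-norm embedding, i.e. psi(phi(s), phi(r), phi(o)).
    """
    best, best_score = None, float("-inf")
    for subj, obj in permutations(entity_words, 2):   # e.g. 3 * (3 - 1) = 6 candidates
        u_rel = encode_triple(subj, rel_word, obj)
        score = torch.dot(u_rel, v).item()            # cosine similarity (unit norm)
        if score > best_score:
            best, best_score = (subj, rel_word, obj), score
    return best, best_score

# resolve_relation("eat", ["girl", "sweater", "burger"], encode_triple, v)
# should prefer ("girl", "eat", "burger") over ("sweater", "eat", "burger").
```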

Table 3 reports the results on assisting semantic parsing with visual cues, compared with other baselines. Fig. 6 shows a real case in which we successfully resolve the textual ambiguity.

Task      attributed object    relational phrase
Random         37.41                31.90
VSE++          41.12                43.31
VSE-C          43.44                41.08
UniVSE         64.82                62.69

Table 3: The accuracy of different models on recovering word dependencies with visual cues. In the "Random" baseline, we randomly assign the word dependencies.

5 Conclusion

We present a unified visual-semantic embedding approach that learns a joint representation space of vision and language in a factorized manner: different levels of textual semantic components, such as objects and relations, are aligned with regions of images. A contrastive learning approach for semantic components is proposed for the efficient learning of the fine-grained alignment. We also introduce the enforcement of semantic coverage: each caption embedding should cover all semantic components in the sentence. Unified VSE shows superior performance on multiple cross-modal retrieval tasks and can effectively defend against text-domain adversarial attacks. We hope the proposed approach can empower machines that learn vision and language jointly, efficiently and robustly.


References

Omri Abend, Tom Kwiatkowski, Nathaniel J Smith, Sharon Goldwater, and Mark Steedman. 2017. Bootstrapping language acquisition. Cognition, 164:116–143.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for Sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.

Gordon Christie, Ankit Laddha, Aishwarya Agrawal, Stanislaw Antol, Yash Goyal, Kevin Kochersberger, and Dhruv Batra. 2016. Resolving language and vision ambiguities together: Joint segmentation & prepositional attachment resolution in captioned scenes. arXiv preprint arXiv:1604.02125.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634.

Aviv Eisenschtat and Lior Wolf. 2017. Linking Image and Text with 2-way Nets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1855–1865.

Martin Engilberge, Louis Chevallier, Patrick Perez, and Matthieu Cord. 2018. Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3984–3993.

Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives.

Afsaneh Fazly, Afra Alishahi, and Suzanne Stevenson. 2010. A probabilistic computational model of cross-situational word learning. Cognitive Science, 34(6):1017–1063.

Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems (NIPS), pages 2121–2129.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Yan Huang, Wei Wang, and Liang Wang. 2017. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7254–7262.

Justin Johnson, Agrim Gupta, and Li Fei-Fei. 2018. Image Generation from Scene Graphs.

Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2015. Image Retrieval using Scene Graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3668–3678.

Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128–3137.

Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. 2014. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Conference on Neural Information Processing Systems (NIPS), pages 1889–1897.

Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel. 2014a. Multimodal Neural Language Models. In International Conference on Machine Learning (ICML), pages 595–603.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014b. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

Omer Levy, Kenton Lee, Nicholas FitzGerald, and Luke Zettlemoyer. 2018. Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum. arXiv preprint arXiv:1805.03716.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning Dependency-Based Compositional Semantics. Computational Linguistics, 39(2):389–446.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), pages 740–755.

Yu Liu, Yanming Guo, Erwin M Bakker, and Michael S Lew. 2017. Learning a Recurrent Residual Fusion Network for Multimodal Matching. In IEEE International Conference on Computer Vision (ICCV), pages 4127–4136.

Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. 2016. Visual relationship detection with language priors. In European Conference on Computer Vision (ECCV), pages 852–869. Springer.


Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li. 2015. Multimodal Convolutional Neural Networks for Matching Image and Sentence. In IEEE International Conference on Computer Vision (ICCV), pages 2623–2631.

Richard Montague. 1970. Universal Grammar. Theoria, 36(3):373–398.

Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. 2017. Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding. In IEEE International Conference on Computer Vision (ICCV), pages 1899–1907.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. 2016. Joint Image-Text Representation by Gaussian Visual-Semantic Embedding. In ACM Multimedia (ACM-MM), pages 207–211.

Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. 2017. Multiple Instance Visual-Semantic Embedding. In British Machine Vision Conference (BMVC).

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.

Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D. Manning. 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Workshop on Vision and Language (VL15), Lisbon, Portugal. Association for Computational Linguistics.

Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. FOIL it! Find One Mismatch between Image and Language Caption. arXiv preprint arXiv:1705.01359.

Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, and Jian Sun. 2018. Learning Visually-Grounded Semantics from Contrastive Adversarial Samples. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 3715–3727.

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. 2016. Training Region-Based Object Detectors with Online Hard Example Mining. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5253–5262.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and Tell: A Neural Image Caption Generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning Deep Structure-Preserving Image-Text Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5005–5013.

Jiajun Wu, Yinan Yu, Chang Huang, and Kai Yu. 2015. Deep multiple instance learning for image classification and auto-annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Fanyi Xiao, Leonid Sigal, and Yong Jae Lee. 2017. Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5945–5954.

Huijuan Xu and Kate Saenko. 2016. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In European Conference on Computer Vision (ECCV), pages 451–466. Springer.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning (ICML), pages 2048–2057.

Quanzeng You, Zhengyou Zhang, and Jiebo Luo. 2018. End-to-End Convolutional Semantic Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5735–5744.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 658–666.