arXiv:1904.05521v2 [cs.CV] 28 Apr 2019
UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations∗
Hao Wu1,3,4,6,†,‡, Jiayuan Mao5,6,†,‡, Yufeng Zhang2,6,‡, Yuning Jiang6, Lei Li6, Weiwei Sun1,3,4, Wei-Ying Ma6
1School of Computer Science, 2School of Economics, Fudan University, 3Shanghai Key Laboratory of Data Science, Fudan University,
4Shanghai Institute of Intelligent Electronics & Systems, 5ITCS, Institute for Interdisciplinary Information Sciences, Tsinghua University, 6Bytedance AI Lab
{wuhao5688, zhangyf, wwsun}@fudan.edu.cn, [email protected],
{jiangyuning, lileilab, maweiying}@bytedance.com
Abstract
We propose Unified Visual-Semantic Embeddings (UniVSE) for learning a joint space of visual and textual concepts. The space unifies the concepts at different levels, including objects, attributes, relations, and full scenes. A contrastive learning approach is proposed for the fine-grained alignment from only image-caption pairs. Moreover, we present an effective approach for enforcing the coverage of semantic components that appear in the sentence. We demonstrate the robustness of Unified VSE in defending against text-domain adversarial attacks on cross-modal retrieval tasks. Such robustness also empowers the use of visual cues to resolve word dependencies in novel sentences.
1 Introduction
We study the problem of establishing accurate and generalizable alignments between visual concepts and textual semantics efficiently, based upon rich but few, paired but noisy, or even biased visual-textual inputs (e.g., image-caption pairs). Consider the image-caption pair A shown in Fig. 1: "A white clock on the wall is above a wooden table". The alignments are formed at multiple levels: this short sentence can be decomposed into a rich set of semantic components (Abend et al., 2017): objects (clock, table and wall) and relations (clock above table, and clock on wall). These components are linked with different parts of the scene.
This motivates our work to introduce Unified Visual-Semantic Embeddings (Unified VSE for
∗This paper is to be presented in the non-archival track of the NAACL 2019 Spatial Language Understanding (SpLU) & Grounded Communication for Robotics (RoboNLP) workshops only. The full version of this paper can be accessed at https://arxiv.org/abs/1904.05521v1, which has been published in the proceedings of CVPR 2019.
† indicates equal contribution. ‡ Work was done when HW, JM and YZ were intern researchers at the Bytedance AI Lab.
Figure 1: Two exemplar image-caption pairs. Humans are able to establish accurate and generalizable alignments between vision and language at different levels: objects, relations and full sentences. Pair A and B form a pair of contrastive examples for the concepts clock and basin.
short). Shown in Fig. 2, Unified VSE bridges visual and textual representations in a joint embedding space that unifies the embeddings for objects (noun phrases vs. visual objects), attributes (prenominal phrases vs. visual attributes), relations (verbs or prepositional phrases vs. visual relations) and scenes (sentences vs. images).
There are two major challenges in establishing such a factorized alignment. First, the link between the textual description of an object and the corresponding image region is ambiguous: a visual scene consists of multiple objects, and thus it is unclear to the learner which object should be aligned with the description. Second, it can be problematic to directly learn a neural network that combines various semantic components in a caption to form an encoding for the full sentence, with the training objective of maximizing cross-modal retrieval performance on the training set (e.g., in (You et al., 2018; Ma et al., 2015; Shi et al., 2018)). As reported by (Shi et al., 2018), because of the inevitable bias in the dataset (e.g., two objects may co-occur with each other in most cases; see the table and the wall in Fig. 1 as an example), the learned sentence encoders usually pay attention to
Figure 2: We build a visual-semantic embedding space which unifies the embeddings for objects, attributes, relations and full scenes.
only part of the sentence. As a result, they are vulnerable to text-domain adversarial attacks: adversarial captions constructed from original captions by adding small perturbations (e.g., by changing wall to shelf) can easily fool the model (Shi et al., 2018; Shekhar et al., 2017).
We resolve the aforementioned challenges by a natural combination of two ideas: cross-situational learning and the enforcement of semantic coverage that regularizes the encoder. Cross-situational learning, or learning from contrastive examples (Fazly et al., 2010), uses contrastive examples in the dataset to resolve the referential ambiguity of objects: looking at both Pair A and B in Fig. 1, we know that clock should refer to an object that occurs only in scene A but not B. Meanwhile, to alleviate dataset biases such as object co-occurrence, we present an effective approach that enforces semantic coverage: the meaning of a caption is a composition of all semantic components in the sentence (Abend et al., 2017). Accordingly, the embedding of a caption should cover all semantic components, and changing any of them should affect the global caption embedding.
Conceptually and empirically, Unified VSE makes the following three contributions.
First, the explicit factorization of the visual-semantic embedding space enables us to build a fine-grained correspondence between visual and textual data, which further benefits a set of downstream visual-textual tasks. We achieve this through a contrastive example mining technique that uniformly applies to different semantic components, in contrast to the sentence- or image-level contrastive samples used by existing visual-semantic learning (You et al., 2018; Ma et al., 2015; Faghri et al., 2018).
Second, we propose a caption encoder that ensures coverage of all semantic components that appear in the sentence. We show that this regularization helps our model learn a robust semantic representation for captions. It effectively defends against adversarial attacks in the text domain.
Furthermore, we show how our learned embeddings can provide visual cues to assist the parsing of novel sentences, including determining content-word dependencies and labelling semantic roles for certain verbs. As a result, our model can build reliable connections between vision and language using given semantic cues and, in return, bootstrap the acquisition of language.
2 Related work
Visual semantic embedding. Visual semantic embedding (Frome et al., 2013) is a common technique for learning a joint representation of vision and language. The embedding space empowers a set of cross-modal tasks such as image captioning (Vinyals et al., 2015; Xu et al., 2015; Donahue et al., 2015) and visual question answering (Antol et al., 2015; Xu and Saenko, 2016).
A fundamental technique proposed in (Frome et al., 2013) for aligning the two modalities is to use pairwise ranking to learn a distance metric from similar and dissimilar cross-modal pairs (Wang et al., 2016; Ren et al., 2016; Kiros et al., 2014a; Eisenschtat and Wolf, 2017; Liu et al., 2017; Kiros et al., 2014b). As a representative, VSE++ (Faghri et al., 2018) uses an online hard negative mining (OHEM) strategy (Shrivastava et al., 2016) for data sampling and shows a performance gain. VSE-C (Shi et al., 2018), based on VSE++, enhances the robustness of the learned visual-semantic embeddings by incorporating rule-generated textual adversarial samples as hard negatives during training. In this paper, we present a contrastive learning approach based on semantic components.
There are multiple VSE approaches that also use linguistically-aware techniques for sentence encoding and learning. Hierarchical multimodal LSTM (HM-LSTM) (Niu et al., 2017) and (Xiao et al., 2017), as two examples, both leverage the constituency parsing tree. Multimodal-CNN (m-CNN) (Ma et al., 2015) and CSE (You et al., 2018) apply convolutional neural networks to the caption and extract a hierarchical representation of sentences. Our model differs from them in two aspects. First, Unified VSE is built upon a factorized semantic space instead of syntactic knowledge. Second, we employ a contrastive example mining approach that uniformly applies to different semantic components. It substantially improves the learned embeddings, while the related works use only sentence-level contrastive examples.
The learning of object-level alignment in Unified VSE is also related to (Karpathy and Fei-Fei, 2015; Karpathy et al., 2014; Ren et al., 2017), where the authors incorporate pre-trained object detectors for the semantic alignment. (Engilberge et al., 2018) propose a selective pooling technique for the aggregation of object features. Compared with them, Unified VSE presents a more general approach that embeds concepts of different levels, while still requiring no extra supervision.

Structured representation for vision and language. We connect visual and textual representations in a structured embedding space. The design of its structure is partially motivated by the papers on relational visual representations (scene graphs) (Lu et al., 2016; Johnson et al., 2015, 2018), where a scene is represented by a set of objects and their relations. Compared with them, our model does not rely on labelled graphs during training.
Researchers have designed various types of representations (Banarescu et al., 2013; Montague, 1970) as well as different models (Liang et al., 2013; Zettlemoyer and Collins, 2005) for translating natural language sentences into structured representations. In this paper, we show how incorporating such semantic parsing into visual-semantic embedding facilitates the learning of the embedding space. Moreover, we show how the learned VSE can, in return, help the parser resolve parsing ambiguities using visual cues.
3 Unified Visual-Semantic Embeddings
We now describe the overall architecture and training paradigm for the proposed Unified Visual-Semantic Embeddings. Given an image-caption pair, we first parse the caption into a structured meaning representation, composed of a set of semantic components: object nouns, prenominal modifiers, and relational dependencies. We encode different types of semantic components with type-specific encoders. A caption encoder combines the embeddings of the semantic components into a caption semantic embedding. Jointly, we encode images with a convolutional neural network (CNN) into the same, unified VSE space. The distance between the image embedding and the sentential embedding measures the semantic similarity between the image and the caption.
We employ a multi-task learning approach for the joint learning of embeddings for semantic components (as the "basis" of the VSE space) as well as the caption encoder (as the combiner of semantic components).
3.1 Visual-Semantic Embedding: A Revisit
We begin this section with an introduction to the two-stream VSE approach. It jointly learns the embedding spaces of two modalities, vision and language, and aligns them using parallel image-text pairs (e.g., images and captions from the MS-COCO dataset (Lin et al., 2014)).
Let v ∈ R^d be the representation of the image and u ∈ R^d be the representation of a caption matching this image, both encoded by neural modules. To achieve the alignment, a bidirectional margin-based ranking loss has been widely applied (Faghri et al., 2018; You et al., 2018; Huang et al., 2017). Formally, for an image (caption) embedding v (u), denote the embedding of its matched caption (image) as u⁺ (v⁺). A negative (unmatched) caption (image) is sampled, whose embedding is denoted as u⁻ (v⁻). We define the bidirectional ranking loss ℓ_sent between captions and images as:

ℓ_sent = Σ_u F_{v⁻}( |δ + s(u, v⁻) − s(u, v⁺)|₊ ) + Σ_v F_{u⁻}( |δ + s(u⁻, v) − s(u⁺, v)|₊ ),   (1)

where δ is a predefined margin, |x|₊ = max(x, 0) is the traditional ranking loss, and F_x(·) = max_x(·) denotes the hard negative mining strategy (Faghri et al., 2018; Shrivastava et al., 2016). s(·, ·) is a similarity function between two embeddings and is usually implemented as cosine similarity (Faghri et al., 2018; Shi et al., 2018; You et al., 2018).
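As a minimal numeric sketch of Eq. (1) for a single matched pair, with s as cosine similarity and F as max-based hard negative mining (the function names and the margin value below are illustrative, not from the paper):

```python
import numpy as np

def cosine(a, b):
    # s(., .): cosine similarity between two embeddings
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bidirectional_ranking_loss(u, v, u_negs, v_negs, delta=0.2):
    """Eq. (1) for one matched pair (u, v): hinge losses in both retrieval
    directions, keeping only the hardest negative (F = max)."""
    s_pos = cosine(u, v)
    # caption -> image direction: push away the hardest negative image
    l_img = max(max(0.0, delta + cosine(u, vn) - s_pos) for vn in v_negs)
    # image -> caption direction: push away the hardest negative caption
    l_cap = max(max(0.0, delta + cosine(un, v) - s_pos) for un in u_negs)
    return l_img + l_cap
```

With a perfectly aligned pair and orthogonal negatives, each hinge term is max(0, δ − 1) = 0, so the loss vanishes; only negatives within the margin of the positive contribute gradient.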
3.2 Semantic Encodings
The encoding of a caption consists of three steps. As an example, consider the caption "A white clock on the wall is above a wooden table". 1) We extract a structured meaning representation as a collection of three types of semantic components: objects (clock, wall, table), attribute-object dependencies (white clock, wooden table) and relational dependencies (clock above table, clock on wall). 2) We encode each component as well as the full sentence with type-specific encoders into the unified VSE space. 3) We represent the embedding of the caption by combining the semantic components.

Semantic parsing. We implement a semantic parser¹ of image captions based on (Schuster et al., 2015). Given the input sentence, the parser first performs a syntactic dependency parsing. A set of rules is applied to the dependency tree to extract the object entities that appear in the sentence, the adjectives that modify the object nouns, the subjects/objects of the verbs, and the prepositional phrases. For simplicity, we consider only single-word nouns for objects and single-word adjectives for object attributes.

Encoding objects and attributes. We use a unified object encoder φ for nouns and adjective-noun pairs. For each word w in the vocabulary, we initialize a basic semantic embedding w^(basic) ∈ R^{d_basic} and a modifier semantic embedding w^(modif) ∈ R^{d_modif}.

For a single noun word w_n (e.g., clock), we define its embedding w_n as w_n^(basic) ⊕ w_n^(modif), where ⊕ denotes the concatenation of vectors. For an (adjective, noun) pair (w_a, w_n) (e.g., (white, clock)), its embedding w_{a,n} is defined as w_n^(basic) ⊕ w_a^(modif), where w_a^(modif) encodes the attribute information. In implementation, the basic semantic embedding is initialized from GloVe (Pennington et al., 2014). The modifier semantic embeddings (both w_n^(modif) and w_a^(modif)) are randomly initialized and jointly learned. w_n^(modif) can be regarded as an intrinsic modifier for each noun.

To fuse the embeddings of basic and modifier semantics, we employ a gated fusion function:

φ(w_n) = Norm(σ(W_1 w_n + b_1) ⊙ tanh(W_2 w_n + b_2)),
φ(w_{a,n}) = Norm(σ(W_1 w_{a,n} + b_1) ⊙ tanh(W_2 w_{a,n} + b_2)).

Throughout the text, σ denotes the sigmoid function, σ(x) = 1/(1 + exp(−x)), and Norm denotes the L2 normalization, i.e., Norm(w) = w/‖w‖_2. One may interpret φ as a GRU cell (Chung et al., 2014) taking no historical state.

Encoding relations and full sentences. Since relations and sentences are composed based on objects, we encode them with a neural combiner ψ, which takes the embeddings of word-level semantics encoded by φ as input. In practice, we implement ψ as a uni-directional GRU (Chung et al., 2014) and take the L2-normalized last state as the output.
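The gated fusion φ defined above can be sketched as follows; the dimensions and the random placeholder weights are illustrative (the real W_1, W_2, b_1, b_2 are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_basic, d_modif, d = 8, 4, 6                 # illustrative sizes only
W1 = rng.standard_normal((d, d_basic + d_modif)); b1 = np.zeros(d)
W2 = rng.standard_normal((d, d_basic + d_modif)); b2 = np.zeros(d)

def phi(w):
    """phi(w) = Norm(sigmoid(W1 w + b1) * tanh(W2 w + b2))."""
    gate = 1.0 / (1.0 + np.exp(-(W1 @ w + b1)))   # sigmoid gate
    h = gate * np.tanh(W2 @ w + b2)               # gated candidate value
    return h / np.linalg.norm(h)                  # L2 normalization

# w_n = w^(basic) (+) w^(modif): concatenated embedding for a noun
w_n = np.concatenate([rng.standard_normal(d_basic),    # e.g., GloVe part
                      rng.standard_normal(d_modif)])   # learned modifier part
u_obj = phi(w_n)
```

Because of the final Norm, the output always lies on the unit ball, consistent with the cosine-similarity geometry of the embedding space.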
To obtain a visual-semantic embedding for a relational triple (w_s, w_r, w_o) (e.g., (clock, above, table)), we first extract the word embeddings of the subject, the relational word and the object using φ. We then feed the encoded word embeddings in the same order into ψ and take the L2-normalized last state of the GRU cell. Mathematically, u_rel = ψ(w_s, w_r, w_o) = ψ({φ(w_s), φ(w_r), φ(w_o)}).

¹https://github.com/vacancy/SceneGraphParser
The embedding of a sentence, u_sent, is computed over the word sequence w_1, w_2, ..., w_k of the caption:

u_sent = ψ({φ(w_1), φ(w_2), ..., φ(w_k)}),

where for any word w_x, φ takes w_x^(basic) ⊕ w_x^(modif) as input. Note that we share the weights of the encoders ψ and φ among the encoding processes of all semantic levels. This allows our encoders of various types of components to bootstrap the learning of each other.

Combining all of the components. A straightforward implementation of the caption encoder is to directly use the sentence embedding u_sent, as it has already combined the semantics of components in a contextually-weighted manner (Levy et al., 2018). However, it has been revealed in (Shi et al., 2018) that such a combination is vulnerable to adversarial attacks: because of the biases in the dataset, the combiner ψ usually focuses on only a small set of the semantic components that appear in the caption.
We alleviate such biases by enforcing the coverage of the semantic components that appear in the sentence. Specifically, to form the caption embedding u_cap, the sentence embedding u_sent is combined with an explicit bag-of-components embedding u_comp. Mathematically, we define u_comp as an unweighted aggregation of all components in the sentence:

u_comp = Norm(Σ_obj u_obj + Σ_attr u_attr + Σ_rel u_rel),

and encode the caption as u_cap = α u_sent + (1 − α) u_comp, where 0 ≤ α ≤ 1 is a scalar weight. The presence of u_comp prevents the final caption embedding u_cap from ignoring any of the components.
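The combination step can be sketched as follows; note that the default value of α here is an arbitrary placeholder, not a value reported in this excerpt:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x)

def caption_embedding(u_sent, u_objs, u_attrs, u_rels, alpha=0.75):
    """u_cap = alpha * u_sent + (1 - alpha) * u_comp, where u_comp is the
    normalized, unweighted sum of all component embeddings."""
    u_comp = l2norm(np.sum(u_objs, axis=0)
                    + np.sum(u_attrs, axis=0)
                    + np.sum(u_rels, axis=0))
    return alpha * u_sent + (1.0 - alpha) * u_comp
```

Because u_comp weights every component equally, dropping or swapping any single component (e.g., wall → shelf) necessarily moves u_cap, which is exactly the coverage property being enforced.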
3.3 Image Encodings

We use a CNN to encode the input RGB image into the unified VSE space. Specifically, we choose a ResNet-152 model (He et al., 2016) pretrained on ImageNet (Russakovsky et al., 2015) as the image encoder. We apply a layer of 1×1 convolution on top of the last convolution layer (i.e., conv5_3) and obtain a convolutional feature map of shape 7×7×d for each image, where d denotes the dimension of the unified VSE space.

The feature map, denoted as V ∈ R^{7×7×d}, can be viewed as the embeddings of 7×7 local regions in the image. The embedding v of the whole image is defined as the aggregation of the embeddings at all regions through a global spatial pooling operator.
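At the shape level, the image side can be sketched as below. A 1×1 convolution is just a per-cell linear projection, so it reduces to a matrix multiply; the projection weights are random placeholders, and the choice of average pooling is our assumption since the excerpt does not name the pooling operator:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, d = 7, 7, 2048, 512      # ResNet-152 conv5_3 map; d is illustrative
proj = rng.standard_normal((C, d)) * 0.01  # the 1x1 conv as a C x d matrix

feat = rng.standard_normal((H, W, C))      # CNN feature map for one image
V = feat.reshape(H * W, C) @ proj          # embeddings of the 7x7 local regions
v = V.mean(axis=0)                         # global (average) spatial pooling
v = v / np.linalg.norm(v)                  # whole-image embedding on unit ball
```

Each row of V is the embedding of one of the 49 local regions; v aggregates them into the whole-image embedding compared against caption embeddings.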
3.4 Learning Paradigm

In this section, we present how to align vision and language into the unified space using contrastive learning at different semantic levels. We start with the generation of contrastive examples for different semantic components.

Negative example sampling. It has been discussed in (Shi et al., 2018) that to explore a large compositional space of semantics, directly sampling negative captions from a human-built dataset (e.g., MS-COCO captions) is not sufficient. In this paper, instead of manually defining rules that augment the training data as in (Shi et al., 2018), we address this problem by sampling contrastive negative examples in the explicitly factorized semantic space. The generation does not require manually labelled data and can be easily applied to any dataset. For a specific caption, we generate the following four types of contrastive negative samples.
• Nouns. We sample negative noun words from all nouns that do not appear in the caption.²
• Attribute-noun pairs. We sample negative pairs by randomly substituting the adjective with another adjective, or by substituting the noun.
• Relational triples. We sample negative triples by randomly substituting the subject, the relation, or the object. Moreover, we also sample whole relational triples from captions in the dataset that describe other images as negative triples.
• Sentences. We sample negative sentences from the whole dataset. Meanwhile, following (Frome et al., 2013; Faghri et al., 2018), we also sample negative images from the whole dataset as contrastive images.
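The noun case of the sampling scheme above can be sketched as follows; the vocabulary and caption contents are made up for illustration:

```python
import random

def sample_negative_nouns(caption_nouns, noun_vocab, k=2, seed=0):
    """Sample k negative nouns: vocabulary nouns absent from the caption."""
    present = set(caption_nouns)
    candidates = [n for n in noun_vocab if n not in present]
    return random.Random(seed).sample(candidates, k)

negs = sample_negative_nouns(
    caption_nouns=["clock", "wall", "table"],
    noun_vocab=["clock", "wall", "table", "dog", "cat", "dish", "chair"])
```

The other component types follow the same pattern, substituting one slot of a pair or triple at a time while keeping the rest fixed.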
The key motivation behind our visual-semantic alignment is that an object appears in a local region of the image, while the aggregation of all local regions should be aligned with the full semantics of a caption.
²For the MS-COCO dataset, "do not appear in the caption" means not appearing in any of the 5 captions associated with the same image. This also applies to the other components.
[Architecture diagrams omitted. They depict: (i) the semantic hierarchy of a caption at the object, attributed-object, relational-phrase, and sentence levels; (ii) the model overview, in which a shared ObjectEncoder and Neural Combiner encode semantic components and the full caption while a CNN with a 1×1 convolution and global pooling produces 7×7 local and global image embeddings in the unified embedding space (unit ball); and (iii) the relevance-weighted alignment of textual object embeddings with image regions and the associated alignment losses.]
Figure 3: An illustration of our relevance-weighted alignment mechanism. The relevance map shows the similarity of each region with the object embedding u<clock>. We weight the alignment loss with the map to reinforce the correspondence between u<clock> and its matched region.
Local region-level alignment. In detail, we propose a relevance-weighted alignment mechanism for linking textual object descriptors and local image regions. As shown in Fig. 3, consider the embedding of a positive textual object descriptor u_o^+, a negative textual object descriptor u_o^-, and the set of image local region embeddings V_i, i ∈ 7×7, extracted from the image. We generate a relevance map M ∈ R^{7×7}, with M_i, i ∈ 7×7, representing the relevance between u_o^+ and V_i, computed as in Eq. (2). We compute the loss for nouns and (adjective, noun) pairs by Eq. (3):

M_i = exp(s(u_o^+, V_i)) / Σ_j exp(s(u_o^+, V_j))    (2)

ℓ_obj = Σ_{i ∈ 7×7} M_i · |δ + s(u_o^-, V_i) − s(u_o^+, V_i)|_+    (3)
The intuition behind this definition is that we explicitly try to align the embedding at each image region with u_o^+. The per-region losses are weighted by the matching score, thus reinforcing the correspondence between u_o^+ and its matched region. This technique is related to multi-instance learning (Wu et al., 2015).
Global image-level alignment. For relational triples u_rel, aggregations of semantic components u_comp, and sentences u_sent, the semantics usually cover multiple objects. Thus, we align them with the full image embedding v via bidirectional ranking losses as in Eq. (1)3. The alignment losses are denoted ℓ_rel, ℓ_comp, and ℓ_sent, respectively.
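The relevance-weighted local loss of Eqs. (2)–(3) can be sketched in a few lines. Below is a minimal NumPy illustration, assuming cosine similarity (dot product of unit-normalized vectors) for s(·, ·) and a hypothetical margin δ = 0.2:

```python
import numpy as np

def relevance_weighted_loss(u_pos, u_neg, V, delta=0.2):
    """Relevance-weighted alignment loss of Eqs. (2)-(3).

    u_pos, u_neg: (d,) L2-normalized positive / negative object descriptors.
    V:            (49, d) L2-normalized embeddings of the 7x7 image regions.
    delta:        ranking margin (hypothetical value).
    """
    s_pos = V @ u_pos                                # s(u_o^+, V_i) per region
    s_neg = V @ u_neg                                # s(u_o^-, V_i) per region
    M = np.exp(s_pos) / np.exp(s_pos).sum()          # relevance map, Eq. (2)
    hinge = np.maximum(0.0, delta + s_neg - s_pos)   # |.|_+ per region
    return float((M * hinge).sum())                  # Eq. (3)
```

Regions that match the positive descriptor receive a high weight M_i, so the margin is enforced most strongly exactly where the object is grounded.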
We want to highlight that, during training, we separately align the two types of semantic representations of the caption, i.e., u_sent and u_comp, with the image. This differs from the inference-time computation of the caption embedding. Recall that α can be
3 Only textual negative samples are used for ℓ_rel.
                        Object attack             Attribute attack          Relation attack
Model                   R@1  R@5  R@10  rsum      R@1  R@5  R@10  rsum      R@1  R@5  R@10  rsum      total sum
VSE++                   32.3 69.6 81.4  183.3     19.8 59.4 76.0  155.2     26.1 66.8 78.7  171.6     510.1
VSE-C                   41.1 76.0 85.6  202.7     26.7 61.0 74.3  162.0     35.5 71.1 81.5  188.1     552.8
UniVSE (usent+ucomp)    45.3 78.3 87.3  210.9     35.3 71.5 83.1  189.9     39.0 76.5 86.7  202.2     603.0
UniVSE (usent)          40.7 76.4 85.5  202.6     30.0 70.5 80.6  181.1     32.6 72.6 83.5  188.7     572.4
UniVSE (usent+uobj)     42.9 77.2 85.6  205.7     30.1 69.0 79.8  178.9     34.0 71.2 83.6  188.8     573.4
UniVSE (usent+uattr)    40.1 73.9 83.3  197.3     37.4 72.0 81.9  191.3     30.5 70.0 81.9  182.4     571.0
UniVSE (usent+urel)     45.4 77.1 85.5  208.0     29.2 68.1 78.5  175.8     42.8 77.5 85.6  205.9     589.7

Table 1: Results on the image-to-sentence retrieval task with text-domain adversarial attacks. For each caption, we generate 5 adversarial fake captions which do not match the images; thus, the models need to retrieve the 5 positive captions from 30,000 candidate captions.
viewed as a factor that balances the training objective and the enforcement of semantic coverage. This allows us to flexibly adjust α during inference.
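As a concrete reading of this combination, the inference-time caption embedding can be sketched as a convex mixture of u_sent and u_comp projected back onto the unit ball. The exact combination rule is defined earlier in the paper; the form below is an assumption, consistent with the α ∈ [0, 1] sweep reported in Fig. 4:

```python
import numpy as np

def caption_embedding(u_sent, u_comp, alpha=0.75):
    """Hypothetical inference-time combination of the sentence embedding
    and the aggregated semantic components; alpha = 0.75 is the value
    used in the paper's other experiments."""
    u = alpha * np.asarray(u_sent) + (1.0 - alpha) * np.asarray(u_comp)
    return u / np.linalg.norm(u)  # embeddings live on the unit ball
```

Setting alpha = 1 recovers the pure sentence encoding u_sent, which, as shown below, is markedly less robust to adversarial captions.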
4 Experiments
We evaluate our model on the MS-COCO (Lin et al., 2014) dataset. It contains 82,783 training images, with each image annotated by 5 captions. We use the common 1K validation and test split from (Karpathy and Fei-Fei, 2015).
We first validate the effectiveness of enforcing the semantic coverage of caption embeddings by comparing models on cross-modal retrieval tasks with adversarial examples. We then propose a unified text-to-image retrieval task to support the contrastive learning of various semantic components. We end this section with an application of using visual cues to facilitate the semantic parsing of novel sentences. We include two baselines for comparison: VSE++ (Faghri et al., 2018) and VSE-C (Shi et al., 2018).
4.1 Retrieval under Text-Domain Adversarial Attacks
Recent works (Shi et al., 2018; Shekhar et al., 2017) have raised concerns about the robustness of the learned visual-semantic embeddings. They show that existing models are vulnerable to text-domain adversarial attacks (i.e., using adversarial captions) and can be easily fooled. This is closely related to the bias of small datasets over a large, compositional semantic space (Shi et al., 2018). To prove the robustness of the learned Unified VSE, we further conduct experiments on the image-to-sentence retrieval task with text-domain adversarial attacks. Following (Shi et al., 2018), we first design several types of adversarial captions by adding perturbations to existing captions.
1. Object attack: Randomly replace an object word in the original caption with an irrelevant one, or append an irrelevant object.
2. Attribute attack: Randomly replace or add an irrelevant attribute modifier for one object in the original caption.
3. Relational attack: 1) Randomly replace the subject/relation/object word with an irrelevant one. 2) Randomly select an entity as a subject/object and add an irrelevant relational word and object/subject.
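These attacks are simple rule-based perturbations. A sketch of the first two follows; the word lists and helper functions are illustrative, not the paper's actual generation code (in practice an "irrelevant" word is one that does not match the image or its captions):

```python
import random

# Illustrative vocabularies (assumptions, not the paper's word lists).
OBJECTS = ["clock", "table", "chair", "dog", "cat", "bowl"]
ATTRIBUTES = ["white", "wooden", "blue", "red", "old"]

def object_attack(tokens, caption_objects, rng=random):
    """Replace one object word with an irrelevant one (the replace variant;
    the append variant instead adds an irrelevant object to the caption)."""
    out = list(tokens)
    idx = out.index(rng.choice(caption_objects))
    out[idx] = rng.choice([o for o in OBJECTS if o not in caption_objects])
    return out

def attribute_attack(tokens, caption_objects, rng=random):
    """Insert an irrelevant attribute modifier in front of one object."""
    out = list(tokens)
    idx = out.index(rng.choice(caption_objects))
    out.insert(idx, rng.choice(ATTRIBUTES))
    return out
```

Each perturbed caption keeps the surface form of a fluent sentence while no longer matching the image, which is what makes these attacks effective against encoders that are insensitive to small textual changes.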
The results are shown in Table 1, where different columns represent different types of attacks. VSE++ performs worst, as it is only optimized for the retrieval performance on the dataset: its sentence encoder is insensitive to small perturbations in the text. VSE-C explicitly generates adversarial captions based on human-designed rules as hard negative examples during training, which makes it relatively robust to those adversarial attacks. Unified VSE shows strong robustness across all types of adversarial attacks and outperforms all baselines.
The ability of Unified VSE to defend against adversarial texts comes almost for free: we present zero adversarial captions during training. Unified VSE builds fine-grained semantic alignments via the contrastive learning of semantic components. It uses the explicit aggregation of the components, u_comp, to alleviate dataset biases.
Ablation study: semantic components. We now delve into the effectiveness of different semantic components by choosing different combinations of components for the caption embedding. As shown in Table 1, we use different subsets of the semantic components to form the bag-of-components embedding u_comp. For example, in UniVSE (u_sent+u_obj), only object nouns are selected and aggregated as u_comp.
The results demonstrate the effectiveness of the enforcement of semantic coverage: even if the semantic components have fine-grained alignment with visual concepts, directly using u_sent as the caption encoding still degrades the robustness against adversarial examples. Consistent with
Figure 4: R@1 performance of UniVSE on cross-modal retrieval tasks with different combination weights α. (a) Normal cross-modal retrieval (5,000 captions): image-to-sentence and sentence-to-image, no attack. (b) Image-to-sentence retrieval under object/attribute/relation attacks (30,000 captions). Our model can effectively defend against adversarial attacks with no sacrifice in performance on other tasks by choosing a reasonable α (thus we set α = 0.75 in all other experiments).
Task         obj     attr    rel     obj (det)   sum
VSE++        29.95   26.64   27.54   50.57       134.70
VSE-C        27.48   28.76   26.55   46.20       128.99
UniVSE_all   39.49   33.43   39.13   58.37       170.42
UniVSE_obj   39.71   33.37   34.38   56.84       164.30
UniVSE_attr  31.31   37.51   34.73   52.26       155.81
UniVSE_rel   37.55   32.70   39.57   59.12       168.94

Table 2: The mAP performance on the unified text-to-image retrieval task. Please refer to the text for details.
the intuition, enforcing the coverage of a certain type of components (e.g., objects) helps the model defend against adversarial attacks of the same type (e.g., adversarial attacks on nouns). Combining all components leads to the best performance.
Choice of the combination factor α. We study the choice of α by conducting experiments on both the normal retrieval tasks and the adversarial one. Fig. 4 shows the R@1 performance under the normal/adversarial retrieval scenarios w.r.t. different choices of α. We observe that the u_comp term contributes little on the normal cross-modal retrieval tasks but largely on tasks with adversarial attacks. Recall that α can be viewed as a factor that balances the training objective and the enforcement of semantic coverage. By choosing α from a reasonable region (e.g., from 0.6 to 0.8), our model can effectively defend against adversarial attacks, with no sacrifice in the overall performance.
4.2 Unified Text-to-Image Retrieval
We extend the word-to-scene retrieval used by (Shi et al., 2018) into a general unified text-to-image retrieval task. In this task, models receive queries of different semantic levels, including single words (e.g., "Clock."), noun phrases (e.g., "White clock."), relational phrases (e.g., "Clocks on wall") and full sentences. For all baselines, texts of different types are treated as full sentences. The results are presented in Table 2.
We generate positive image-text pairs by randomly choosing an image and a semantic component from the 5 captions matched with the chosen image. It is worth mentioning that the semantic components extracted from captions may not cover all visual concepts in the corresponding image, which makes the annotation noisy. To address this, we also leverage the MS-COCO detection annotations to facilitate the evaluation (see the obj (det) column): we treat the labels of detection bounding boxes as the annotation of objects in the scene.
Ablation study: contrastive learning of components. We evaluate the effectiveness of using contrastive samples for different semantic components. As shown in Table 2, UniVSE_obj denotes the model trained with contrastive samples of noun components only; the same notation applies to the other models. The UniVSE trained with a certain type of contrastive examples (e.g., UniVSE_obj with contrastive nouns) consistently improves the retrieval performance for queries of the same type (e.g., retrieving images from a single noun). UniVSE trained with all kinds of contrastive samples performs best overall and shows a significant gap w.r.t. other baselines.
Visualization of the semantic alignment. We visualize the semantic-relevance map on an image w.r.t. a given query u_q for a qualitative evaluation of the alignment performance of various semantic components. The map M_i is computed as the similarity between each image region V_i and u_q, in a similar way as Eq. (2). As shown in Fig. 5, this visualization helps verify that our model successfully aligns different semantic components with the corresponding image regions.
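The visualization reduces to a temperature-controlled softmax over per-region similarities. A minimal sketch, assuming unit-normalized embeddings so the dot product serves as the similarity s(·, ·):

```python
import numpy as np

def relevance_map(u_q, V, tau=0.1):
    """Relevance of query embedding u_q (shape (d,)) to each of the 7x7
    image regions (V has shape (49, d)), normalized by a softmax with
    temperature tau (tau = 0.1 as in Fig. 5)."""
    s = V @ u_q                   # per-region similarity scores
    e = np.exp(s / tau)
    return (e / e.sum()).reshape(7, 7)
```

The low temperature sharpens the map, so the probability mass concentrates on the few regions that actually ground the query.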
4.3 Semantic Parsing with Visual Cues
As a side application, we show how the learned unified VSE space can provide visual cues to help the semantic parsing of sentences. Fig. 6 shows the general idea. When parsing a sentence, ambiguity may occur; e.g., the subject of the relational word eat may be sweater or burger. It is not easy for a textual parser to decide which one is correct because of the innate syntactic ambiguity. However, we can use the image described by this sentence to assist the parsing. This is related to previous works on using image segmentation
Figure 5: The relevance maps and grounded areas obtained from the retrieved images w.r.t. three queries ("black dog", "white dog", "player swing bat"), together with their matching scores. The temperature of the softmax for visualizing the relevance map is τ = 0.1. Pixels in white indicate a higher matching score. Note that the third image for the query "black dog" contains two dogs, while our model successfully locates the black one (on the left). It also succeeds in finding the white dog in the first image for "white dog". Moreover, for the query "player swing bat", although there are many players in the image, our model only attends to the man swinging the bat.
Figure 6: An example showing that our model can leverage the image to assist semantic parsing when there is ambiguity in the sentence. By comparing the matching scores of all possible subject/object combinations for the relation eat, we can infer that the score of "girl eat burger" is much higher than that of "sweater eat burger", which helps eliminate the ambiguity. Note that the other components in the scene graph are also correctly inferred by our model.
models to facilitate sentence parsing (Christie et al., 2016).
This motivates us to design two tasks: 1) recovering the dependency between attributes and entities, and 2) recovering the relational triples. In detail, we first extract the entities, attributes and relational words from the raw sentence without knowing their dependencies. For each possible combination forming a certain semantic component, our model computes its embedding in the unified joint space. E.g., in Fig. 6, there are in total 3 × (3 − 1) = 6 possible dependencies for eat. We choose the combination with the highest matching score with the image to decide the subject/object dependencies of the relation eat. We use the parsed semantic components as the ground truth and report the accuracy, defined as the fraction of correctly resolved dependencies over the total number of attributes/relations.
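The selection step amounts to enumerating the ordered (subject, object) pairs and taking the one whose relational-phrase embedding best matches the image. A sketch, where score(subj, rel, obj) is an assumed callable returning the image-matching score of a triple (e.g., the similarity between its embedding u_rel and the image embedding v):

```python
from itertools import permutations

def resolve_relation(relation, entities, score):
    """Pick the (subject, object) dependency for `relation` with the
    highest image-matching score. For 3 candidate entities there are
    3 * (3 - 1) = 6 ordered pairs, matching the example in Fig. 6."""
    pairs = permutations(entities, 2)
    return max(pairs, key=lambda p: score(p[0], relation, p[1]))
```

The same argmax-over-combinations strategy applies to attribute-entity dependencies, with (attribute, entity) pairs in place of (subject, object) pairs.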
Table 3 reports the results on assisting semantic parsing with visual cues, compared with other baselines. Fig. 6 shows a real case in which we successfully resolve the textual ambiguity.
Task     attributed object   relational phrase
Random   37.41               31.90
VSE++    41.12               43.31
VSE-C    43.44               41.08
UniVSE   64.82               62.69

Table 3: The accuracy of different models on recovering word dependencies with visual cues. In the "Random" baseline, we randomly assign the word dependencies.
5 Conclusion
We present a unified visual-semantic embedding approach that learns a joint representation space of vision and language in a factorized manner: different levels of textual semantic components, such as objects and relations, get aligned with regions of images. A contrastive learning approach for semantic components is proposed for the efficient learning of the fine-grained alignment. We also introduce the enforcement of semantic coverage: each caption embedding should cover all semantic components in the sentence. Unified VSE shows superiority on multiple cross-modal retrieval tasks and can effectively defend against text-domain adversarial attacks. We hope the proposed approach can empower machines that learn vision and language jointly, efficiently and robustly.
References

Omri Abend, Tom Kwiatkowski, Nathaniel J Smith, Sharon Goldwater, and Mark Steedman. 2017. Bootstrapping language acquisition. Cognition, 164:116–143.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In International Conference on Computer Vision (ICCV).

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for Sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.

Gordon Christie, Ankit Laddha, Aishwarya Agrawal, Stanislaw Antol, Yash Goyal, Kevin Kochersberger, and Dhruv Batra. 2016. Resolving language and vision ambiguities together: Joint segmentation & prepositional attachment resolution in captioned scenes. arXiv preprint arXiv:1604.02125.

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634.

Aviv Eisenschtat and Lior Wolf. 2017. Linking Image and Text with 2-way Nets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1855–1865.

Martin Engilberge, Louis Chevallier, Patrick Perez, and Matthieu Cord. 2018. Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3984–3993.

Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives.

Afsaneh Fazly, Afra Alishahi, and Suzanne Stevenson. 2010. A probabilistic computational model of cross-situational word learning. Cognitive Science, 34(6):1017–1063.

Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems (NIPS), pages 2121–2129.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Yan Huang, Wei Wang, and Liang Wang. 2017. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7254–7262.

Justin Johnson, Agrim Gupta, and Li Fei-Fei. 2018. Image Generation from Scene Graphs.

Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. 2015. Image Retrieval using Scene Graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3668–3678.

Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128–3137.

Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. 2014. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Conference on Neural Information Processing Systems (NIPS), pages 1889–1897.

Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel. 2014a. Multimodal Neural Language Models. In International Conference on Machine Learning, pages 595–603.

Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. 2014b. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

Omer Levy, Kenton Lee, Nicholas FitzGerald, and Luke Zettlemoyer. 2018. Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum. arXiv preprint arXiv:1805.03716.

Percy Liang, Michael I. Jordan, and Dan Klein. 2013. Learning Dependency-Based Compositional Semantics. Computational Linguistics, 39(2):389–446.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), pages 740–755.

Yu Liu, Yanming Guo, Erwin M Bakker, and Michael S Lew. 2017. Learning a Recurrent Residual Fusion Network for Multimodal Matching. In IEEE International Conference on Computer Vision (ICCV), pages 4127–4136.

Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. 2016. Visual relationship detection with language priors. In European Conference on Computer Vision (ECCV), pages 852–869. Springer.

Lin Ma, Zhengdong Lu, Lifeng Shang, and Hang Li. 2015. Multimodal Convolutional Neural Networks for Matching Image and Sentence. In IEEE International Conference on Computer Vision (ICCV), pages 2623–2631.

Richard Montague. 1970. Universal Grammar. Theoria, 36(3):373–398.

Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. 2017. Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding. In IEEE International Conference on Computer Vision (ICCV), pages 1899–1907.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. 2016. Joint Image-Text Representation by Gaussian Visual-Semantic Embedding. In ACM Multimedia (ACM-MM), pages 207–211.

Zhou Ren, Hailin Jin, Zhe Lin, Chen Fang, and Alan Yuille. 2017. Multiple Instance Visual-Semantic Embedding. In British Machine Vision Conference (BMVC).

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.

Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D. Manning. 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Workshop on Vision and Language (VL15), Lisbon, Portugal. Association for Computational Linguistics.

Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. FOIL it! Find One Mismatch between Image and Language Caption. arXiv preprint arXiv:1705.01359.

Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, and Jian Sun. 2018. Learning Visually-Grounded Semantics from Contrastive Adversarial Samples. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 3715–3727.

Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. 2016. Training Region-Based Object Detectors with Online Hard Example Mining. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5253–5262.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and Tell: A Neural Image Caption Generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164.

Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning Deep Structure-Preserving Image-Text Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5005–5013.

Jiajun Wu, Yinan Yu, Chang Huang, and Kai Yu. 2015. Deep multiple instance learning for image classification and auto-annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Fanyi Xiao, Leonid Sigal, and Yong Jae Lee. 2017. Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5945–5954.

Huijuan Xu and Kate Saenko. 2016. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In European Conference on Computer Vision (ECCV), pages 451–466. Springer.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning (ICML), pages 2048–2057.

Quanzeng You, Zhengyou Zhang, and Jiebo Luo. 2018. End-to-End Convolutional Semantic Embeddings. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5735–5744.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 658–666.
Luke S. Zettlemoyer and Michael Collins. 2005. Learn-ing to Map Sentences to Logical Form: StructuredClassification with Probabilistic Categorial Gram-mars. In Proceedings of the Twenty-First Confer-ence on Uncertainty in Artificial Intelligence (UAI),pages 658–666.