Expressing Visual Relationships via Language
Hao Tan (1,2), Franck Dernoncourt (2), Zhe Lin (2), Trung Bui (2), Mohit Bansal (1); 1: UNC Chapel Hill, 2: Adobe Research
{haotan, mbansal}@cs.unc.edu, {dernonco, zlin, bui}@adobe.com
Abstract
Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single image captioning. However, in various real applications (e.g., image editing, difference interpretation, and retrieval), generating relational captions for two images can also be very useful. This important problem has not been explored, mostly due to a lack of datasets and effective models. To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions. We then propose a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention. We also extend the model with dynamic relational attention, which calculates visual alignment while decoding. Our models are evaluated on our newly collected dataset and two public datasets consisting of image pairs annotated with relationship sentences. Experimental results, based on both automatic and human evaluation, demonstrate that our model outperforms all baselines and existing methods on all the datasets.1
1 Introduction
Generating captions to describe natural images is a fundamental research problem at the intersection of computer vision and natural language processing. Single image captioning (Mori et al., 1999; Farhadi et al., 2010; Kulkarni et al., 2011) has many practical applications such as text-based image search, photo curation, assisting visually-impaired people, and image understanding in social
1 Our data and code are publicly available at: https://github.com/airsplay/VisualRelationships
[Figure 1 shows an input image pair and the predicted instruction: "Remove the people from the picture."]
Figure 1: An example result of our method, showing the input image pair from our Image Editing Request dataset and the output instruction predicted by our relational speaker model trained on the dataset.
media, etc. This task has drawn significant attention in the research community with numerous studies (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018), and recent state-of-the-art methods have achieved promising results on large captioning datasets, such as MS COCO (Lin et al., 2014). Besides single image captioning, the community has also explored other visual captioning problems such as video captioning (Venugopalan et al., 2015; Xu et al., 2016) and referring expressions (Kazemzadeh et al., 2014; Yu et al., 2017). However, the problem of two-image captioning, especially the task of describing the relationships and differences between two images, is still under-explored. In this paper, we focus on advancing research in this challenging problem by introducing a new dataset and proposing novel neural relational-speaker models.
arXiv:1906.07689v2 [cs.CL] 19 Jun 2019

To the best of our knowledge, Jhamtani and Berg-Kirkpatrick (2018) is the only public dataset aimed at generating natural language descriptions for two real images. This dataset is about 'spotting the difference', and hence focuses more on describing exhaustive differences by learning alignments between multiple text descriptions and multiple image regions; hence, the differences are intended to be explicitly identifiable by subtracting the two images. There are many other tasks that require more diverse, detailed, and implicit relationships between two images. Interpreting image editing effects with instructions is a suitable task for this purpose, because it requires exploiting visual transformations and is widely used in real life, such as explaining complex image editing effects to laypersons or visually-impaired users, image edit or tutorial retrieval, and language-guided image editing systems. We first build a new language-guided image editing dataset with high-quality annotations by (1) crawling image pairs from real image editing request websites, (2) annotating editing instructions via Amazon Mechanical Turk, and (3) refining the annotations through experts.
Next, we propose a new neural speaker model for generating sentences that describe the visual relationship between a pair of images. Our model is general and not dependent on any specific dataset. Starting from an attentive encoder-decoder baseline, we first develop a model enhanced with two attention-based neural components, a static relational attention and a sequential multi-head attention, to address these two challenges, respectively. We further extend it by designing a dynamic relational attention module to combine the advantages of these two components, which finds the relationship between two images while decoding. The computation of dynamic relational attention is mathematically equivalent to attention over all visual "relationships". Thus, our method provides a direct way to model visual relationships in language.
To show the effectiveness of our models, we evaluate them on three datasets: our new dataset, the "Spot-the-Diff" dataset (Jhamtani and Berg-Kirkpatrick, 2018), and the two-image visual reasoning NLVR2 dataset (Suhr et al., 2019) (adapted for our task). We train models separately on each dataset with the same hyper-parameters and evaluate them on the same test set across all methods. Experimental results demonstrate that our model outperforms all the baselines and existing methods. The main contributions of our paper are: (1) We create a novel human-language-guided image editing dataset to boost the study of describing visual relationships; (2) We design novel relational-speaker models, including a dynamic relational attention module, to handle the problem of two-image captioning by focusing on all their visual relationships; (3) Our method is evaluated on several datasets and achieves the state of the art.
2 Datasets
We present the collection process and statistics of our Image Editing Request dataset and briefly introduce two public datasets (viz., Spot-the-Diff and NLVR2). All three datasets are used to study the task of two-image captioning and to evaluate our relational-speaker models. Examples from these three datasets are shown in Fig. 2.
2.1 Image Editing Request Dataset
Each instance in our dataset consists of an image pair (i.e., a source image and a target image) and a corresponding editing instruction which correctly and comprehensively describes the transformation from the source image to the target image. Our collected Image Editing Request dataset will be publicly released along with the scripts to unify it with the other two datasets.
2.1.1 Collection Process

To create a high-quality, diverse dataset, we follow a three-step pipeline: image pair collection, editing instruction annotation, and post-processing by experts (i.e., cleaning and test set annotation labeling).
Image Pairs Collection We first crawl the editing image pairs (i.e., a source image and a target image) from posts on Reddit (Photoshop request subreddit)2 and Zhopped3. Posts generally start with an original image and an editing specification. Other users then reply to the post with their modified images. We collect the original images and modified images as source images and target images, respectively.
Editing Instruction Annotation The texts in the original Reddit and Zhopped posts are too noisy to be used as image editing instructions. To address this problem, we collect the image editing instructions on MTurk using an interactive interface that allows the MTurk annotators to either write an image editing instruction corresponding to a displayed image pair, or flag it as invalid (e.g., if the two images have nothing in common).
2 https://www.reddit.com/r/photoshoprequest
3 http://zhopped.com
[Figure 2 example sentences. Ours (Image Editing Request): "Add a sword and a cloak to the squirrel." Spot-the-Diff: "The blue truck is no longer there." / "A car is approaching the parking lot from the right." NLVR2 captioning: "Each image shows a row of dressed dogs posing with a cat that is also wearing some garment." NLVR2 classification: the same statement paired with a True/False label, plus "In at least one of the images, six dogs are posing for a picture, while on a bench."]
Figure 2: Examples from three datasets: our Image Editing Request, Spot-the-Diff, and NLVR2. Each example involves two natural images and an associated sentence describing their relationship. The task of generating NLVR2 captions is converted from its original classification task.
Method          B-1  B-2  B-3  B-4  ROUGE-L
Ours             52   34   21   13   45
Spot-the-Diff    41   25   15    8   31
MS COCO          38   22   15    8   34

Table 1: Human agreement on our dataset, compared with Spot-the-Diff and MS COCO (captions=3). B-1 to B-4 are BLEU-1 to BLEU-4. Our dataset has the highest human agreement.
Post-Processing by Experts MTurk annotators are not always experts in image editing. To ensure the quality of the dataset, we hire an image editing expert to label each image editing instruction of the dataset as one of the following four options: 1. correct instruction, 2. incomplete instruction, 3. implicit request, 4. other type of error. Only the data instances labeled with "correct instruction" are selected to compose our dataset and are used in training or evaluating our neural speaker model.
Moreover, two additional experts are required to write two more editing instructions (one instruction per expert) for each image pair in the validation and test sets. This process makes the dataset a multi-reference one, which allows various automatic evaluation metrics, such as BLEU, CIDEr, and ROUGE, to more accurately evaluate the quality of generated sentences.
2.1.2 Dataset Statistics
The Image Editing Request dataset that we have collected and annotated currently contains 3,939 image pairs (3,061 in training, 383 in validation, 495 in test) with 5,695 human-annotated instructions in total. Each image pair in the training set has one instruction, and each image pair in the validation and test sets has three instructions, written by three different annotators. Instructions have an average length of 7.5 words (standard deviation: 4.8). After removing words with fewer than three occurrences, the dataset has a vocabulary of 786 words. The human agreement of our dataset is shown in Table 1. The word frequencies in our dataset are visualized in Fig. 3. Most of the images in our dataset are realistic. Since the task is image editing, target images may have some artifacts (see the Image Editing Request examples in Fig. 2 and Fig. 5).
2.2 Existing Public Datasets
To show the generalization of our speaker model, we also train and evaluate our model on two public datasets, Spot-the-Diff (Jhamtani and Berg-Kirkpatrick, 2018) and NLVR2 (Suhr et al., 2019). Instances in these two datasets are each composed of two natural images and a human-written sentence describing the relationship between the two
Figure 3: Word cloud showing the vocabulary frequencies of our Image Editing Request dataset.
images. To the best of our knowledge, these are the only two public datasets with a reasonable amount of data that are suitable for our task. We next briefly introduce these two datasets.
Spot-the-Diff This dataset is designed to help generate a set of instructions that can comprehensively describe all visual differences. Thus, the dataset contains images from video-surveillance footage, in which differences can be easily found. This is because all the differences can be effectively captured by subtraction between the two images, as shown in Fig. 2. The dataset contains 13,192 image pairs, and an average of 1.86 captions are collected for each image pair. The dataset is split into training, validation, and test sets with a ratio of 8:1:1.
NLVR2 The original task of the Cornell Natural Language for Visual Reasoning (NLVR2) dataset is visual sentence classification; see Fig. 2 for an example. Given two related images and a natural language statement as inputs, a learned model needs to determine whether the statement correctly describes the visual contents. We convert this classification task to a generation task by taking only the image pairs with correct descriptions. After conversion, the amount of data is 51,020, which is almost half of the original dataset size of 107,296. We also preserve the training, validation, and test split of the original dataset.
3 Relational Speaker Models
In this section, we aim to design a general speaker model that describes the relationship between two images. Due to the different kinds of visual relationships, the meanings of the images vary across tasks: "before" and "after" in Spot-the-Diff, "left" and "right" in NLVR2, "source" and "target" in our Image Editing Request dataset. We use the nomenclature of "source" and "target" for simplicity, but our model is general and not designed for any specific dataset. Formally, the model generates a sentence $\{w_1, w_2, \ldots, w_T\}$ describing the relationship between the source image $I^{SRC}$ and the target image $I^{TRG}$. $\{w_t\}_{t=1}^T$ are the word tokens with a total length of $T$. $I^{SRC}$ and $I^{TRG}$ are natural images in their raw RGB pixels. In the rest of this section, we first introduce our basic attentive encoder-decoder model, and then show how we gradually improve it to better fit the task.
3.1 Basic Model
Our basic model (Fig. 4(a)) is similar to the baseline model in Jhamtani and Berg-Kirkpatrick (2018), which is adapted from the attentive encoder-decoder model for single image captioning (Xu et al., 2015). We use ResNet-101 (He et al., 2016) as the feature extractor to encode the source image $I^{SRC}$ and the target image $I^{TRG}$. Feature maps of size $N \times N \times 2048$ are extracted, where $N$ is the height or width of the feature map. Each feature in the feature map represents a part of the image. The feature maps are then flattened into two $N^2 \times 2048$ feature sequences $f^{SRC}$ and $f^{TRG}$, which are further concatenated into a single feature sequence $f$.
f^{SRC} = \mathrm{ResNet}(I^{SRC})    (1)
f^{TRG} = \mathrm{ResNet}(I^{TRG})    (2)
f = [f^{SRC}_1, \ldots, f^{SRC}_{N^2}, f^{TRG}_1, \ldots, f^{TRG}_{N^2}]    (3)
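As a shape-level sketch of the flattening and concatenation in Eqns. 1-3 (with random numpy arrays standing in for the actual ResNet-101 features, which are not reproduced here):

```python
import numpy as np

# Toy stand-in for the ResNet-101 feature maps described above.
# A real implementation would extract these from a pretrained CNN;
# here we only illustrate the flattening and concatenation.
N, D = 7, 2048  # spatial size of the feature map, channel dimension

f_src = np.random.randn(N, N, D).reshape(N * N, D)  # f^SRC: N^2 x 2048
f_trg = np.random.randn(N, N, D).reshape(N * N, D)  # f^TRG: N^2 x 2048
f = np.concatenate([f_src, f_trg], axis=0)          # Eqn. 3: 2*N^2 x 2048
```

The decoder then attends over the $2N^2$ rows of `f` as a single sequence of region features.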
At each decoding step t, the LSTM cell takes the embedding of the previous word $w_{t-1}$ as input. The word $w_{t-1}$ either comes from the ground truth (in training) or is the token with maximal probability (in evaluation). The attention module then attends to the feature sequence $f$ with the hidden output $h_t$ as a query. Inside the attention module, it first computes the alignment scores $\alpha_{t,i}$ between the query $h_t$ and each $f_i$. Next, the feature sequence $f$ is aggregated with a weighted average (with weights $\alpha$) to form the image context $\hat{f}$. Lastly, the context $\hat{f}_t$ and the hidden vector $h_t$ are merged into an attentive hidden vector $\hat{h}_t$ with a
Figure 4: The evolution diagram of our models to describe the visual relationships. One decoding step at t is shown. The linear layers are omitted for clarity. The basic model (a) is an attentive encoder-decoder model, which is enhanced by the multi-head attention (b) and static relational attention (c). Our best model (d) dynamically computes the relational scores in decoding to avoid losing relationship information.
fully-connected layer:

\hat{w}_{t-1} = \mathrm{embedding}(w_{t-1})    (4)
h_t, c_t = \mathrm{LSTM}(\hat{w}_{t-1}, h_{t-1}, c_{t-1})    (5)
\alpha_{t,i} = \mathrm{softmax}_i(h_t^\top W_{IMG} f_i)    (6)
\hat{f}_t = \sum_i \alpha_{t,i} f_i    (7)
\hat{h}_t = \tanh(W_1[\hat{f}_t; h_t] + b_1)    (8)

The probability of generating the k-th word token at time step t is a softmax over a linear transformation of the attentive hidden vector $\hat{h}_t$. The loss $L_t$ is the negative log-likelihood of the ground-truth word token $w^*_t$:

p_t(w_{t,k}) = \mathrm{softmax}_k(W_W \hat{h}_t + b_W)    (9)
L_t = -\log p_t(w^*_t)    (10)
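One attention step of the basic model (Eqns. 6-8) can be sketched in numpy as follows; the toy dimensions and variable names are illustrative and do not reflect the authors' actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_step(h_t, f, W_img, W1, b1):
    """One decoding-time attention step (Eqns. 6-8)."""
    scores = (f @ W_img.T) @ h_t  # alignment logits h_t^T W_IMG f_i per region
    alpha = softmax(scores)       # Eqn. 6: attention weights over regions
    f_hat = alpha @ f             # Eqn. 7: weighted-average image context
    h_hat = np.tanh(W1 @ np.concatenate([f_hat, h_t]) + b1)  # Eqn. 8
    return h_hat, alpha

# toy dimensions: 8 flattened regions (both images), feature dim 10, hidden dim 6
rng = np.random.default_rng(0)
f = rng.standard_normal((8, 10))
h_t = rng.standard_normal(6)
W_img = rng.standard_normal((6, 10))
W1 = rng.standard_normal((6, 16))
b1 = np.zeros(6)
h_hat, alpha = attend_step(h_t, f, W_img, W1, b1)
```

The word distribution of Eqn. 9 is then just a linear layer plus softmax on `h_hat`.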
3.2 Sequential Multi-Head Attention

One weakness of the basic model is that the plain attention module simply takes the concatenated image feature $f$ as input, which does not differentiate between the two images. We thus consider applying a multi-head attention module (Vaswani et al., 2017) to handle this (Fig. 4(b)). Instead of using the simultaneous multi-head attention4 of the Transformer (Vaswani et al., 2017), we implement the multi-head attention in a sequential way. This way, when the model is attending to the target image, the contextual information retrieved from the source image is available, and the model can therefore perform better at differentiation or relationship learning.
In detail, the source attention head first attends to the flattened source image feature $f^{SRC}$. The attention module is built in the same way as in Sec. 3.1, except that it now only attends to the source image:
\alpha^{SRC}_{t,i} = \mathrm{softmax}_i(h_t^\top W_{SRC} f^{SRC}_i)    (11)
\hat{f}^{SRC}_t = \sum_i \alpha^{SRC}_{t,i} f^{SRC}_i    (12)
\hat{h}^{SRC}_t = \tanh(W_2[\hat{f}^{SRC}_t; h_t] + b_2)    (13)
The target attention head then takes the output of the source attention $\hat{h}^{SRC}_t$ as a query to retrieve the appropriate information from the target feature $f^{TRG}$:

4 We also tried the original multi-head attention, but it is empirically weaker than our sequential multi-head attention.
\alpha^{TRG}_{t,j} = \mathrm{softmax}_j(\hat{h}^{SRC\top}_t W_{TRG} f^{TRG}_j)    (14)
\hat{f}^{TRG}_t = \sum_j \alpha^{TRG}_{t,j} f^{TRG}_j    (15)
\hat{h}^{TRG}_t = \tanh(W_3[\hat{f}^{TRG}_t; \hat{h}^{SRC}_t] + b_3)    (16)
In place of $\hat{h}_t$, the output of the target head $\hat{h}^{TRG}_t$ is used to predict the next word.5
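The two chained heads of Eqns. 11-16 can be sketched as below (numpy, toy dimensions; the parameter names are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def att_head(query, feats, W_att, W_merge, b):
    """Single attention head: score regions, average them, fuse with the query."""
    alpha = softmax((feats @ W_att.T) @ query)
    ctx = alpha @ feats
    return np.tanh(W_merge @ np.concatenate([ctx, query]) + b)

def sequential_multihead(h_t, f_src, f_trg, p):
    # source head first (Eqns. 11-13)
    h_src = att_head(h_t, f_src, p["W_src"], p["W2"], p["b2"])
    # the target head sees the source context through its query h_src (Eqns. 14-16)
    h_trg = att_head(h_src, f_trg, p["W_trg"], p["W3"], p["b3"])
    return h_trg

rng = np.random.default_rng(0)
d_h, d_f, n = 6, 10, 4
p = {"W_src": rng.standard_normal((d_h, d_f)),
     "W2": rng.standard_normal((d_h, d_f + d_h)), "b2": np.zeros(d_h),
     "W_trg": rng.standard_normal((d_h, d_f)),
     "W3": rng.standard_normal((d_h, d_f + d_h)), "b3": np.zeros(d_h)}
h_trg = sequential_multihead(rng.standard_normal(d_h),
                             rng.standard_normal((n, d_f)),
                             rng.standard_normal((n, d_f)), p)
```

The sequencing is the key design choice: the target head's query already encodes what was retrieved from the source image.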
3.3 Static Relational Attention

Although the sequential multi-head attention model can learn to differentiate the two images, visual relationships are not explicitly examined. We thus allow the model to statically (i.e., not during decoding) compute the relational scores between the source and target feature sequences and reduce the scores into two relationship-aware feature sequences. We apply a bi-directional relational attention (Fig. 4(c)) for this purpose: one from the source to the target, and one from the target to the source. For each feature in the source feature sequence, the source-to-target attention computes its alignment with the features in the target feature sequence. The source feature and the attended target feature are then merged together with a fully-connected layer:
\alpha^{S\to T}_{i,j} = \mathrm{softmax}_j((W_S f^{SRC}_i)^\top (W_T f^{TRG}_j))    (17)
f^{S\to T}_i = \sum_j \alpha^{S\to T}_{i,j} f^{TRG}_j    (18)
\hat{f}^S_i = \tanh(W_4[f^{SRC}_i; f^{S\to T}_i] + b_4)    (19)
We decompose the attention weight into two small matrices $W_S$ and $W_T$ so as to reduce the number of parameters, because the dimension of the image feature is usually large. The target-to-source cross-attention is built in the opposite way: it takes each target feature $f^{TRG}_j$ as a query, attends to the source feature sequence, and gets the attentive feature $\hat{f}^T_j$. We then use these two bidirectional attentive sequences $\hat{f}^S_i$ and $\hat{f}^T_j$ in the multi-head attention module (shown in the previous subsection) at each decoding step.
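A minimal numpy sketch of the source-to-target direction (Eqns. 17-19), with toy dimensions; the target-to-source direction is symmetric, and the low-rank factorization through W_S and W_T is the parameter-saving trick described above:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def static_relational(f_src, f_trg, W_S, W_T, W4, b4):
    """Source-to-target static relational attention (Eqns. 17-19)."""
    # low-rank bilinear scores (W_S f_i)^T (W_T f_j) keep the parameter count small
    scores = (f_src @ W_S.T) @ (f_trg @ W_T.T).T  # (n_src, n_trg), Eqn. 17
    alpha = softmax(scores, axis=1)
    f_s2t = alpha @ f_trg                         # Eqn. 18: attended target features
    # fuse each source feature with its attended target feature, Eqn. 19
    return np.tanh(np.concatenate([f_src, f_s2t], axis=1) @ W4.T + b4)

rng = np.random.default_rng(0)
n, d_f, k = 4, 10, 5  # regions, feature dim, low-rank dim
f_hat_src = static_relational(rng.standard_normal((n, d_f)),
                              rng.standard_normal((n, d_f)),
                              rng.standard_normal((k, d_f)),
                              rng.standard_normal((k, d_f)),
                              rng.standard_normal((d_f, 2 * d_f)),
                              np.zeros(d_f))
```

Because this runs once per image pair rather than once per decoding step, it is "static" in the sense used in the section title.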
3.4 Dynamic Relational Attention

The static relational attention module compresses the pairwise relationships (of size $N^4$) into two relationship-aware feature sequences (of size $2 \times N^2$). The compression saves computational resources but has a potential drawback of information loss, as discussed in Bahdanau et al. (2015) and Xu et al. (2015). In order to avoid losing information, we modify the static relational attention module into its dynamic version, which calculates the relational scores while decoding (Fig. 4(d)).

5 We tried exchanging the order of the two heads, and also running both orders concurrently. We did not see any significant difference in results between them.
At each decoding step t, the dynamic relational attention calculates the alignment score $a_{t,i,j}$ between three vectors: a source feature $f^{SRC}_i$, a target feature $f^{TRG}_j$, and the hidden state $h_t$. Since the dot product used in the previous attention modules does not have a direct extension to three vectors, we extend the dot product and use it to compute the three-vector alignment score.
\mathrm{dot}(x, y) = \sum_d x_d y_d = x^\top y    (20)
\mathrm{dot}^*(x, y, z) = \sum_d x_d y_d z_d = (x \odot y)^\top z    (21)
a_{t,i,j} = \mathrm{dot}^*(W_{SK} f^{SRC}_i, W_{TK} f^{TRG}_j, W_{HK} h_t)    (22)
          = (W_{SK} f^{SRC}_i \odot W_{TK} f^{TRG}_j)^\top W_{HK} h_t    (23)
where $\odot$ is the element-wise multiplication. The alignment scores (of size $N^4$) are normalized by a softmax, and the attention information is fused into the attentive hidden vector $\hat{f}^D_t$ as before:
\alpha_{t,i,j} = \mathrm{softmax}_{i,j}(a_{t,i,j})    (24)
\hat{f}^{SRC\text{-}D}_t = \sum_{i,j} \alpha_{t,i,j} f^{SRC}_i    (25)
\hat{f}^{TRG\text{-}D}_t = \sum_{i,j} \alpha_{t,i,j} f^{TRG}_j    (26)
\hat{f}^D_t = \tanh(W_5[\hat{f}^{SRC\text{-}D}_t; \hat{f}^{TRG\text{-}D}_t; h_t] + b_5)    (27)
           = \tanh(W_{5S}\hat{f}^{SRC\text{-}D}_t + W_{5T}\hat{f}^{TRG\text{-}D}_t + W_{5H} h_t + b_5)    (28)
where $W_{5S}$, $W_{5T}$, $W_{5H}$ are sub-matrices of $W_5$ and $W_5 = [W_{5S}, W_{5T}, W_{5H}]$.
According to Eqn. 23 and Eqn. 28, we find an analog to conventional attention layers with the following specifications:

• Query: $h_t$
• Key: $W_{SK} f^{SRC}_i \odot W_{TK} f^{TRG}_j$
• Value: $W_{5S} f^{SRC}_i + W_{5T} f^{TRG}_j$

The key $W_{SK} f^{SRC}_i \odot W_{TK} f^{TRG}_j$ and the value $W_{5S} f^{SRC}_i + W_{5T} f^{TRG}_j$ can be considered as representations of the visual relationship between $f^{SRC}_i$
Method                                       BLEU-4  CIDEr  METEOR  ROUGE-L
Our Dataset (Image Editing Request)
basic model                                   5.04   21.58   11.58   34.66
+multi-head att                               6.13   22.82   11.76   35.13
+static rel-att                               5.76   20.70   12.59   35.46
-static +dynamic rel-att                      6.72   26.36   12.80   37.25
Spot-the-Diff
CAPT (Jhamtani and Berg-Kirkpatrick, 2018)    7.30   26.30   10.50   25.60
DDLA (Jhamtani and Berg-Kirkpatrick, 2018)    8.50   32.80   12.00   28.60
basic model                                   5.68   22.20   10.98   24.21
+multi-head att                               7.52   31.39   11.64   26.96
+static rel-att                               8.31   33.98   12.95   28.26
-static +dynamic rel-att                      8.09   35.25   12.20   31.38
NLVR2
basic model                                   5.04   43.39   10.82   22.19
+multi-head att                               5.11   44.80   10.72   22.60
+static rel-att                               4.95   45.67   10.89   22.69
-static +dynamic rel-att                      5.00   46.41   10.37   22.94

Table 2: Automatic metrics of test results on the three datasets. The best results on the main metric are marked in bold. Our full model is the best on all three datasets under the main metric.
and $f^{TRG}_j$. It is a direct attention over the visual relationships between the source and target images, and hence is suitable for the task of generating relationship descriptions.
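The trilinear scoring and joint softmax of Eqns. 23-26 can be sketched as follows (numpy, toy dimensions; an illustration of the mechanism, not the paper's code):

```python
import numpy as np

def dynamic_relational(h_t, f_src, f_trg, W_SK, W_TK, W_HK):
    """Dynamic relational attention over all pairwise feature combinations
    (Eqns. 23-26): one score per (source region i, target region j)."""
    k_src = f_src @ W_SK.T  # (n, k) projected source features
    k_trg = f_trg @ W_TK.T  # (m, k) projected target features
    q = W_HK @ h_t          # (k,)  projected hidden state
    # Eqn. 23: a[i, j] = sum_d k_src[i, d] * k_trg[j, d] * q[d]
    a = np.einsum('ik,jk,k->ij', k_src, k_trg, q)
    e = np.exp(a - a.max())
    alpha = e / e.sum()     # Eqn. 24: joint softmax over all (i, j) pairs
    f_src_d = alpha.sum(axis=1) @ f_src  # Eqn. 25: marginalize over j
    f_trg_d = alpha.sum(axis=0) @ f_trg  # Eqn. 26: marginalize over i
    return f_src_d, f_trg_d, alpha

rng = np.random.default_rng(0)
n, m, d_f, d_h, k = 4, 4, 10, 6, 5
f_src_d, f_trg_d, alpha = dynamic_relational(
    rng.standard_normal(d_h),
    rng.standard_normal((n, d_f)), rng.standard_normal((m, d_f)),
    rng.standard_normal((k, d_f)), rng.standard_normal((k, d_f)),
    rng.standard_normal((k, d_h)))
```

Note that the softmax here is over the full n x m grid of pairs, which is what makes this an attention over relationships rather than over individual regions.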
4 Results
To evaluate the performance of our relational speaker models (Sec. 3), we trained them on all three datasets (Sec. 2). We evaluate our models based on both automatic metrics and pairwise human evaluation. We also show generated examples for each dataset.
4.1 Experimental Setup
We use the same hyperparameters when applying our model to the three datasets. The dimension of the hidden vectors is 512. The model is optimized by Adam with a learning rate of 1e-4. We add dropout layers with a rate of 0.5 everywhere to avoid over-fitting. When generating instructions for evaluation, we use maximum decoding: the word $w_t$ generated at time step t is $\mathrm{argmax}_k\, p(w_{t,k})$. For the Spot-the-Diff dataset, we take the "single sentence decoding" experiment as in Jhamtani and Berg-Kirkpatrick (2018). We also tried different ways to mix the three datasets, but did not see any improvement. We first trained a unified model on the union of these datasets; the metrics dropped considerably because the tasks and language domains (e.g., the word dictionary and lengths of sentences) differ from each other. We next shared only the visual components to overcome the disagreement in language; however, the image domains are still quite different from each other (as shown in Fig. 2). Thus, we finally trained three models separately on the three datasets with minimal cross-dataset modifications.
4.2 Metric-Based Evaluation
As shown in Table 2, we compare the performance of our models on all three datasets with various automated metrics. Results on the test sets are reported. Following the setup in Jhamtani and Berg-Kirkpatrick (2018), we take CIDEr (Vedantam et al., 2015) as the main metric in evaluating the Spot-the-Diff and NLVR2 datasets. However, CIDEr is known for its problem of up-weighting unimportant details (Kilickaya et al., 2017; Liu et al., 2017b). In our dataset, we find that instructions generated from a small set of short phrases can get a high CIDEr score. We thus change the main metric of our dataset to METEOR (Banerjee and Lavie, 2005), which we manually verified to align with human judgment on the validation set of our dataset. To avoid over-fitting, the model is
                 Basic  Full  Both Good  Both Not
Ours (IEdit)      11     24       5        60
Spot-the-Diff     22     37       6        35
NLVR2             24     37      17        22

Table 3: Human evaluation on 100 examples. The image pair and the two captions generated by our basic model and full model are shown to the user. The user chooses one of: 'Basic' model wins, 'Full' model wins, 'Both Good', or 'Both Not'. The better model is marked in bold.
early-stopped based on the main metric on the validation set. We also report BLEU-4 (Papineni et al., 2002) and ROUGE-L (Lin, 2004) scores.
The results on the various datasets show the gradual improvement made by our novel neural components, which are designed to better describe the relationship between two images. Our full model significantly improves over the baseline. The improvement on the NLVR2 dataset is limited because the comparison of the two images is not forced to be considered when generating the instructions.
4.3 Human Evaluation and Qualitative Analysis
We conduct a pairwise human evaluation on our generated sentences, as used in Celikyilmaz et al. (2018) and Pasunuru and Bansal (2017). Agarwala (2018) also shows that pairwise comparison is better than scoring sentences individually. We randomly select 100 examples from the test set of each dataset and generate captions via our full speaker model. We ask users to choose the better instruction between the captions generated by our full model and the basic model, or alternatively to indicate that the two captions are equal in quality. The Image Editing Request dataset is specifically annotated by the image editing expert. The winning rate of our full model (dynamic relational attention) versus the basic model is shown in Table 3. Our full model outperforms the basic model significantly. We also show positive and negative examples generated by our full model in Fig. 5. On our Image Editing Request corpus, the model was able to detect and describe the editing actions, but it failed to handle arbitrarily complex editing actions. We keep these hard examples in our dataset to match real-world requirements and to allow future work to pursue the remaining challenges of this task. Our model is designed for non-localized relationships, and thus we do not explicitly model pixel-level differences; however, we still find that the model can learn these differences on the Spot-the-Diff dataset. Since the descriptions in Spot-the-Diff are relatively simple, the errors mostly come from wrong entities or undetected differences, as shown in Fig. 5. Our model is also sensitive to the image contents, as shown on the NLVR2 dataset.
5 Related Work
In order to learn a robust captioning system, public datasets have been released for diverse tasks including single image captioning (Lin et al., 2014; Plummer et al., 2015; Krishna et al., 2017), video captioning (Xu et al., 2016), referring expressions (Kazemzadeh et al., 2014; Mao et al., 2016), and visual question answering (Antol et al., 2015; Zhu et al., 2016; Johnson et al., 2017). In terms of model progress, recent years have witnessed strong research progress in generating natural language sentences to describe visual contents, such as Vinyals et al. (2015); Xu et al. (2015); Ranzato et al. (2016); Anderson et al. (2018) in single image captioning, Venugopalan et al. (2015); Pan et al. (2016); Pasunuru and Bansal (2017) in video captioning, Mao et al. (2016); Liu et al. (2017a); Yu et al. (2017); Luo and Shakhnarovich (2017) in referring expressions, Jain et al. (2017); Li et al. (2018); Misra et al. (2018) in visual question generation, and Andreas and Klein (2016); Cohn-Gordon et al. (2018); Luo et al. (2018); Vedantam et al. (2017) in other setups.
Single image captioning is the problem most relevant to two-image captioning. Vinyals et al. (2015) created a powerful encoder-decoder (i.e., CNN-to-LSTM) framework for the captioning problem. Xu et al. (2015) further equipped it with an attention module to handle the memorylessness of fixed-size vectors. Ranzato et al. (2016) used reinforcement learning to eliminate exposure bias. Recently, Anderson et al. (2018) brought in information from an object detection system to further boost performance.
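As a concrete illustration, the attention module of Xu et al. (2015) computes, at each decoding step, a weighted sum of spatial image features conditioned on the decoder state. A minimal NumPy sketch of one such step (dimensions and variable names are ours, not from any released implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, hidden, W_f, W_h, v):
    """One soft-attention step (Xu et al., 2015, simplified).

    features: (N, d_f) spatial image features (e.g., N = 14*14 CNN grid cells)
    hidden:   (d_h,)   current LSTM decoder state
    W_f, W_h, v: learned projection parameters of the score function
    Returns the context vector fed to the decoder and the attention weights.
    """
    # Additive (Bahdanau-style) scores over the N feature locations
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (N,)
    alpha = softmax(scores)                               # weights, sum to 1
    context = alpha @ features                            # (d_f,) weighted sum
    return context, alpha

rng = np.random.default_rng(0)
N, d_f, d_h, d_a = 196, 512, 256, 128
feats = rng.standard_normal((N, d_f))
h = rng.standard_normal(d_h)
ctx, alpha = attend(feats, h, rng.standard_normal((d_f, d_a)),
                    rng.standard_normal((d_h, d_a)), rng.standard_normal(d_a))
assert np.isclose(alpha.sum(), 1.0) and ctx.shape == (d_f,)
```

Because the weights are recomputed at every step, the decoder can look back at different image regions for each generated word instead of compressing the image into one fixed-size vector.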
Figure 5: Examples of positive and negative results of our model from the three datasets. Selfies are blurred.
Our model is built on the attentive encoder-decoder model (Xu et al., 2015), the same choice made in Jhamtani and Berg-Kirkpatrick (2018). We apply RL training with the self-critical method (Rennie et al., 2017) but do not see significant improvement, possibly because of the relatively small amount of data compared to MS COCO. We also observe that the detection system of Anderson et al. (2018) fails with high probability on the three datasets; e.g., it cannot detect the small cars and people in the Spot-the-Diff dataset. The DDLA (Difference Description with Latent Alignment) method proposed in Jhamtani and Berg-Kirkpatrick (2018) learns the alignment between descriptions and visual differences; it relies on the nature of that particular dataset and thus cannot easily be transferred to other datasets where the visual relationship is less obvious. Two-image captioning could also be considered a video captioning problem with two key frames, and our sequential multi-head attention is a modified version of the seq-to-seq model (Venugopalan et al., 2015). Some existing work (Chen et al., 2018; Wang et al., 2018; Manjunatha et al., 2018) also learns how to modify images; these datasets and methods focus on image colorization and adjustment tasks, while our dataset targets the general image editing request task.
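The self-critical method (Rennie et al., 2017) mentioned above uses the reward of the model's own greedy decode as the baseline for its sampled captions. A simplified sketch of the resulting loss (array names are ours; a real system would plug in CIDEr or a similar reward):

```python
import numpy as np

def scst_loss(logprobs, rewards_sampled, rewards_greedy):
    """Self-critical sequence training loss (Rennie et al., 2017), simplified.

    logprobs:        (B,) summed log-probabilities of the sampled captions
    rewards_sampled: (B,) reward (e.g., CIDEr) of the sampled captions
    rewards_greedy:  (B,) reward of the greedy-decoded captions (baseline)
    """
    advantage = rewards_sampled - rewards_greedy   # positive -> sample beat the baseline
    return float(-(advantage * logprobs).mean())   # minimizing reinforces better samples
```

Samples that score above the greedy decode get their log-probability pushed up, and worse samples are suppressed, without learning a separate value baseline.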
6 Conclusion
In this paper, we explored the task of describing the visual relationship between two images. We collected the Image Editing Request dataset, which contains image pairs and human-annotated editing instructions. We designed novel relational speaker models and evaluated them on our collected dataset and two existing public datasets. Based on automatic and human evaluations, our relational speaker model improves the ability to capture visual relationships. For future work, we plan to explore merging the three datasets by either learning a joint image representation or transferring domain-specific knowledge. We also aim to enlarge our Image Editing Request dataset with newly released posts on Reddit and Zhopped.
Acknowledgments
We thank the reviewers for their helpful comments and Nham Le for helping with the initial data collection. This work was supported by Adobe, ARO-YIP Award #W911NF-18-1-0336, and faculty awards from Google, Facebook, and Salesforce. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the funding agency.
References
Aseem Agarwala. 2018. Automatic photography with Google Clips.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.
Jacob Andreas and Dan Klein. 2016. Reasoning about pragmatics with neural listeners and speakers. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675.
Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, and Xiaodong Liu. 2018. Language-based image editing with recurrent attentive models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8721–8729.
Reuben Cohn-Gordon, Noah Goodman, and Christopher Potts. 2018. Pragmatically informative image captioning with character-level inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 439–443.
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Unnat Jain, Ziyu Zhang, and Alexander G Schwing. 2017. Creativity: Generating diverse questions using variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6485–6494.
Harsh Jhamtani and Taylor Berg-Kirkpatrick. 2018. Learning to describe differences between pairs of similar images. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4024–4034.
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910.
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798.
Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. 2017. Re-evaluating automatic metrics for image captioning. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 199–209.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.
Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2011. Baby talk: Understanding and generating image descriptions. In Proceedings of the 24th CVPR. Citeseer.
Yikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, Xiaogang Wang, and Ming Zhou. 2018. Visual question generation as dual task of visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6116–6124.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
Jingyu Liu, Liang Wang, and Ming-Hsuan Yang. 2017a. Referring expression generation and comprehension via attributes. In Proceedings of the IEEE International Conference on Computer Vision, pages 4856–4864.
Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2017b. Improved image captioning via policy gradient optimization of SPIDEr. In Proceedings of the IEEE International Conference on Computer Vision, pages 873–881.
Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. 2018. Discriminability objective for training descriptive captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6964–6974.
Ruotian Luo and Gregory Shakhnarovich. 2017. Comprehension-guided referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7102–7111.
Varun Manjunatha, Mohit Iyyer, Jordan Boyd-Graber, and Larry Davis. 2018. Learning to color from language. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 764–769.
Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.
Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, and Laurens van der Maaten. 2018. Learning by asking questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.
Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. 1999. Image-to-word transformation based on dividing and vector quantizing images with words. In First International Workshop on Multimedia Intelligent Storage and Retrieval Management, pages 1–9. Citeseer.
Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. 2016. Hierarchical recurrent neural encoder for video representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1029–1038.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.
Ramakanth Pasunuru and Mohit Bansal. 2017. Reinforced video captioning with entailment rewards. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 979–985.
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649.
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In International Conference on Learning Representations.
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.
Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. 2017. Context-aware captions from context-agnostic supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 251–260.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence — video to text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4534–4542.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.
Hai Wang, Jason D Williams, and SingBing Kang. 2018. Learning to globally edit images with textual description. arXiv preprint arXiv:1810.05786.
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L Berg. 2017. A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7282–7290.
Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004.