

Expressing Visual Relationships via Language

Hao Tan 1,2, Franck Dernoncourt 2, Zhe Lin 2, Trung Bui 2, Mohit Bansal 1
1 UNC Chapel Hill   2 Adobe Research

{haotan, mbansal}@cs.unc.edu, {dernonco, zlin, bui}@adobe.com

Abstract

Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single image captioning. However, in various real applications (e.g., image editing, difference interpretation, and retrieval), generating relational captions for two images can also be very useful. This important problem has not been explored, mostly due to the lack of datasets and effective models. To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions. We then propose a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention. We also extend the model with dynamic relational attention, which calculates visual alignment while decoding. Our models are evaluated on our newly collected dataset and two public datasets consisting of image pairs annotated with relationship sentences. Experimental results, based on both automatic and human evaluation, demonstrate that our model outperforms all baselines and existing methods on all the datasets.1

1 Introduction

Generating captions to describe natural images is a fundamental research problem at the intersection of computer vision and natural language processing. Single image captioning (Mori et al., 1999; Farhadi et al., 2010; Kulkarni et al., 2011) has many practical applications such as text-based image search, photo curation, assisting visually-impaired people, and image understanding in social media.

1 Our data and code are publicly available at: https://github.com/airsplay/VisualRelationships

Figure 1: An example result of our method, showing an input image pair from our Image Editing Request dataset and the output instruction ("Remove the people from the picture.") predicted by our relational speaker model trained on the dataset.

This task has drawn significant attention in the research community with numerous studies (Vinyals et al., 2015; Xu et al., 2015; Anderson et al., 2018), and recent state-of-the-art methods have achieved promising results on large captioning datasets such as MS COCO (Lin et al., 2014). Besides single image captioning, the community has also explored other visual captioning problems such as video captioning (Venugopalan et al., 2015; Xu et al., 2016) and referring expressions (Kazemzadeh et al., 2014; Yu et al., 2017). However, the problem of two-image captioning, especially the task of describing the relationships and differences between two images, is still under-explored. In this paper, we focus on advancing research on this challenging problem by introducing a new dataset and proposing novel neural relational-speaker models.

To the best of our knowledge, Jhamtani and Berg-Kirkpatrick (2018) is the only public dataset aimed at generating natural language descriptions for two real images. This dataset is about 'spotting the difference', and hence focuses more on describing exhaustive differences by learning alignments between multiple text descriptions and multiple image regions; hence the differences are intended to be explicitly identifiable by subtracting the two images.

There are many other tasks that require more diverse, detailed, and implicit relationships between two images. Interpreting image editing effects with instructions is a suitable task for this purpose, because it requires exploiting visual transformations and it is widely used in real life, such as explaining complex image editing effects to laypersons or visually-impaired users, image-edit or tutorial retrieval, and language-guided image editing systems. We first build a new language-guided image editing dataset with high-quality annotations by (1) crawling image pairs from real image editing request websites, (2) annotating editing instructions via Amazon Mechanical Turk, and (3) refining the annotations through experts.

Next, we propose a new neural speaker model for generating sentences that describe the visual relationship between a pair of images. Our model is general and not dependent on any specific dataset. Starting from an attentive encoder-decoder baseline, we first develop a model enhanced with two attention-based neural components, a static relational attention and a sequential multi-head attention, to address these two challenges, respectively. We further extend it by designing a dynamic relational attention module to combine the advantages of these two components, which finds the relationship between two images while decoding. The computation of dynamic relational attention is mathematically equivalent to attention over all visual "relationships". Thus, our method provides a direct way to model visual relationships in language.

To show the effectiveness of our models, we evaluate them on three datasets: our new dataset, the "Spot-the-Diff" dataset (Jhamtani and Berg-Kirkpatrick, 2018), and the two-image visual reasoning NLVR2 dataset (Suhr et al., 2019) (adapted for our task). We train models separately on each dataset with the same hyper-parameters and evaluate them on the same test set across all methods. Experimental results demonstrate that our model outperforms all the baselines and existing methods. The main contributions of our paper are: (1) we create a novel human-language-guided image editing dataset to boost the study of describing visual relationships; (2) we design novel relational-speaker models, including a dynamic relational attention module, to handle the problem of two-image captioning by focusing on all visual relationships between the images; (3) our method is evaluated on several datasets and achieves state-of-the-art results.

2 Datasets

We present the collection process and statistics of our Image Editing Request dataset and briefly introduce two public datasets (viz., Spot-the-Diff and NLVR2). All three datasets are used to study the task of two-image captioning and to evaluate our relational-speaker models. Examples from these three datasets are shown in Fig. 2.

2.1 Image Editing Request Dataset

Each instance in our dataset consists of an image pair (i.e., a source image and a target image) and a corresponding editing instruction which correctly and comprehensively describes the transformation from the source image to the target image. Our collected Image Editing Request dataset will be publicly released along with the scripts to unify it with the other two datasets.

2.1.1 Collection Process

To create a high-quality, diverse dataset, we follow a three-step pipeline: image pair collection, editing instruction annotation, and post-processing by experts (i.e., cleaning and labeling test set annotations).

Image Pairs Collection  We first crawl the editing image pairs (i.e., a source image and a target image) from posts on Reddit (the Photoshop request subreddit)2 and Zhopped3. Posts generally start with an original image and an editing specification. Other users then reply to the posts with their modified images. We collect the original images and modified images as source images and target images, respectively.

Editing Instruction Annotation  The text in the original Reddit and Zhopped posts is too noisy to be used as image editing instructions. To address this problem, we collect the image editing instructions on MTurk using an interactive interface that allows the MTurk annotators to either write an image editing instruction corresponding to a displayed image pair, or flag it as invalid (e.g., if the two images have nothing in common).

2 https://www.reddit.com/r/photoshoprequest
3 http://zhopped.com


Figure 2: Examples from three datasets: our Image Editing Request, Spot-the-Diff, and NLVR2 (shown as four panels: Ours (Image Editing Request), Spot-the-Diff, NLVR2 Captioning, and NLVR2 Classification). Each example involves two natural images and an associated sentence describing their relationship. The task of generating NLVR2 captions is converted from its original classification task.

               B-1  B-2  B-3  B-4  ROUGE-L
Ours            52   34   21   13       45
Spot-the-Diff   41   25   15    8       31
MS COCO         38   22   15    8       34

Table 1: Human agreement on our dataset, compared with Spot-the-Diff and MS COCO (captions = 3). B-1 to B-4 are BLEU-1 to BLEU-4. Our dataset has the highest human agreement.

Post-Processing by Experts  MTurk annotators are not always experts in image editing. To ensure the quality of the dataset, we hire an image editing expert to label each image editing instruction in the dataset as one of the following four options: 1. correct instruction, 2. incomplete instruction, 3. implicit request, 4. other type of error. Only the data instances labeled with "correct instruction" are selected to compose our dataset and are used in training and evaluating our neural speaker model.

Moreover, two additional experts are required to write two more editing instructions (one instruction per expert) for each image pair in the validation and test sets. This process makes the dataset multi-reference, which allows various automatic evaluation metrics, such as BLEU, CIDEr, and ROUGE, to more accurately evaluate the quality of generated sentences.

2.1.2 Dataset Statistics

The Image Editing Request dataset that we have collected and annotated currently contains 3,939 image pairs (3,061 in training, 383 in validation, 495 in test) with 5,695 human-annotated instructions in total. Each image pair in the training set has one instruction, and each image pair in the validation and test sets has three instructions, written by three different annotators. Instructions have an average length of 7.5 words (standard deviation: 4.8). After removing words with fewer than three occurrences, the dataset has a vocabulary of 786 words. The human agreement on our dataset is shown in Table 1. The word frequencies in our dataset are visualized in Fig. 3. Most of the images in our dataset are realistic. Since the task is image editing, target images may have some artifacts (see the Image Editing Request examples in Fig. 2 and Fig. 5).

2.2 Existing Public Datasets

To show the generalization of our speaker model, we also train and evaluate our model on two public datasets, Spot-the-Diff (Jhamtani and Berg-Kirkpatrick, 2018) and NLVR2 (Suhr et al., 2019). Instances in these two datasets are each composed of two natural images and a human-written sentence describing the relationship between the two images.


Figure 3: Word cloud showing the vocabulary frequencies of our Image Editing Request dataset.

To the best of our knowledge, these are the only two public datasets with a reasonable amount of data that are suitable for our task. We next briefly introduce these two datasets.

Spot-the-Diff  This dataset is designed to help generate a set of instructions that can comprehensively describe all visual differences. Thus, the dataset contains images from video-surveillance footage, in which differences can be easily found, because all the differences can be effectively captured by subtracting the two images, as shown in Fig. 2. The dataset contains 13,192 image pairs, and an average of 1.86 captions are collected for each image pair. The dataset is split into training, validation, and test sets with a ratio of 8:1:1.

NLVR2  The original task of the Cornell Natural Language for Visual Reasoning (NLVR2) dataset is visual sentence classification; see Fig. 2 for an example. Given two related images and a natural language statement as inputs, a learned model needs to determine whether the statement correctly describes the visual contents. We convert this classification task to a generation task by taking only the image pairs with correct descriptions. After conversion, the dataset contains 51,020 examples, almost half of the original dataset of 107,296. We also preserve the training, validation, and test split of the original dataset.
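As a rough sketch of this conversion (illustrative only; the field names `label`, `sentence`, `left_img`, and `right_img` are hypothetical, and the released NLVR2 format and the authors' preprocessing scripts may differ), one can keep only the True-labeled statements as captions:

```python
import json

def nlvr2_to_captioning(jsonl_path):
    """Keep only image pairs whose statement is labeled True, turning each
    into a (source image, target image, caption) example.

    Assumes a hypothetical JSON-lines format with 'label', 'sentence',
    'left_img', and 'right_img' fields; the real NLVR2 release may differ.
    """
    examples = []
    with open(jsonl_path) as f:
        for line in f:
            ex = json.loads(line)
            if ex["label"] == "True":            # drop False statements
                examples.append({
                    "img_src": ex["left_img"],   # treated as "source"
                    "img_trg": ex["right_img"],  # treated as "target"
                    "caption": ex["sentence"],
                })
    return examples
```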

3 Relational Speaker Models

In this section, we aim to design a general speaker model that describes the relationship between two images. Due to the different kinds of visual relationships, the roles of the two images vary across tasks: "before" and "after" in Spot-the-Diff, "left" and "right" in NLVR2, "source" and "target" in our Image Editing Request dataset. We use the nomenclature of "source" and "target" for simplicity, but our model is general and not designed for any specific dataset. Formally, the model generates a sentence $\{w_1, w_2, \ldots, w_T\}$ describing the relationship between the source image $I^{\mathrm{SRC}}$ and the target image $I^{\mathrm{TRG}}$, where $\{w_t\}_{t=1}^{T}$ are the word tokens with a total length of $T$. $I^{\mathrm{SRC}}$ and $I^{\mathrm{TRG}}$ are natural images in their raw RGB pixels. In the rest of this section, we first introduce our basic attentive encoder-decoder model and show how we gradually improve it to better fit the task.

3.1 Basic Model

Our basic model (Fig. 4(a)) is similar to the baseline model in Jhamtani and Berg-Kirkpatrick (2018), which is adapted from the attentive encoder-decoder model for single image captioning (Xu et al., 2015). We use ResNet-101 (He et al., 2016) as the feature extractor to encode the source image $I^{\mathrm{SRC}}$ and the target image $I^{\mathrm{TRG}}$. Feature maps of size $N \times N \times 2048$ are extracted, where $N$ is the height or width of the feature map. Each feature in the feature map represents a part of the image. The feature maps are then flattened into two $N^2 \times 2048$ feature sequences $f^{\mathrm{SRC}}$ and $f^{\mathrm{TRG}}$, which are further concatenated into a single feature sequence $f$:

$f^{\mathrm{SRC}} = \mathrm{ResNet}(I^{\mathrm{SRC}})$  (1)

$f^{\mathrm{TRG}} = \mathrm{ResNet}(I^{\mathrm{TRG}})$  (2)

$f = \big[f^{\mathrm{SRC}}_1, \ldots, f^{\mathrm{SRC}}_{N^2}, f^{\mathrm{TRG}}_1, \ldots, f^{\mathrm{TRG}}_{N^2}\big]$  (3)

Figure 4: The evolution diagram of our models for describing visual relationships. One decoding step at time $t$ is shown; linear layers are omitted for clarity. The basic model (a) is an attentive encoder-decoder model, which is enhanced by the multi-head attention (b) and the static relational attention (c). Our best model (d) dynamically computes the relational scores during decoding to avoid losing relationship information.

At each decoding step $t$, the LSTM cell takes the embedding of the previous word $w_{t-1}$ as input. The word $w_{t-1}$ either comes from the ground truth (in training) or is the token with maximal probability (in evaluation). The attention module then attends to the feature sequence $f$ with the hidden output $h_t$ as a query. Inside the attention module, it first computes the alignment scores $\alpha_{t,i}$ between the query $h_t$ and each $f_i$. Next, the feature sequence $f$ is aggregated with a weighted average (with weights $\alpha$) to form the image context $\hat{f}_t$. Lastly, the context $\hat{f}_t$ and the hidden vector $h_t$ are merged into an attentive hidden vector $\hat{h}_t$ with a fully-connected layer:

$w_{t-1} = \mathrm{embedding}(w_{t-1})$  (4)

$h_t, c_t = \mathrm{LSTM}(w_{t-1}, h_{t-1}, c_{t-1})$  (5)

$\alpha_{t,i} = \mathrm{softmax}_i\big(h_t^{\top} W_{\mathrm{IMG}} f_i\big)$  (6)

$\hat{f}_t = \sum_i \alpha_{t,i} f_i$  (7)

$\hat{h}_t = \tanh\big(W_1 [\hat{f}_t; h_t] + b_1\big)$  (8)

The probability of generating the $k$-th word token at time step $t$ is a softmax over a linear transformation of the attentive hidden vector $\hat{h}_t$. The loss $L_t$ is the negative log-likelihood of the ground-truth word token $w^{*}_t$:

$p_t(w_{t,k}) = \mathrm{softmax}_k\big(W_W \hat{h}_t + b_W\big)$  (9)

$L_t = -\log p_t(w^{*}_t)$  (10)
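A minimal PyTorch sketch of one decoding step following Eqs. 4-10 is given below; the hidden size, feature size, and module names are illustrative assumptions rather than the exact configuration used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicAttentiveDecoder(nn.Module):
    """One-step attentive LSTM decoder following Eqs. 4-10 (illustrative sketch)."""
    def __init__(self, vocab_size, hidden=512, feat=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)       # Eq. 4
        self.lstm = nn.LSTMCell(hidden, hidden)              # Eq. 5
        self.w_img = nn.Linear(feat, hidden, bias=False)     # W_IMG in Eq. 6
        self.w1 = nn.Linear(feat + hidden, hidden)           # W_1, b_1 in Eq. 8
        self.w_out = nn.Linear(hidden, vocab_size)           # W_W, b_W in Eq. 9

    def step(self, word_prev, h_prev, c_prev, f):
        """word_prev: (B,) previous tokens; f: (B, 2*N^2, feat) concatenated features."""
        w = self.embed(word_prev)                               # Eq. 4
        h, c = self.lstm(w, (h_prev, c_prev))                   # Eq. 5
        scores = (self.w_img(f) * h.unsqueeze(1)).sum(-1)       # h^T W_IMG f_i, Eq. 6
        alpha = F.softmax(scores, dim=1)                        # Eq. 6
        f_hat = (alpha.unsqueeze(-1) * f).sum(1)                # Eq. 7
        h_hat = torch.tanh(self.w1(torch.cat([f_hat, h], -1)))  # Eq. 8
        log_p = F.log_softmax(self.w_out(h_hat), dim=-1)        # Eq. 9
        return log_p, h, c

# The training loss for step t (Eq. 10) corresponds to F.nll_loss(log_p, gold_word_t).
```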

3.2 Sequential Multi-Head Attention

One weakness of the basic model is that the plain attention module simply takes the concatenated image feature $f$ as input, which does not differentiate between the two images. We thus consider applying a multi-head attention module (Vaswani et al., 2017) to handle this (Fig. 4(b)). Instead of using the simultaneous multi-head attention of the Transformer (Vaswani et al., 2017),4 we implement the multi-head attention in a sequential way. This way, when the model attends to the target image, the contextual information retrieved from the source image is already available, so the model can perform better at differentiation and relationship learning.

4 We also tried the original (simultaneous) multi-head attention, but it is empirically weaker than our sequential multi-head attention.

In detail, the source attention head first attends to the flattened source image feature $f^{\mathrm{SRC}}$. The attention module is built in the same way as in Sec. 3.1, except that it now only attends to the source image:

$\alpha^{\mathrm{SRC}}_{t,i} = \mathrm{softmax}_i\big(h_t^{\top} W_{\mathrm{SRC}} f^{\mathrm{SRC}}_i\big)$  (11)

$\hat{f}^{\mathrm{SRC}}_t = \sum_i \alpha^{\mathrm{SRC}}_{t,i} f^{\mathrm{SRC}}_i$  (12)

$\hat{h}^{\mathrm{SRC}}_t = \tanh\big(W_2 [\hat{f}^{\mathrm{SRC}}_t; h_t] + b_2\big)$  (13)

The target attention head then takes the output of the source attention, $\hat{h}^{\mathrm{SRC}}_t$, as a query to retrieve appropriate information from the target feature $f^{\mathrm{TRG}}$:

$\alpha^{\mathrm{TRG}}_{t,j} = \mathrm{softmax}_j\big(\hat{h}^{\mathrm{SRC}\,\top}_t W_{\mathrm{TRG}} f^{\mathrm{TRG}}_j\big)$  (14)

$\hat{f}^{\mathrm{TRG}}_t = \sum_j \alpha^{\mathrm{TRG}}_{t,j} f^{\mathrm{TRG}}_j$  (15)

$\hat{h}^{\mathrm{TRG}}_t = \tanh\big(W_3 [\hat{f}^{\mathrm{TRG}}_t; \hat{h}^{\mathrm{SRC}}_t] + b_3\big)$  (16)

In place of $\hat{h}_t$, the output of the target head $\hat{h}^{\mathrm{TRG}}_t$ is used to predict the next word.5

5 We tried exchanging the order of the two heads, and also using both orders concurrently; we did not see any significant difference in results between them.
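The two sequential heads of Eqs. 11-16 can be sketched as a standalone PyTorch module as follows; the bilinear scores are parameterized with nn.Linear layers, and the tensor shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequentialTwoHeadAttention(nn.Module):
    """Source head first, then target head queried by the source output (Eqs. 11-16)."""
    def __init__(self, hidden=512, feat=2048):
        super().__init__()
        self.w_src = nn.Linear(feat, hidden, bias=False)   # W_SRC, Eq. 11
        self.w_trg = nn.Linear(feat, hidden, bias=False)   # W_TRG, Eq. 14
        self.w2 = nn.Linear(feat + hidden, hidden)         # Eq. 13
        self.w3 = nn.Linear(feat + hidden, hidden)         # Eq. 16

    def forward(self, h, f_src, f_trg):
        """h: (B, hidden); f_src, f_trg: (B, N^2, feat)."""
        # Source head (Eqs. 11-13).
        a_src = F.softmax((self.w_src(f_src) * h.unsqueeze(1)).sum(-1), dim=1)
        f_src_hat = (a_src.unsqueeze(-1) * f_src).sum(1)
        h_src = torch.tanh(self.w2(torch.cat([f_src_hat, h], -1)))
        # Target head, queried by the source-head output (Eqs. 14-16).
        a_trg = F.softmax((self.w_trg(f_trg) * h_src.unsqueeze(1)).sum(-1), dim=1)
        f_trg_hat = (a_trg.unsqueeze(-1) * f_trg).sum(1)
        h_trg = torch.tanh(self.w3(torch.cat([f_trg_hat, h_src], -1)))
        return h_trg   # used in place of the attentive hidden vector to predict the next word
```

Attending to the source image first means the target head's query already carries source context, which is what allows the second head to focus on differences rather than on each image in isolation.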

3.3 Static Relational Attention

Although the sequential multi-head attention model can learn to differentiate the two images, visual relationships are not explicitly examined. We thus allow the model to statically (i.e., not during decoding) compute the relational scores between the source and target feature sequences and reduce the scores into two relationship-aware feature sequences. We apply a bi-directional relational attention (Fig. 4(c)) for this purpose: one direction from the source to the target, and one from the target to the source. For each feature in the source feature sequence, the source-to-target attention computes its alignment with the features in the target feature sequence. The source feature, the attended target feature, and the difference between them are then merged together with a fully-connected layer:

$\alpha^{S \to T}_{i,j} = \mathrm{softmax}_j\big((W_S f^{\mathrm{SRC}}_i)^{\top}(W_T f^{\mathrm{TRG}}_j)\big)$  (17)

$f^{S \to T}_i = \sum_j \alpha^{S \to T}_{i,j} f^{\mathrm{TRG}}_j$  (18)

$\hat{f}^{S}_i = \tanh\big(W_4 [f^{\mathrm{SRC}}_i; f^{S \to T}_i] + b_4\big)$  (19)

We decompose the attention weight into two small matrices $W_S$ and $W_T$ so as to reduce the number of parameters, because the dimension of the image feature is usually large. The target-to-source cross-attention is built the opposite way: it takes each target feature $f^{\mathrm{TRG}}_j$ as a query, attends to the source feature sequence, and gets the attentive feature $\hat{f}^{T}_j$. We then use these two bidirectional attentive sequences $\hat{f}^{S}_i$ and $\hat{f}^{T}_j$ in the multi-head attention module (shown in the previous subsection) at each decoding step.
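A sketch of the bidirectional static relational attention (Eqs. 17-19), computed once per image pair before decoding, is shown below; the low-rank factors W_S and W_T follow the text, while the dimensions, the concatenation following Eq. 19, and the symmetric target-to-source parameterization are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticRelationalAttention(nn.Module):
    """Bidirectional cross-image attention computed once before decoding (Eqs. 17-19)."""
    def __init__(self, feat=2048, rank=512):
        super().__init__()
        # Low-rank factors W_S, W_T keep the bilinear score cheap for large feature dims.
        self.w_s = nn.Linear(feat, rank, bias=False)
        self.w_t = nn.Linear(feat, rank, bias=False)
        self.w4 = nn.Linear(2 * feat, feat)       # source-to-target merge, Eq. 19
        self.w4_rev = nn.Linear(2 * feat, feat)   # target-to-source merge (assumed symmetric)

    def forward(self, f_src, f_trg):
        """f_src, f_trg: (B, N^2, feat). Returns two relationship-aware sequences."""
        scores = torch.einsum('bik,bjk->bij', self.w_s(f_src), self.w_t(f_trg))
        # Source-to-target direction (Eqs. 17-19).
        a_s2t = F.softmax(scores, dim=2)                     # over target positions j
        f_s2t = torch.einsum('bij,bjk->bik', a_s2t, f_trg)   # Eq. 18
        f_src_rel = torch.tanh(self.w4(torch.cat([f_src, f_s2t], -1)))
        # Target-to-source direction, built the opposite way.
        a_t2s = F.softmax(scores, dim=1)                     # over source positions i
        f_t2s = torch.einsum('bij,bik->bjk', a_t2s, f_src)
        f_trg_rel = torch.tanh(self.w4_rev(torch.cat([f_trg, f_t2s], -1)))
        return f_src_rel, f_trg_rel
```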

3.4 Dynamic Relational Attention

The static relational attention module compresses the pairwise relationships (of size $N^4$) into two relationship-aware feature sequences (of size $2 \times N^2$). The compression saves computational resources but has a potential drawback of information loss, as discussed in Bahdanau et al. (2015) and Xu et al. (2015). In order to avoid losing information, we modify the static relational attention module into its dynamic version, which calculates the relational scores while decoding (Fig. 4(d)).

At each decoding step $t$, the dynamic relational attention calculates the alignment score $a_{t,i,j}$ between three vectors: a source feature $f^{\mathrm{SRC}}_i$, a target feature $f^{\mathrm{TRG}}_j$, and the hidden state $h_t$. Since the dot product used in the previous attention modules does not have a direct extension to three vectors, we extend the dot product and use it to compute the three-vector alignment score:

$\mathrm{dot}(x, y) = \sum_d x_d\, y_d = x^{\top} y$  (20)

$\mathrm{dot}^{*}(x, y, z) = \sum_d x_d\, y_d\, z_d = (x \odot y)^{\top} z$  (21)

$a_{t,i,j} = \mathrm{dot}^{*}\big(W_{SK} f^{\mathrm{SRC}}_i,\; W_{TK} f^{\mathrm{TRG}}_j,\; W_{HK} h_t\big)$  (22)

$a_{t,i,j} = \big(W_{SK} f^{\mathrm{SRC}}_i \odot W_{TK} f^{\mathrm{TRG}}_j\big)^{\top} W_{HK} h_t$  (23)

where $\odot$ is the element-wise multiplication. The alignment scores (of size $N^4$) are normalized by a softmax, and the attention information is fused into the attentive hidden vector $\hat{f}^{D}_t$ as before:

$\alpha_{t,i,j} = \mathrm{softmax}_{i,j}\big(a_{t,i,j}\big)$  (24)

$\hat{f}^{\mathrm{SRC-D}}_t = \sum_{i,j} \alpha_{t,i,j} f^{\mathrm{SRC}}_i$  (25)

$\hat{f}^{\mathrm{TRG-D}}_t = \sum_{i,j} \alpha_{t,i,j} f^{\mathrm{TRG}}_j$  (26)

$\hat{f}^{D}_t = \tanh\big(W_5 [\hat{f}^{\mathrm{SRC-D}}_t; \hat{f}^{\mathrm{TRG-D}}_t; h_t] + b_5\big)$  (27)

$\hat{f}^{D}_t = \tanh\big(W_{5S} \hat{f}^{\mathrm{SRC-D}}_t + W_{5T} \hat{f}^{\mathrm{TRG-D}}_t + W_{5H} h_t + b_5\big)$  (28)

where $W_{5S}$, $W_{5T}$, $W_{5H}$ are sub-matrices of $W_5$, i.e., $W_5 = [W_{5S}, W_{5T}, W_{5H}]$.

According to Eqn. 23 and Eqn. 28, we find an analog of conventional attention layers with the following specifications:

• Query: $h_t$

• Key: $W_{SK} f^{\mathrm{SRC}}_i \odot W_{TK} f^{\mathrm{TRG}}_j$

• Value: $W_{5S} f^{\mathrm{SRC}}_i + W_{5T} f^{\mathrm{TRG}}_j$

The key $W_{SK} f^{\mathrm{SRC}}_i \odot W_{TK} f^{\mathrm{TRG}}_j$ and the value $W_{5S} f^{\mathrm{SRC}}_i + W_{5T} f^{\mathrm{TRG}}_j$ can be considered as representations of the visual relationship between $f^{\mathrm{SRC}}_i$ and $f^{\mathrm{TRG}}_j$. The module is thus a direct attention over the visual relationships between the source and target images, and hence is suitable for the task of generating relationship descriptions.
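A sketch of one dynamic relational attention step (Eqs. 20-28) is given below: the key is the element-wise product of the projected source and target features, the softmax runs jointly over all N^2 x N^2 pairs, and the attended values are fused with the hidden state as in Eq. 27. Projection sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicRelationalAttention(nn.Module):
    """Attention over all source-target feature pairs at each decoding step (Eqs. 20-28)."""
    def __init__(self, hidden=512, feat=2048):
        super().__init__()
        self.w_sk = nn.Linear(feat, hidden, bias=False)    # W_SK
        self.w_tk = nn.Linear(feat, hidden, bias=False)    # W_TK
        self.w_hk = nn.Linear(hidden, hidden, bias=False)  # W_HK
        self.w5 = nn.Linear(2 * feat + hidden, hidden)     # W_5, b_5 (Eq. 27)

    def forward(self, h, f_src, f_trg):
        """h: (B, hidden); f_src, f_trg: (B, N^2, feat)."""
        # Pairwise relationship keys: (B, N^2, N^2, hidden).
        key = self.w_sk(f_src).unsqueeze(2) * self.w_tk(f_trg).unsqueeze(1)
        a = torch.einsum('bijd,bd->bij', key, self.w_hk(h))        # Eqs. 22-23
        alpha = F.softmax(a.flatten(1), dim=1).view_as(a)          # Eq. 24, joint over (i, j)
        f_src_d = torch.einsum('bij,bik->bk', alpha, f_src)        # Eq. 25
        f_trg_d = torch.einsum('bij,bjk->bk', alpha, f_trg)        # Eq. 26
        fused = torch.tanh(self.w5(torch.cat([f_src_d, f_trg_d, h], -1)))  # Eq. 27
        return fused
```

Note that materializing the pairwise key tensor is the memory cost of attending over all N^2 x N^2 relationships; the static variant avoids this at decoding time by compressing the relational scores beforehand, at the price of possible information loss discussed above.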


Method                                        BLEU-4  CIDEr  METEOR  ROUGE-L
Our Dataset (Image Editing Request)
  basic model                                   5.04   21.58   11.58    34.66
  +multi-head att                               6.13   22.82   11.76    35.13
  +static rel-att                               5.76   20.70   12.59    35.46
  -static +dynamic rel-att                      6.72   26.36   12.80    37.25
Spot-the-Diff
  CAPT (Jhamtani and Berg-Kirkpatrick, 2018)    7.30   26.30   10.50    25.60
  DDLA (Jhamtani and Berg-Kirkpatrick, 2018)    8.50   32.80   12.00    28.60
  basic model                                   5.68   22.20   10.98    24.21
  +multi-head att                               7.52   31.39   11.64    26.96
  +static rel-att                               8.31   33.98   12.95    28.26
  -static +dynamic rel-att                      8.09   35.25   12.20    31.38
NLVR2
  basic model                                   5.04   43.39   10.82    22.19
  +multi-head att                               5.11   44.80   10.72    22.60
  +static rel-att                               4.95   45.67   10.89    22.69
  -static +dynamic rel-att                      5.00   46.41   10.37    22.94

Table 2: Automatic metrics of test results on the three datasets. Best results on the main metric are marked in bold font. Our full model is the best on all three datasets under the main metric.


4 Results

To evaluate the performance of our relational speaker models (Sec. 3), we trained them on all three datasets (Sec. 2). We evaluate our models based on both automatic metrics and pairwise human evaluation. We also show generated examples for each dataset.

4.1 Experimental Setup

We use the same hyperparameters when applying our model to the three datasets. The dimension of the hidden vectors is 512. The model is optimized by Adam with a learning rate of 1e-4. We add dropout layers with rate 0.5 everywhere to avoid over-fitting. When generating instructions for evaluation, we use maximum decoding: the word $w_t$ generated at time step $t$ is $\arg\max_k p_t(w_{t,k})$. For the Spot-the-Diff dataset, we follow the "single sentence decoding" experiment in Jhamtani and Berg-Kirkpatrick (2018). We also tried different ways to mix the three datasets but did not see any improvement. We first trained a unified model on the union of these datasets; the metrics drop a lot because the tasks and language domains (e.g., the word dictionaries and sentence lengths) differ from each other. We next shared only the visual components to overcome the disagreement in language; however, the image domains are still quite different from each other (as shown in Fig. 2). Thus, we finally train three models separately on the three datasets with minimal cross-dataset modifications.
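The maximum-decoding procedure can be sketched as a simple greedy loop; the decoder interface (a step function as in the basic-model sketch above), the zero initialization of the hidden state, and the BOS/EOS token ids are assumptions for illustration:

```python
import torch

def greedy_decode(decoder, f, bos_id, eos_id, max_len=30, hidden=512):
    """Greedy decoding: at each step keep the argmax word, w_t = argmax_k p_t(w_{t,k}).
    `decoder` is assumed to expose step(word_prev, h, c, f) as in the basic-model sketch;
    the hidden-state initialization is simplified to zeros here."""
    B = f.size(0)
    h = torch.zeros(B, hidden)
    c = torch.zeros(B, hidden)
    words = torch.full((B,), bos_id, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        log_p, h, c = decoder.step(words, h, c, f)
        words = log_p.argmax(dim=-1)          # maximum-probability token
        outputs.append(words)
        if (words == eos_id).all():           # stop once every sequence emits EOS
            break
    return torch.stack(outputs, dim=1)        # (B, <=max_len) token ids
```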

4.2 Metric-Based Evaluation

As shown in Table 2, we compare the performance of our models on all three datasets with various automated metrics; results on the test sets are reported. Following the setup in Jhamtani and Berg-Kirkpatrick (2018), we take CIDEr (Vedantam et al., 2015) as the main metric when evaluating the Spot-the-Diff and NLVR2 datasets. However, CIDEr is known to have problems with up-weighting unimportant details (Kilickaya et al., 2017; Liu et al., 2017b). In our dataset, we find that instructions generated from a small set of short phrases can get a high CIDEr score. We thus change the main metric of our dataset to METEOR (Banerjee and Lavie, 2005), which we manually verified to be aligned with human judgment on the validation set of our dataset. To avoid over-fitting, the model is early-stopped based on the main metric on the validation set. We also report BLEU-4 (Papineni et al., 2002) and ROUGE-L (Lin, 2004) scores.
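For reference, multi-reference sentence-level BLEU can be computed with NLTK as in the sketch below; this is only an illustration of the metric setup with a made-up hypothesis and references, not the evaluation code used for the reported numbers:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Three hypothetical human references for one validation image pair.
references = [
    "remove the people from the picture".split(),
    "erase the people in the background".split(),
    "delete all people from the photo".split(),
]
hypothesis = "remove the people in the picture".split()

# BLEU-4 with uniform 4-gram weights; smoothing avoids zero scores on short sentences.
score = sentence_bleu(references, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```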


               Basic  Full  Both Good  Both Not
Ours (IEdit)      11    24          5        60
Spot-the-Diff     22    37          6        35
NLVR2             24    37         17        22

Table 3: Human evaluation on 100 examples per dataset. The image pair and the two captions generated by our basic model and full model are shown to the user, who chooses one of 'Basic' model wins, 'Full' model wins, 'Both Good', or 'Both Not'. The better model is marked in bold font.


The results on the various datasets show the gradual improvement made by our novel neural components, which are designed to better describe the relationship between two images. Our full model achieves a significant improvement over the baseline. The improvement on the NLVR2 dataset is limited because comparing the two images is not strictly required when generating its captions.

4.3 Human Evaluation and Qualitative Analysis

We conduct a pairwise human evaluation on our generated sentences, as used in Celikyilmaz et al. (2018) and Pasunuru and Bansal (2017). Agarwala (2018) also shows that pairwise comparison is better than scoring sentences individually. We randomly select 100 examples from the test set of each dataset and generate captions via our full speaker model. We ask users to choose the better instruction between the captions generated by our full model and the basic model, or alternatively indicate that the two captions are equal in quality. The Image Editing Request dataset is specifically annotated by the image editing expert. The winning rate of our full model (dynamic relational attention) versus the basic model is shown in Table 3. Our full model outperforms the basic model significantly. We also show positive and negative examples generated by our full model in Fig. 5. On our Image Editing Request corpus, the model was able to detect and describe the editing actions but failed on arbitrarily complex editing actions. We keep these hard examples in our dataset to match real-world requirements and to allow follow-up work to pursue the remaining challenges in this task. Our model is designed for non-localized relationships, and thus we do not explicitly model pixel-level differences; however, we still find that the model can learn these differences in the Spot-the-Diff dataset. Since the descriptions in Spot-the-Diff are relatively simple, the errors mostly come from wrong entities or undetected differences, as shown in Fig. 5. Our model is also sensitive to the image contents, as shown on the NLVR2 dataset.

5 Related Work

In order to learn robust captioning systems, public datasets have been released for diverse tasks including single image captioning (Lin et al., 2014; Plummer et al., 2015; Krishna et al., 2017), video captioning (Xu et al., 2016), referring expressions (Kazemzadeh et al., 2014; Mao et al., 2016), and visual question answering (Antol et al., 2015; Zhu et al., 2016; Johnson et al., 2017). In terms of model progress, recent years have witnessed strong research progress in generating natural language sentences to describe visual contents, such as Vinyals et al. (2015); Xu et al. (2015); Ranzato et al. (2016); Anderson et al. (2018) in single image captioning, Venugopalan et al. (2015); Pan et al. (2016); Pasunuru and Bansal (2017) in video captioning, Mao et al. (2016); Liu et al. (2017a); Yu et al. (2017); Luo and Shakhnarovich (2017) in referring expressions, Jain et al. (2017); Li et al. (2018); Misra et al. (2018) in visual question generation, and Andreas and Klein (2016); Cohn-Gordon et al. (2018); Luo et al. (2018); Vedantam et al. (2017) in other setups.

Single image captioning is the problem most relevant to two-image captioning. Vinyals et al. (2015) created a powerful encoder-decoder (i.e., CNN-to-LSTM) framework for solving the captioning problem. Xu et al. (2015) further equipped it with an attention module to handle the memorylessness of fixed-size vectors. Ranzato et al. (2016) used reinforcement learning to eliminate exposure bias. Recently, Anderson et al. (2018) brought in information from an object detection system to further boost performance.

Our model is built on the attentive encoder-decoder model (Xu et al., 2015), which is also the choice in Jhamtani and Berg-Kirkpatrick (2018). We applied RL training with self-critical sequence training (Rennie et al., 2017) but did not see significant improvement, possibly because of the relatively small amount of data compared to MS COCO. We also observe that the detection system in Anderson et al. (2018) is likely to fail on the three datasets; e.g., it cannot detect the small cars and people in the Spot-the-Diff dataset. The DDLA (Difference Description with Latent Alignment) method proposed in Jhamtani and Berg-Kirkpatrick (2018) learns the alignment between descriptions and visual differences. It relies on the nature of that particular dataset and thus cannot be easily transferred to other datasets where the visual relationship is not obvious. Two-image captioning could also be considered as a two-key-frame video captioning problem, and our sequential multi-head attention is a modified version of the seq-to-seq model (Venugopalan et al., 2015). Some existing work (Chen et al., 2018; Wang et al., 2018; Manjunatha et al., 2018) also learns how to modify images. These datasets and methods focus on image colorization and adjustment tasks, while our dataset aims to study the general image editing request task.

Figure 5: Examples of positive and negative results of our model from the three datasets. Selfies are blurred.

6 Conclusion

In this paper, we explored the task of describing the visual relationship between two images. We collected the Image Editing Request dataset, which contains image pairs and human-annotated editing instructions. We designed novel relational speaker models and evaluated them on our collected dataset and other existing public datasets. Based on automatic and human evaluations, our relational speaker model improves the ability to capture visual relationships. For future work, we plan to further explore the possibility of merging the three datasets, either by learning a joint image representation or by transferring domain-specific knowledge. We also aim to enlarge our Image Editing Request dataset with newly released posts from Reddit and Zhopped.

Acknowledgments

We thank the reviewers for their helpful comments and Nham Le for helping with the initial data collection. This work was supported by Adobe, ARO-YIP Award #W911NF-18-1-0336, and faculty awards from Google, Facebook, and Salesforce. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the funding agency.

References

Aseem Agarwala. 2018. Automatic photography with Google Clips.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086.

Jacob Andreas and Dan Klein. 2016. Reasoning about pragmatics with neural listeners and speakers. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675.

Jianbo Chen, Yelong Shen, Jianfeng Gao, Jingjing Liu, and Xiaodong Liu. 2018. Language-based image editing with recurrent attentive models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8721–8729.

Reuben Cohn-Gordon, Noah Goodman, and Christopher Potts. 2018. Pragmatically informative image captioning with character-level inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 439–443.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Unnat Jain, Ziyu Zhang, and Alexander G Schwing. 2017. Creativity: Generating diverse questions using variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6485–6494.

Harsh Jhamtani and Taylor Berg-Kirkpatrick. 2018. Learning to describe differences between pairs of similar images. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4024–4034.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910.

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798.

Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. 2017. Re-evaluating automatic metrics for image captioning. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 199–209.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2011. Baby talk: Understanding and generating image descriptions. In Proceedings of the 24th CVPR. Citeseer.

Yikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, Xiaogang Wang, and Ming Zhou. 2018. Visual question generation as dual task of visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6116–6124.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Jingyu Liu, Liang Wang, and Ming-Hsuan Yang. 2017a. Referring expression generation and comprehension via attributes. In Proceedings of the IEEE International Conference on Computer Vision, pages 4856–4864.

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2017b. Improved image captioning via policy gradient optimization of SPIDEr. In Proceedings of the IEEE International Conference on Computer Vision, pages 873–881.

Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. 2018. Discriminability objective for training descriptive captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6964–6974.

Ruotian Luo and Gregory Shakhnarovich. 2017. Comprehension-guided referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7102–7111.

Varun Manjunatha, Mohit Iyyer, Jordan Boyd-Graber, and Larry Davis. 2018. Learning to color from language. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 764–769.


Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.

Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, and Laurens van der Maaten. 2018. Learning by asking questions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.

Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. 1999. Image-to-word transformation based on dividing and vector quantizing images with words. In First International Workshop on Multimedia Intelligent Storage and Retrieval Management, pages 1–9. Citeseer.

Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. 2016. Hierarchical recurrent neural encoder for video representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1029–1038.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Ramakanth Pasunuru and Mohit Bansal. 2017. Reinforced video captioning with entailment rewards. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 979–985.

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In International Conference on Learning Representations.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.

Alane Suhr, Stephanie Zhou, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. 2017. Context-aware captions from context-agnostic supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 251–260.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4534–4542.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

Hai Wang, Jason D Williams, and SingBing Kang. 2018. Learning to globally edit images with textual description. arXiv preprint arXiv:1810.05786.

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L Berg. 2017. A joint speaker-listener-reinforcer model for referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7282–7290.

Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004.