arXiv:2109.05281v1 [cs.CL] 11 Sep 2021

COSMic: A Coherence-Aware Generation Metric for Image Descriptions

Mert İnan, University of Pittsburgh, [email protected]
Piyush Sharma, Google Research, [email protected]
Baber Khalid, Rutgers University, [email protected]
Radu Soricut, Google Research, [email protected]
Matthew Stone, Rutgers University, [email protected]
Malihe Alikhani, University of Pittsburgh, [email protected]

Abstract

Developers of text generation models rely on automated evaluation metrics as a stand-in for slow and expensive manual evaluations. However, image captioning metrics have struggled to give accurate learned estimates of the semantic and pragmatic success of output text. We address this weakness by introducing the first discourse-aware learned generation metric for evaluating image descriptions. Our approach is inspired by computational theories of discourse for capturing information goals using coherence. We present a dataset of image–description pairs annotated with coherence relations. We then train a coherence-aware metric on a subset of the Conceptual Captions dataset and measure its effectiveness (its ability to predict human ratings of output captions) on a test set composed of out-of-domain images. We demonstrate a higher Kendall Correlation Coefficient for our proposed metric with the human judgments for the results of a number of state-of-the-art coherence-aware caption generation models when compared to several other metrics, including recently proposed learned metrics such as BLEURT and BERTScore.

1 Introduction

An investigation of the descriptions used with images on the web shows that image descriptions can have different functions and goals (Kruk et al., 2019a; Alikhani et al., 2020). For instance, captions may describe visible entities, activities, and relationships, provide background information that goes beyond what's visible, or report the writer's own subjective reactions to what's displayed. By drawing on such diverse examples, image captioning models can learn the different inferential links between text and images and use that information at generation time to produce descriptions that can fulfill different discourse goals and inject the desired context into their output (Papineni et al., 2002; Lin, 2004; Denkowski and Lavie, 2014; Anderson et al., 2016a).

Caption | Coh. | CIDEr | COSMic
Model: "first flower of the year" | Story | 0.000 | 0.653
Human: "close-up of pink flowers" | Visible | - | -

Figure 1: A comparison of the scores for a generated (Model) caption that has a different coherence relation than the reference (Human) caption. "Coh." represents the coherence labels for the generated and reference captions. Our coherence-aware metric COSMic is aware of the different information goals for these captions, and assigns a more adequate score when comparing the Model caption against the Human caption. In this case, where the caption does not just describe the image but elaborates on it, our metric recognizes that the model output is potentially successful. (Photo credit: Moorthy Gounder)

So far, however, efforts to develop such expressive captioning models have been hindered by the lack of automatic metrics that can evaluate their output with respect to their information goals in context. Previous approaches to automatic caption evaluation have mostly focused on n-gram measures of similarity to reference output (Vedantam et al., 2014); such surface-level models fail to deal with the lexical and syntactic diversity of image descriptions. More recent approaches more closely approximate semantic similarity using word embedding-based techniques.


These models show robust performance and achieve a higher correlation with human judgments than that of previous metrics. Nevertheless, they too fail to generalize to the different kinds of content that successful descriptions may exhibit across different goals and contexts. That is, they cannot distinguish reasonable descriptions that happen to differ from reference output in their goals and perspective from problematic descriptions that hallucinate inappropriate content or context.

To bridge this gap, we present a coherence-aware, embedding-based generation metric that learns to respect diverse discourse goals without penalizing captions that are purposefully generated to fulfill different purposes or communicate background information. Figure 1 demonstrates this capability by presenting an example image and captions with different coherence labels, together with their scores.

Our approach to modeling discourse goals is based on the framework of discourse coherence theory (Hobbs, 1985), which characterizes the inferences that give discourse units a coherent joint interpretation using a constrained inventory of coherence relations. In particular, we use the taxonomy for image–text coherence developed by Alikhani et al. (2020), which, for example, includes Visible, Story, and Subjective relations between the text and the image. A description and an image stand in a Visible relation if the text includes information that is recognizably depicted in the image. Subjective captions react to the content of the image, and Story captions provide a free-standing description of the circumstances depicted in the image, similar to the Narration relation in text. Our metric is learned in part from a new dataset of 4,000 images with descriptions labeled with different coherence labels in this taxonomy.

In inaugurating the study of coherence-aware generation metrics, we make the following specific contributions. In Section 3, we present two different annotated datasets for training and testing a coherence-aware metric. We present a model to score a generated caption given the image, the reference caption, and the discourse goals of both these captions (Section 4). We compare this metric to previous ones using a common methodology: ranking the performance of several different caption generation systems on out-of-domain images, relying on a new benchmark out-of-domain test set, which we publish, providing reference captions for a subset of OpenImages (Kuznetsova et al., 2020b).

Our experiments demonstrate that, among all these metrics, our proposed metric has the highest correlation with human judgments.

2 Related work

There are diverse ways of characterizing the contributions of text and imagery. Gao et al. (2015) investigate the genre of image captions, and Huang and Kovashka (2016) study the persuasive implicit relationships between text and images. Kruk et al. (2019b) study the emotional links between text and images. Otto et al. (2019) present an annotated dataset of text and imagery that compares the information load in text and images. However, we build on works that study information-level inferences between discourse units in different modalities, such as comic book panels (McCloud, 1993), movie plots (Cumming et al., 2017), and diagrammatic elements (Hiippala et al., 2021). In particular, we use Alikhani et al. (2020)'s relations that characterize inferences between text and images.

Coherence-aware models have benefited several NLP tasks such as gesture interpretation (Lascarides and Stone, 2009; Pustejovsky and Krishnaswamy, 2020), text summarization (Xu et al., 2019), and machine comprehension (Gao et al., 2020). The majority of these works use Rhetorical Structure Theory (RST) (Mann and Thompson, 1987) and Penn Discourse TreeBank (PDTB) (Prasad et al., 2008b) datasets to learn and predict these relations between two adjacent text spans. In this line of work, we are the first to present a coherence-aware generation metric.

The most widely used automatic evaluation metrics are n-gram-based; they compute the exact number of n-gram matches between reference and generated text (Cui et al., 2018). Examples of such metrics that are commonly used for evaluating the output of captioning, translation, and summarization models are BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and CIDEr (Vedantam et al., 2015). The major problem of the n-gram similarity metrics is that they give no credit to synonym matches of reference n-grams, even if those words are common and used appropriately in the generated text. Embedding-based metrics such as BLEURT (Sellam et al., 2020) and BERTScore (Zhang et al., 2020), designed to address this limitation, are closer to human ratings. BLEURT is a data-intensive training scheme that is based on BERT (Devlin et al., 2019), fine-tuned on human ratings of generated text.

Facade of a glass building
A pink flower bush in a garden
The underside of the Arc de Triomphe
Close-up of a fly sitting on a daisy
Man sitting by his artwork, looking at a large statue of a man on a horse in a royal courtyard
Woman with an umbrella reading a book, sitting in the grass in front of a city skyline
Cowboy on a horse and cowboy on the ground working together to lasso a calf in a pen
Black and white artwork painted on a blue wall

Figure 2: Examples of the ground truth captions that we collected for the COIN dataset. (Photo credits, from left to right, top to bottom: Sharron Mollerus, Northfielder, George M. Groutas, davebloggs007, Tim Adams, Brisbane City Council, Colin Brown, Guilhem Vellut)

BERTScore, however, computes the similarity score as the average of cosine similarities between predicted tokens and their top matching reference tokens. These metrics, however, do not respect the information goal and the purpose for which the model has generated the text. We address this problem by introducing the first coherence-aware generation metric. Similar to SPICE (Anderson et al., 2016b) and VIFIDEL (Madhyastha et al., 2019), we use the information encoded in images. We further propose the addition of coherence relations that facilitate learning with fewer samples by a multimodal metric using pre-trained BERT and ViLBERT.

3 Data Collection

We collect two datasets: human judgments for image captions that are generated by coherence-aware captioning systems using the Conceptual Captions dataset, and ground-truth labels for the OpenImages dataset. With the Conceptual Captions corpus, we fine-tune ViLBERT with ratings and show that the addition of coherence relations can make automated scoring closer to human scoring. We use the OpenImages corpus to reinforce that multimodality and coherence relations make significant contributions to scoring out-of-domain datasets as well.

Protocol. We hired two expert linguists for data annotation and designed an annotation website to facilitate the annotation procedure. They are native English speakers who identify themselves as of White and Latino ethnicity. The code1 of the annotation website and the details of the protocol are publicly available. The study has been approved by our institution's human subject board.

Conceptual Captions Score Annotation. We have collected ratings on the quality of different image descriptions with coherence labels for a subset of 1,000 images from the Conceptual Captions (CC) training dataset (Ng et al., 2020). With this paper, we are publishing this dataset as a benchmark for evaluation metrics that are coherence-aware. The set-up of the data collection is as follows. CC images are input into a caption-generation model created by Alikhani et al. (2020). This model generates coherence-aware descriptions for input images in 4 different coherence classes: Meta, Visible, Subjective, and Story. These 4,000 image-caption pairs are then presented to human annotators, who are asked to select the correct coherence label for each pair:

• Meta: the caption talks about when, where, and how the picture was taken; Meta-talk in Schiffrin (1980).

• Visible: the caption is true just by looking at the picture; Restatement relation in Prasad et al. (2008a).

• Subjective: the caption is a matter of opinion; Evaluation relation in Hobbs (1985).

• Story: text and image work like story and illustration; Occasion relation in Hobbs (1985).

1 https://github.com/Merterm/COSMic

Figure 3: An illustration of the different flavors of COSMic, which output a score for the generated caption given the image, the reference caption, and the coherence labels for both captions. (a) COSMic Vanilla uses only global textual and visual features, while (b) COSMic ViLBERT uses combined visio-linguistic features with both local and global focus. This model takes into account the information goals (determined by coherence labels) for both captions when comparing the generated caption to the reference for evaluation.

After the annotator selects a specific coherence label from the above, we ask them to rate the quality of the captions given the label on a scale of 1 to 5. We use these annotations as training data for our coherence-aware captioning metric, COSMic. We call this data we annotated RaCCoon (Ratings for Conceptual Caption).

To calculate the Cohen's κ agreement measure, we selected 150 images randomly and assigned them to two annotators. The Kappa coefficient is κ = 0.89, which indicates a substantial agreement (Viera and Garrett, 2005).
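As a sketch, this agreement computation corresponds to something like the following, assuming the two annotators' coherence labels for the 150 doubly-annotated images are available as parallel lists; scikit-learn is our choice here, not necessarily the authors'.

```python
# A minimal sketch of the inter-annotator agreement computation described above.
# The label lists are toy placeholders for the 150 shared images.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Visible", "Story", "Meta", "Visible"]
annotator_b = ["Visible", "Story", "Subjective", "Visible"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(round(kappa, 2))  # the paper reports kappa = 0.89 on its 150-image subset
```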

OpenImages Ground Truth Captions. To create an out-of-domain test set, we asked our annotators to write Visible captions for 1,000 images2 from the OpenImages dataset (Kuznetsova et al., 2020a). We call this dataset COIN (Corpus of OpenImages with Natural descriptions). A sample of these ground truth captions written by our expert linguists is presented in Figure 2. We use this dataset to test COSMic and other learned metrics in Section 5 and present our benchmark results in Table 1.

4 Method

The goal of a coherence-aware image captioning metric is to predict a score for the generated caption, given the image, the reference caption, and the coherence relations of one generated caption and one reference caption.

2 The same subset, named T2, was used for the CVPR-2019 Workshop on Conceptual Captions: www.conceptualcaptions.com

This metric function M can be formalized as predicting a score s as follows:

s = M(I, g, r, g_c, r_c; θ)    (1)

where the metric is defined by parameters θ, and where the model inputs are defined as follows: I is the image being captioned, g and r are the generated and reference captions respectively, and g_c and r_c are the coherence relations for g and r respectively.
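To make the interface of Equation 1 concrete, here is a minimal Python sketch of the metric signature; the class and field names are illustrative (the paper only fixes the inputs I, g, r, g_c, r_c and the learned parameters θ).

```python
# A minimal, illustrative sketch of the inputs to the metric M in Equation 1.
from dataclasses import dataclass
from enum import Enum

class CoherenceLabel(Enum):
    META = "Meta"
    VISIBLE = "Visible"
    SUBJECTIVE = "Subjective"
    STORY = "Story"

@dataclass
class MetricInputs:
    image: bytes                    # I: the image being captioned
    generated: str                  # g: generated caption
    reference: str                  # r: reference caption
    generated_coh: CoherenceLabel   # g_c
    reference_coh: CoherenceLabel   # r_c

def cosmic_score(inputs: MetricInputs, model) -> float:
    """s = M(I, g, r, g_c, r_c; theta); `model` wraps the learned parameters."""
    return model(inputs)
```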

We now describe the architecture of our coherence-aware image captioning metric COSMic (COherence-Sensitive Metric of image captions). It has two flavors: a ViLBERT-based model pre-trained on large multimodal data, and a baseline Vanilla version, as illustrated in Figure 3. Both are trained on the RaCCoon training data (Section 3), with the normalized human-annotated rating as the model's target score.

4.1 COSMic ViLBERT

ViLBERT (Lu et al., 2019) is a multimodal feature learning model pre-trained on 3.3 million Conceptual Captions image-caption pairs. It is trained for masked multi-modal learning and multi-modal alignment prediction, and demonstrates strong performance on several downstream multimodal tasks such as VQA, VCR, grounding, and image retrieval. For this reason, we use a pre-trained ViLBERT to embed our multimodal inputs shown in Equation 1, with changes to incorporate both the captions and coherence relations.

For the input image (I), we use the same process as ViLBERT. We use a Faster R-CNN (Ren et al., 2016) model pre-trained on Visual Genome (Krishna et al., 2016) to detect object regions and extract features. The sequence of these image features is denoted as I', with 100 bounding box features where each element is in R^2048. Similar to ViLBERT, we use the special token [IMG] to denote the beginning of the bounding box features list.

For the input captions (g, r) and coherence labels (g_c, r_c), the sequence begins with the special token [CLS], followed by the input text embeddings. Each of our text inputs is tokenized and embedded using ViLBERT's input text pre-processing, and denoted as g', r', g'_c, r'_c for g, r, g_c, and r_c respectively. Note that the coherence labels are processed as text inputs, such as "Visible" and "Story", which allows the model to use its pre-trained representations of these concepts. Each of these input sequences is separated by the special token [SEP] to form our input sequence.

Hence, our input to ViLBERT is of the form v = ([IMG], I', [CLS], r', [SEP], g', [SEP], r'_c, [SEP], g'_c).

We use a linear layer with sigmoid activation on ViLBERT's output text logits to compute COSMic's output metric score (s):

s = Linear(ViLBERT(v))    (2)

During training, we fine-tune ViLBERT and the output linear layer in an end-to-end fashion by minimizing the mean-squared error between the output score s and the corresponding reference score y on the RaCCoon dataset.
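The following is a minimal PyTorch sketch of this fine-tuning objective. The VisioLinguisticEncoder class is only a stand-in for the pre-trained ViLBERT backbone (not reproduced here), and the random tensors are placeholders for the region features I' and the embedded caption/coherence-label sequence.

```python
# Sketch of the COSMic ViLBERT scoring head and one MSE fine-tuning step.
import torch
import torch.nn as nn

class VisioLinguisticEncoder(nn.Module):
    """Placeholder for ViLBERT: maps region features + token embeddings to a pooled vector."""
    def __init__(self, region_dim=2048, text_dim=768, hidden=768):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)

    def forward(self, regions, tokens):
        # regions: (batch, 100, 2048) Faster R-CNN box features (I')
        # tokens:  (batch, seq_len, 768) embedded [CLS] r' [SEP] g' [SEP] r'_c [SEP] g'_c
        return self.region_proj(regions).mean(1) + self.text_proj(tokens).mean(1)

class CosmicViLBERT(nn.Module):
    def __init__(self, hidden=768):
        super().__init__()
        self.backbone = VisioLinguisticEncoder(hidden=hidden)
        self.head = nn.Linear(hidden, 1)  # Eq. (2): linear layer + sigmoid -> score in [0, 1]

    def forward(self, regions, tokens):
        return torch.sigmoid(self.head(self.backbone(regions, tokens))).squeeze(-1)

model = CosmicViLBERT()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-6)  # the paper uses RAdam at this lr
loss_fn = nn.MSELoss()

# One fine-tuning step on a toy batch (random tensors stand in for real features).
regions = torch.randn(4, 100, 2048)   # I' for a batch of 4 images
tokens = torch.randn(4, 64, 768)      # embedded caption / coherence-label sequence
target = torch.rand(4)                # normalized RaCCoon ratings in [0, 1]

optimizer.zero_grad()
loss = loss_fn(model(regions, tokens), target)
loss.backward()
optimizer.step()
```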

4.2 COSMic Vanilla

The COSMic ViLBERT approach above takes advantage of multimodal pre-training on the Conceptual Captions dataset to embed the image and text inputs. As a simpler baseline, we now present COSMic Vanilla, which independently embeds the input image and text, to be later combined for score computation with no end-to-end training.

To extract image features, we use a ResNet50v2 (He et al., 2015) model pre-trained on ImageNet (Deng et al., 2009) and linearly transform the global image representation to a 512-dimensional space:

e_I = Linear_1(AveragePool(ResNet(I)))    (3)

In our textual feature extraction module, we embed g and r independently with a pre-trained BERT-Large-512 model. We use the [CLS] token embedding as the 1024-dimensional caption-level representation in each case and transform them to a 512-dimensional space:

e_g = Linear_2(BERT_CLS(g))
e_r = Linear_2(BERT_CLS(r))    (4)

In our coherence label embedding module, g_c and r_c are each represented as one-hot vectors such that the dimensions correspond to the labels Meta, Visible, Subjective, and Story. Each is embedded into a 512-dimensional space:

e_gc = Linear_3(g_c)
e_rc = Linear_3(r_c)    (5)

We thus obtain 5 vectors (each in R^512), one representing each of the inputs of Equation 1. We concatenate them and use a feed-forward network with progressively smaller hidden layers of sizes [512, 256, 128, 64, 32, 16, 8], each with ReLU (Agarap, 2018) activation. The output score s is computed by a final linear layer on top of the above network:

e = concat([e_I, e_g, e_r, e_gc, e_rc])
s = Linear_4(MLP_1(e))    (6)

where e ∈ R^2560 and s ∈ R. To understand the role of each component of this implementation, we further deconstruct each module in ablation experiments described in Table 2.
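A minimal PyTorch sketch of the scoring head in Equations 3–6 is shown below, assuming the 512-dimensional embeddings e_I, e_g, and e_r have already been computed by the (frozen) ResNet50v2 and BERT encoders; the class name is ours, and the toy inputs are random.

```python
# Sketch of the COSMic Vanilla head: shared one-hot label embedding (Eq. 5),
# concatenation of the five 512-d vectors, shrinking MLP, and final linear (Eq. 6).
import torch
import torch.nn as nn

class CosmicVanillaHead(nn.Module):
    def __init__(self, dim=512, num_labels=4):
        super().__init__()
        self.label_embed = nn.Linear(num_labels, dim)        # Linear_3 in Eq. (5)
        sizes = [5 * dim, 512, 256, 128, 64, 32, 16, 8]      # progressively smaller layers
        layers = []
        for d_in, d_out in zip(sizes[:-1], sizes[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.mlp = nn.Sequential(*layers)                    # MLP_1 in Eq. (6)
        self.out = nn.Linear(sizes[-1], 1)                   # Linear_4 in Eq. (6)

    def forward(self, e_I, e_g, e_r, g_c, r_c):
        e_gc, e_rc = self.label_embed(g_c), self.label_embed(r_c)
        e = torch.cat([e_I, e_g, e_r, e_gc, e_rc], dim=-1)   # e in R^2560
        return self.out(self.mlp(e)).squeeze(-1)             # scalar score s

# Toy usage: random 512-d features and one-hot Story / Visible labels.
head = CosmicVanillaHead()
e_I, e_g, e_r = (torch.randn(1, 512) for _ in range(3))
g_c = torch.tensor([[0., 0., 0., 1.]])  # generated caption: Story
r_c = torch.tensor([[0., 1., 0., 0.]])  # reference caption: Visible
print(head(e_I, e_g, e_r, g_c, r_c))
```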

4.3 Coherence-aware Captioning Systems

In order to experiment with COSMic, we generate our own captions. In this section, we describe the coherence-aware captioning systems used to generate these image captions for the training and testing of COSMic.

For our base captioning system, we use the state-of-the-art coherence-aware captioning system introduced by Alikhani et al. (2020). It uses a Transformer-based (Vaswani et al., 2017) encoder-decoder architecture, where the encoder inputs are (1) global image features, (2) image labels, and (3) the coherence label. The coherence label also serves as the first input token for the decoder, which generates the output captions. We set the coherence label to the ground-truth relation at training time and the desired relation at inference time. We use the Conceptual Captions dataset (Sharma et al., 2018) with machine-generated coherence labels for training this captioning system.

Model | Coh. Label | Avg. Human Rating | B1 | B2 | M | RL | C | S | BR | BS-F | COSMic Vanilla | COSMic ViLBERT | COSMic Vanilla+ | COSMic ViLBERT+
BUTD | Visible | 2.191 | 0.163 | 0.077 | 0.049 | 0.160 | 0.092 | 0.030 | -0.877 | 0.863 | 0.706 | 0.796 | 0.522 | 0.641
Base | Visible | 3.532 | 0.050 | 0.025 | 0.019 | 0.066 | 0.020 | 0.002 | -1.114 | 0.862 | 0.696 | 0.777 | 0.516 | 0.614
Base | Meta | 3.213 | 0.041 | 0.000 | 0.012 | 0.063 | 0.012 | 0.000 | -1.059 | 0.863 | 0.548 | 0.727 | 0.505 | 0.602
Base | Subj. | 2.830 | 0.033 | 0.012 | 0.011 | 0.057 | 0.017 | 0.000 | -1.197 | 0.849 | 0.323 | 0.421 | 0.358 | 0.403
Base | Story | 2.915 | 0.029 | 0.000 | 0.017 | 0.058 | 0.013 | 0.000 | -1.304 | 0.842 | 0.533 | 0.629 | 0.482 | 0.527
Lite | Visible | 3.298 | 0.028 | 0.011 | 0.013 | 0.053 | 0.011 | 0.000 | -1.101 | 0.863 | 0.684 | 0.784 | 0.515 | 0.604
Lite | Meta | 2.830 | 0.026 | 0.010 | 0.008 | 0.055 | 0.015 | 0.000 | -1.084 | 0.859 | 0.548 | 0.748 | 0.511 | 0.565
Lite | Subj. | 2.298 | 0.039 | 0.012 | 0.019 | 0.066 | 0.024 | 0.003 | -1.217 | 0.849 | 0.364 | 0.451 | 0.379 | 0.419
Lite | Story | 2.426 | 0.036 | 0.000 | 0.018 | 0.062 | 0.021 | 0.000 | -1.362 | 0.842 | 0.568 | 0.666 | 0.499 | 0.519
Kendall's Correlation (τ) | | 1.000 | 0.071 | 0.154 | 0.036 | -0.036 | -0.571 | -0.052 | 0.286 | 0.445 | 0.571 | 0.546 | 0.667 | 0.764

Table 1: System-level scores for 9 different image captioning systems, as evaluated by human annotators and various captioning metrics. Bottom-Up Top-Down (BUTD) is trained on COCO, while the others are trained on the Conceptual Captions (CC) dataset. The evaluation, however, is conducted on the COIN dataset, which is out-of-domain for both COCO and CC. This domain shift causes the n-gram based metrics (e.g., BLEU, ROUGE, CIDEr) to assign very low scores to otherwise correct captions (see Figure 4), whereas embedding-based metrics (e.g., BLEURT, BERTScore, and COSMic) do not suffer from this limitation. Since all metrics have different scales, instead of absolute scores we use Kendall Rank Correlation to measure agreement with human scores. Model names are abbreviated as follows: B1: Bleu1, B2: Bleu2, M: METEOR, RL: ROUGE-L, C: CIDEr, S: SPICE, BR: BLEURT, BS-F: BERTScore F1. COSMic models with '+' denote the application of data augmentation to remove training data bias. More metrics and detailed results can be found in the code repository.

To obtain the coherence labels above, we closely follow Alikhani et al. (2020) to train a coherence classifier on the Clue dataset (Alikhani et al., 2020), which provides around 4K human-annotated (image, caption, relation) triplets. We present two caption-generation systems in this section.

Base-systems family. A family of 4 captioning systems is created by setting the coherence label to Meta, Visible, Subjective, or Story in the base captioning model described above. These are considered different captioning systems because the information content and discourse goals, as controlled by the coherence label, are different.

Lite-systems family. We remove the global image features from the base model's input to obtain a smaller, light-weight (lite) model. Similar to the base model, we obtain a family of 4 captioning systems by changing the coherence label.

In Section 5, we study the order in which several image captioning metrics rank these 8 systems. The goal is to identify the metric that agrees the most with the ground-truth rankings based on human assessments.

4.4 COCO-trained Captioning System

COSMic's training data, RaCCoon, is based on Conceptual Captions, and it is coherence-aware. To test the model's generalization capability, we use a captioning system trained on MS COCO (Chen et al., 2015). Since COSMic expects an input coherence label and COCO captions are Visible-style by design, we set the label to Visible. Specifically, we use the Bottom-Up Top-Down (BUTD) Attention model (Anderson et al., 2018). This helps study how well COSMic generalizes to other captioning datasets and coherence-agnostic captioning systems.

5 Experiments

Here we describe the experimental setup to compare COSMic with other metrics. As outlined in Sections 3 and 4, we use the RaCCoon data to train our models and COIN to test COSMic and other metrics. We have several baseline metrics that we compare to, which can be found in Table 1.

5.1 Model Training Setup

We implement COSMic, as described in Section 4, with PyTorch (Paszke et al., 2019) and train on a GTX 1080 GPU. We pre-compute BERT3 and ResNet4 features using their TensorFlow (Abadi et al., 2015) implementations.

3 https://github.com/google-research/bert

4 https://www.tensorflow.org/api_docs/python/tf/keras/applications/ResNet50V2

We use the public ViLBERT5 implementation. We use a batch size of 4 and a learning rate of 2 × 10^-6 for fine-tuning ViLBERT, use the RAdam optimizer, and stop the training when the validation score does not change for 3 epochs. For COSMic Vanilla, we train with a batch size of 10 and the Adam optimizer (Kingma and Ba, 2017), with a base learning rate of 10^-3 that decays by a factor of 10^-2 every 10 epochs. We observe that the Vanilla model converges in approximately 100 epochs and ViLBERT converges in 9 epochs. ViLBERT has 250 million parameters; COSMic Vanilla includes 3,062,913 trainable parameters. Pre-trained BERT-Large and ResNet50V2 have an additional 350 million parameters. The setup for the coherence-aware captioning models used to obtain machine-generated captions for our study is the same as in Alikhani et al. (2020).
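As a rough sketch, the optimizer settings above translate to the following PyTorch configuration; the nn.Linear modules are placeholders for the two COSMic models, and torch.optim.RAdam assumes a recent PyTorch release.

```python
# Sketch of the optimizer settings stated in the text; models are placeholders.
import torch

# ViLBERT flavor: RAdam, lr = 2e-6, batch size 4, early stop after 3 stagnant epochs.
vilbert_model = torch.nn.Linear(8, 1)   # placeholder module
vilbert_opt = torch.optim.RAdam(vilbert_model.parameters(), lr=2e-6)

# Vanilla flavor: Adam, base lr = 1e-3, decayed by a factor of 1e-2 every 10 epochs.
vanilla_model = torch.nn.Linear(8, 1)   # placeholder module
vanilla_opt = torch.optim.Adam(vanilla_model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(vanilla_opt, step_size=10, gamma=1e-2)
```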

5.2 Baseline Captioning Metrics

To benchmark COSMic, we compare it with other learned metrics. In this section, we describe the various metrics traditionally used for measuring image captioning systems. None of these metrics were designed to support the coherence relations of the reference or generated captions. These serve as baselines for COSMic.

N-gram based. The most popular image captioning metrics are based on precision and recall of n-grams from generated and reference captions. We compare with Bleu1, Bleu2, Bleu3, Bleu4 (Guo and Hu, 2019), ROUGE-L (Lin, 2004), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016b). We compute these using their popular open-source implementation6.

BLEURT. We use a pre-trained BLEURT model7 as a baseline for our work. Unlike n-gram based approaches, BLEURT uses BERT-based word embeddings, which are robust to variations in surface word realizations between the reference and generated captions. We do not do any fine-tuning for this baseline.

BERTScore. BERTScore8 uses a pre-trained BERT model to embed the reference and generated captions. Text-level similarity scores are then computed by matching the tokens' output embeddings.

5 https://github.com/facebookresearch/vilbert-multi-task
6 https://github.com/tylin/coco-caption
7 https://github.com/google-research/bleurt
8 https://github.com/Tiiiger/bert_score

Please note that for both BERT-based baselines above (BLEURT, BERTScore), we use the BERT-Large-512 size model.
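For reference, a hedged sketch of how these two baselines are typically invoked through the public bleurt and bert_score packages listed in the footnotes; the BLEURT checkpoint path is illustrative and must point to a downloaded checkpoint directory.

```python
# Sketch of scoring a generated caption against a reference with the two
# embedding-based baselines.
from bleurt import score as bleurt_score
from bert_score import score as bert_score

references = ["close-up of pink flowers"]
candidates = ["first flower of the year"]

scorer = bleurt_score.BleurtScorer("BLEURT-20")           # illustrative checkpoint path
bleurt_scores = scorer.score(references=references, candidates=candidates)

P, R, F1 = bert_score(candidates, references, lang="en")  # precision/recall/F1 tensors
print(bleurt_scores, F1.mean().item())
```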

5.3 COIN-based Evaluation Setup

We use each baseline metric and COSMic to score the 8 different image captioning systems described in Section 4 on the same set of test images with reference captions. Note that the range and scale of each metric is different; however, they are all monotonically increasing functions of model quality. So, in our study, we do not analyze the absolute scores assigned by these metrics but only their ranks. We also ask human annotators to rank these 8 captioning systems on the same set of test images. The ranks assigned by a higher-performing metric will align better with the ranks from human annotators.

Since the captioning systems above are trained on Conceptual Captions or COCO, we use image-caption pairs from COIN for an out-of-domain evaluation. A subset of 50 random images is used to rank the captioning systems as described above, resulting in 400 machine-generated captions total for the 8 captioning systems. These were then evaluated by human annotators using the process described in Section 3. The human-scored system-level performance for each captioning system on this test set is reported in Table 1 in "Avg. Human Rating".

We measure the alignment between metric-assigned and human-assigned scores using the Kendall (Kendall, 1938) correlation coefficient. In order to calculate the score, we first aggregate all the sample scores and average them. Then we calculate the Kendall tau score using the SciPy 1.7.1 implementation. The score is calculated between two vectors, the first being the average human ratings for the 8 models and the second being the investigated metric's scores for the 8 models, in the following order: [Base-Visible, Base-Meta, Base-Subjective, Base-Story, Lite-Visible, Lite-Meta, Lite-Subjective, Lite-Story]. Due to the small sample size, Kendall correlation is the most suitable correlation measure.
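A minimal sketch of this computation with SciPy's kendalltau is shown below, using the average human ratings and, as an example, the CIDEr column from Table 1 in the stated 8-system order; since the table values are rounded, the result may differ slightly from the reported τ.

```python
# Sketch of the system-level Kendall correlation between human ratings and a metric.
from scipy.stats import kendalltau

human = [3.532, 3.213, 2.830, 2.915, 3.298, 2.830, 2.298, 2.426]   # Avg. Human Rating
cider = [0.020, 0.012, 0.017, 0.013, 0.011, 0.015, 0.024, 0.021]   # CIDEr column

tau, p_value = kendalltau(human, cider)
print(round(tau, 3))   # a negative tau, consistent with CIDEr's poor agreement in Table 1
```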

A key measure of the success of an automatic evaluation metric is whether it makes the same decision about which system is better in a head-to-head evaluation as we would get from a human-subjects evaluation. If each system is evaluated based on its average score, then success comes when the average computed metric correlates closely with the average human ranking. In particular, we measure the alignment between metric-assigned and human-assigned scores using the Kendall score, following the work of Sellam et al. (2020).

6 Results

Table 1 presents the results of the COIN-based study. The last row reports the Kendall correlation coefficient between the scores assigned by the metrics and by humans.

All n-gram based metrics, such as BLEU and CIDEr, fail to adapt to the out-of-domain ground-truth captions from COIN. This results in a relatively flat distribution of system-level scores concentrated close to 0, and hence low correlation coefficients. CIDEr has a highly negative Kendall's τ, which denotes a strong negative association with human judgements. This is partly due to low (~0.01) and hence noisy CIDEr scores (Figure 4 provides example cases that illustrate this argument).

Embedding-based methods BLEURT and BERTScore do not suffer from this limitation, resulting in more meaningful scoring of systems and hence higher correlation with human scores. However, by design, both these metrics are agnostic to coherence labels and the input image. COSMic, which is coherence-aware, obtains the highest correlation with human scores. COSMic ViLBERT has the highest Kendall's correlation among all of our models. COSMic Vanilla performs the second best among our models, and it performs better than the rest of the metrics in terms of Kendall's correlation.

Data Augmentation. The raw RaCCoon training data has a coherence-level bias, as demonstrated by the average COSMic score for each class: Visible (0.622), Meta (0.459), Subjective (0.236), and Story (0.397). This reflects the human annotators' bias towards liking Visible captions the most and Subjective captions the least, which is expected. However, training COSMic on this data injects the same coherence bias into the model, which is undesirable. As presented in Table 1, both flavors of COSMic (without the '+') assign high scores to Visible captioning systems.

To mitigate this issue, we algorithmically augment the training data to bring the average scores for each coherence class to comparable values. We achieve this by pairing images with random captions from the coherence class and assigning them a score of 0. This is a valid training sample because the randomly sampled caption does not describe the said image and serves as a negative sample. With these operations, the class bias is significantly reduced: Visible (0.459), Meta (0.439), Subjective (0.328), and Story (0.425). The COSMic columns in Table 1 with '+' denote that this data augmentation approach improves the ranking of captioning systems, leading to better alignment with human judgements.
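A minimal sketch of this augmentation step is shown below, assuming a hypothetical list of (image_id, caption, coherence, score) records; each negative sample pairs an image with a randomly drawn caption of the same coherence class and a target score of 0.

```python
# Sketch of the negative-sample augmentation used to reduce coherence-class bias.
import random
from collections import defaultdict

def augment_with_negatives(dataset, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, caption, coherence, score in dataset:
        by_class[coherence].append(caption)

    negatives = []
    for image_id, caption, coherence, score in dataset:
        # Draw a caption of the same coherence class that was written for another image.
        candidates = [c for c in by_class[coherence] if c != caption]
        if candidates:
            negatives.append((image_id, rng.choice(candidates), coherence, 0.0))
    return dataset + negatives
```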

Ablation Study. Table 2 reports the performance of COSMic Vanilla without coherence labels and/or the image as model inputs. We find that the removal of image features affects COSMic's performance, showing the important contribution of images. The performance deteriorates significantly when the coherence labels are removed from the model (the "No r_c, g_c" column in Table 2). This demonstrates that COSMic successfully integrates coherence relations in the caption scoring process.

Reference | Generated
two men in scrubs performing surgery | surgeons operating on a patient
mountains in front of a clear blue sky | mountain range as seen from the trail
large brick building next to a green lawn and big trees | the front of the house
a foggy forest | light shining through the trees

Figure 4: Illustration of COIN reference captions and corresponding outputs of the Base-Visible model. Though the generated captions are correct, an n-gram based metric such as CIDEr assigns them a very low score due to the variations in surface word realizations. See Table 1 for average scores over the test set. (Photo credits, from left to right: US Army Africa, Gabriel Fr, James Bradley, Rosmarie Voegtli)

Model | Coh. Label | Full | No I | No r_c, g_c | No I & r_c, g_c
Base | Visible | 0.516 | 0.447 | 0.434 | 0.442
Base | Meta | 0.505 | 0.439 | 0.442 | 0.453
Base | Subj. | 0.356 | 0.347 | 0.438 | 0.453
Base | Story | 0.505 | 0.433 | 0.436 | 0.445
Lite | Visible | 0.515 | 0.444 | 0.434 | 0.433
Lite | Meta | 0.511 | 0.434 | 0.447 | 0.464
Lite | Subj. | 0.379 | 0.367 | 0.440 | 0.459
Lite | Story | 0.499 | 0.440 | 0.433 | 0.442
Kendall's Corr. (τ) | | 0.667 | 0.546 | -0.222 | -0.415

Table 2: Ablation experiment results. "No I" represents COSMic Vanilla without image features; "No r_c, g_c" represents COSMic Vanilla without coherence label embeddings; finally, "No I & r_c, g_c" represents COSMic Vanilla without coherence label embeddings and without image features.

7 Conclusion

Our work is the first step towards designing generation metrics that respect the information goal of the generated text. We observe that a small set of examples annotated with coherence relations can provide what is needed for learning a discourse-aware generation metric. Our findings have implications for designing context-aware multimodal metrics with criteria that are closer to human ratings for evaluating machine-generated multimodal content.

We have called attention to the challenge of learning robust generation metrics that can evaluate the output of generation models while considering their information goals. Our findings suggest that by fine-tuning ViLBERT, originally trained with millions of images, on a smaller sample of coherence relations and expert-annotated scores, automated metrics can score generated captions closer to a human rating. The presented dataset provides the opportunity for future research in the area of image description generation, designing discourse-aware metrics, and multimodal content evaluation. We hope that coherence-aware text generation metrics can be used for learning better generation models (such as abstractive summarization or story generation) and can be deployed directly in machine learning pipelines to help in optimizing hyper-parameters. Ultimately, the intention is to have a generalizable model that can use a labeling mechanism, not restricted to coherence labels, to improve the applicability of generation metrics in different tasks.

8 Ethics

This paper describes a research prototype. We do not work with sensitive or personal data. Our protocol was approved by our ethics board. Human subjects participated voluntarily, undertook minimal risk, and were compensated fairly for their time. The dataset we produced is fully anonymized. Subjects consented to the distribution of their data as part of their participation in the research. Technologists should think carefully before deploying our ideas in production. Our work depends on pretrained models such as word and image embeddings. These models are known to reproduce and even magnify societal bias present in training data. Moreover, like many ML/NLP methods, our methods are likely to perform better for content that is better represented in training, leading to further bias against marginalized groups. We can hope that general methods to mitigate harms from ML bias can address these issues.

A distinctive complication of our work is the fact that many image–text presentations involve writers expressing subjective opinions. By its nature, our evaluation metric assesses such subjective texts based on averages and trends across many users, which may be problematic. Although such judgments are ultimately matters of personal taste, they are nevertheless often grounds by which hierarchies of differences are culturally encoded and enforced. Thus, a deployed subjective-caption generation system could well be unfair to users, especially if those users are not confident in their own taste or critical towards the system's responses. Our evaluation metric is not sensitive to such harms.

Acknowledgements

The authors affiliated with Rutgers University were partly supported by NSF Award CCF-19349243. Thanks to Pitt Cyber for supporting this project and the authors from the University of Pittsburgh. We also acknowledge the Center for Research Computing at the University of Pittsburgh for providing the required computational resources for carrying out experiments at the University of Pittsburgh.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Abien Fred Agarap. 2018. Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375.

Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, and Matthew Stone. 2020. Cross-modal coherence modeling for caption generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6525–6535, Online. Association for Computational Linguistics.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016a. SPICE: Semantic propositional image caption evaluation. CoRR, abs/1607.08822.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016b. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server.

Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5804–5812.

Samuel Cumming, Gabriel Greenberg, and Rory Kelly. 2017. Conventions of viewpoint coherence in film. Philosophers' Imprint, 17(1):1–29.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304.

Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven C.H. Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449, Online. Association for Computational Linguistics.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR, abs/1512.03385.

Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, and John A. Bateman. 2021. AI2D-RST: a multimodal corpus of 1000 primary school science diagrams. Language Resources and Evaluation, 55(3):661–688.

Jerry R. Hobbs. 1985. On the coherence and structure of discourse.

Xinyue Huang and Adriana Kovashka. 2016. Inferring visual persuasion via body language, setting, and deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 73–79.

M. G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting language and vision using crowdsourced dense image annotations.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019a. Integrating text and image: Determining multimodal document intent in Instagram posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4622–4632, Hong Kong, China. Association for Computational Linguistics.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019b. Integrating text and image: Determining multimodal document intent in Instagram posts. arXiv preprint arXiv:1904.09073.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020a. The Open Images Dataset V4. International Journal of Computer Vision, 128(7):1956–1981.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020b. The Open Images Dataset V4. International Journal of Computer Vision, pages 1–26.

Alex Lascarides and Matthew Stone. 2009. A formal semantic analysis of gesture. Journal of Semantics, 26(4):393–449.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6539–6550, Florence, Italy. Association for Computational Linguistics.

William C. Mann and Sandra A. Thompson. 1987. Rhetorical structure theory: A theory of text organization. University of Southern California, Information Sciences Institute, Los Angeles.

Scott McCloud. 1993. Understanding Comics: The Invisible Art. William Morrow.

Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. 2020. Understanding guided image captioning performance across domains. arXiv preprint arXiv:2012.02339.

Christian Otto, Matthias Springstein, Avishek Anand, and Ralph Ewerth. 2019. Understanding, categorizing and predicting semantic image-text relations. In Proceedings of the 2019 International Conference on Multimedia Retrieval, pages 168–176. ACM.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation, pages 311–318.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008a. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008b. The Penn Discourse TreeBank 2.0. In LREC. Citeseer.

J. Pustejovsky and N. Krishnaswamy. 2020. Situated meaning in multimodal dialogue: human-robot and human-computer interactions.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks.

Deborah Schiffrin. 1980. Meta-talk: Organizational and evaluative brackets in discourse. Sociological Inquiry, 50(3-4):199–236.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2014. CIDEr: Consensus-based image description evaluation. CoRR, abs/1411.5726.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Anthony Viera and Joanne Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family Medicine, 37:360–363.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Discourse-aware neural extractive text summarization. arXiv preprint arXiv:1910.14142.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Page 2: arXiv:2109.05281v1 [cs.CL] 11 Sep 2021

robust performance and achieve a higher correla-tion with human judgments than that of previousmetrics Nevertheless they too fail to generalizeto the different kinds of content that successful de-scriptions may exhibit across different goals andcontexts That is they cannot distinguish reason-able descriptions that happen to differ from refer-ence output in their goals and perspective fromproblematic descriptions that hallucinate inappro-priate content or context

To bridge this gap we present a coherence-awareembedding-based generation metric that learns torespect diverse discourse goals without penalizingcaptions that are purposefully generated to fulfilldifferent purposes or communicate background in-formation Figure 1 demonstrates this capability bypresenting an example image and captions with dif-ferent coherence labels together with their scores

Our approach to modeling discourse goals isbased on the framework of discourse coherence the-ory (Hobbs 1985) which characterizes the infer-ences that give discourse units a coherent joint in-terpretation using a constrained inventory of coher-ence relations In particular we use the taxonomyfor imagendashtext coherence developed by Alikhaniet al (2020) which for example includes VisibleStory and Subjective relations between the text andthe image A description and an image stand in aVisible relation if the text includes information thatis recognizably depicted in the image Subjectivecaptions react to the content of the image and Storycaptions provide a free-standing description of thecircumstances depicted in the image similar to theNarration relation in text Our metric is learned inpart from a new dataset of 4000 images with de-scriptions labeled with different coherence labelsin this taxonomy

In inaugurating the study of coherence-awaregeneration metrics we make the following specificcontributions In Section 3 we present two differ-ent annotated datasets for training and testing acoherence-aware metric We present a model toscore a generated caption given the image refer-ence caption and the discourse goals of both thesecaptions (Section 4) We compare this metric to pre-vious ones using a common methodology rankingthe performance of several different caption genera-tion systems on out-of-domain imagesmdashrelying ona new benchmark out-of-domain test set which wepublish providing reference captions for a subsetof OpenImages (Kuznetsova et al 2020b) Our

experiments demonstrate that among all these met-rics our proposed metric has the highest correlationwith human judgments

2 Related work

There are diverse ways of characterizing the con-tributions of text and imagery Gao et al (2015)investigate the genre of image captions and Huangand Kovashka (2016) study the persuasive implicitrelationships between text and images Kruk et al(2019b) study the emotional links between text andimages Otto et al (2019) present an annotateddataset of text and imagery that compares the in-formation load in text and images However webuild on works that study information-level infer-ences between discourse units in different modali-ties such as comic book panels (McCloud 1993)movie plots (Cumming et al 2017) and diagram-matic elements (Hiippala et al 2021) In particularwe use Alikhani et al (2020)rsquos relations that char-acterize inferences between text and images

Coherence-aware models have benefited sev-eral NLP tasks such as gesture interpretation (Las-carides and Stone 2009 Pustejovsky and Krish-naswamy 2020) text summarization (Xu et al2019) machine comprehension (Gao et al 2020)The majority of these works use Rhetorical Struc-ture Theory (RST) (Mann and Thompson 1987)and Penn Discourse TreeBank (PDTB) (Prasadet al 2008b) datasets to learn and predict theserelations between two adjacent text spans In thisline of work we are the first to present a coherence-aware generation metric

The most widely used automatic evaluation met-rics are ngram-based which compute the exactnumber of ngram matches between reference andgenerated text (Cui et al 2018) Examples ofsuch metrics that are commonly used for evaluatingthe output of captioning translation and summa-rization models are BLEU (Papineni et al 2002)ROUGE (Lin 2004) and CIDEr (Vedantam et al2015) The major problem of the n-gram sim-ilarity metrics is that they give no credit to syn-onym matches of reference n-grams even if thosewords are common and used appropriately in thegenerated text Embedding-based metrics such asBLEURT (Sellam et al 2020) and BERTScore(Zhang et al 2020) designed to address this lim-itation are closer to human ratings BLEURT isa data-intensive training scheme that is based onBERT (Devlin et al 2019) fine-tuned on human

Facade of a glass building A pink flower bush in a gar-den

The underside of the Arc deTriomphe

Close-up of a fly sitting on adaisy

Man sitting by his artworklooking at a large statue ofa man on a horse in a royalcourtyard

Woman with an umbrellareading a book sitting in thegrass in front of a city sky-line

Cowboy on a horse and cow-boy on the ground workingtogether to lasso a calf in apen

Black and white artworkpainted on a blue wall

Figure 2 Examples of the ground truth captions that we collected for the COIN dataset (Photo credits from left toright top to bottom Sharron Mollerus Northfielder George M Groutas davebloggs007 Tim Adams BrisbaneCity Council Colin Brown Guilhem Vellut)

ratings of generated text BERTScore howevercomputes the similarity score as the average ofcosine similarities between predicted tokens andtheir top matching reference tokens These metricshowever do not respect the information goal andthe purpose for which the model has generated thetext We address this problem by introducing thefirst coherence-aware generation metric Similarto SPICE (Anderson et al 2016b) and VIFIDEL(Madhyastha et al 2019) we use the informationencoded in images We further propose the addi-tion of coherence relations that facilitate learningwith fewer samples by a multimodal metric usingpre-trained BERT and ViLBERT

3 Data Collection

We collect two datasets human judgments forimage captions that are generated by coherence-aware captioning systems using Conceptual Cap-tions dataset and ground-truth labels for the OpenImages dataset With Conceptual Captions cor-pora we fine-tune ViLBERT with ratings and showthat addition of coherence relations can make au-tomated scoring closer to human scoring We useOpenImages corpora to reinforce that multimodal-ity and coherence relations have significant contri-butions to scoring out-of-domain datasets as well

Protocol We hired two expert linguists for dataannotation and designed an annotation website tofacilitate the annotation procedure They are na-tive English speakers who identify themselves as

of White and Latino ethnicity The code 1 of theannotation website and the details of the protocolis publicly available The study has been approvedby our institutionrsquos human subject board

Conceptual Captions Score Annotation Wehave collected ratings on the quality of different im-age descriptions with coherence labels for a subsetof 1000 images from the Conceptual Captions (CC)training dataset (Ng et al 2020) With this paperwe are publishing this dataset as a benchmark forevaluation metrics that are coherence-aware Theset-up of the data collection is as follows CCimages are input into a caption-generation modelcreated by Alikhani et al (2020) This modelgenerates coherence-aware descriptions for inputimages in 4 different coherence classes of MetaVisible Subjective Story These 4000imagecaption pairs are then presented to humanannotators who are asked to select the correctcoherence label for each pair

• Meta: the caption talks about when, where, and how the picture is taken; Meta-talk in Schiffrin (1980).
• Visible: the caption is true just by looking at the picture; Restatement relation in Prasad et al. (2008a).
• Subjective: the caption is a matter of opinion; Evaluation relation in Hobbs (1985).
• Story: text and image work like story and illustration; Occasion relation in Hobbs (1985).

1 https://github.com/Merterm/COSMic

Figure 3: An illustration of different flavors of COSMic, which outputs a score for the generated caption given the image, the reference caption, and the coherence labels for both captions. (a) COSMic Vanilla uses only global textual and visual features, while (b) COSMic ViLBERT uses combined visio-linguistic features with both local and global focus. This model takes into account the information goals (determined by coherence labels) for both captions when comparing the generated caption to the reference for evaluation.

After the annotator selects a specific coherence label from the above, we ask them to rate the quality of the caption, given the label, on a scale of 1 to 5. We use these annotations as training data for our coherence-aware captioning metric, COSMic. We call the data we annotated RaCCoon (Ratings for Conceptual Captions).

To calculate the Cohen's κ agreement measure, we selected 150 images randomly and assigned them to two annotators. The Kappa coefficient is κ = 0.89, which indicates substantial agreement (Viera and Garrett, 2005).
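For reference, an agreement check of this kind can be computed with scikit-learn; the snippet below is a minimal sketch, and the parallel label lists are hypothetical stand-ins for the two annotators' choices on the 150 shared images.

```python
# Minimal sketch of an inter-annotator agreement check (not the authors' exact script).
# `labels_a` and `labels_b` are hypothetical parallel lists of coherence labels
# chosen by the two annotators for the same images.
from sklearn.metrics import cohen_kappa_score

labels_a = ["Visible", "Meta", "Story", "Visible", "Subjective"]   # annotator 1
labels_b = ["Visible", "Meta", "Story", "Visible", "Visible"]      # annotator 2

kappa = cohen_kappa_score(labels_a, labels_b)
print(f"Cohen's kappa: {kappa:.2f}")
```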

OpenImages Ground-Truth Captions. To create an out-of-domain test set, we asked our annotators to write Visible captions for 1,000 images2 from the OpenImages dataset (Kuznetsova et al., 2020a). We call this dataset COIN (Corpus of OpenImages with Natural descriptions). A sample of these ground-truth captions written by our expert linguists is presented in Figure 2. We use this dataset to test COSMic and other learned metrics in Section 5 and present our benchmark results in Table 1.

4 Method

The goal of a coherence-aware image captioning metric is to predict a score for the generated caption, given the image, the reference caption, and the coherence relations of one generated caption and one reference caption.

2 The same subset, named T2, was used for the CVPR 2019 Workshop on Conceptual Captions: www.conceptualcaptions.com

This metric function M can be formalized as predicting a score s as follows:

s = M(I, g, r, g_c, r_c; θ)    (1)

where the metric is defined by parameters θ, and the model inputs are defined as follows: I is the image being captioned; g and r are the generated and reference captions, respectively; and g_c and r_c are the coherence relations for g and r, respectively.
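Read as code, Equation 1 amounts to a scoring function with the following interface; the names below are illustrative and only fix the input and output types, not the authors' implementation.

```python
# Hypothetical interface for the metric M in Equation 1; names are illustrative.
from typing import Any

def metric_score(image: Any, generated: str, reference: str,
                 gen_label: str, ref_label: str, params: Any) -> float:
    """Return a score s for the generated caption given the image, the reference
    caption, and the coherence labels of both captions (e.g., "Visible", "Story")."""
    raise NotImplementedError  # realized by COSMic ViLBERT (4.1) or COSMic Vanilla (4.2)
```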

We now describe the architecture of our coherence-aware image captioning metric, COSMic (COherence-Sensitive Metric of image captions). It has two flavors: a ViLBERT-based model pre-trained on large multimodal data, and a baseline Vanilla version, as illustrated in Figure 3. Both are trained on the RaCCoon training data (Section 3), with the normalized human-annotated rating serving as the model's target score.

4.1 COSMic ViLBERT

ViLBERT (Lu et al., 2019) is a multimodal feature-learning model pre-trained on 3.3 million Conceptual Captions image–caption pairs. It is trained for masked multi-modal learning and multi-modal alignment prediction, and demonstrates strong performance on several downstream multimodal tasks such as VQA, VCR, grounding, and image retrieval. For this reason, we use a pre-trained ViLBERT to embed our multimodal inputs from Equation 1, with changes to incorporate both the captions and the coherence relations.

For the input image (I), we use the same process as ViLBERT. We use a Faster R-CNN (Ren et al., 2016) model pre-trained on Visual Genome (Krishna et al., 2016) to detect object regions and extract features. The sequence of these image features is denoted as I', with 100 bounding-box features, where each element is in R^2048. As in ViLBERT, we use the special token [IMG] to denote the beginning of the bounding-box feature list.

For the input captions (g, r) and coherence labels (g_c, r_c), the sequence begins with the special token [CLS], followed by the input text embeddings. Each of our text inputs is tokenized and embedded using ViLBERT's input text pre-processing, and denoted as g', r', g'_c, and r'_c for g, r, g_c, and r_c, respectively. Note that the coherence labels are processed as text inputs, such as "Visible" and "Story", which allows the model to use its pre-trained representations of these concepts. Each of these input sequences is separated by the special token [SEP] to form our input sequence.

Hence, our input to ViLBERT is of the form v = ([IMG], I', [CLS], r', [SEP], g', [SEP], r'_c, [SEP], g'_c).

We use a linear layer with sigmoid activation on ViLBERT's output text logits to compute COSMic's output metric score (s):

s = Linear(ViLBERT(v)) (2)

During training, we fine-tune ViLBERT and the output linear layer in an end-to-end fashion by minimizing the mean-squared error between the output score s and the corresponding reference score y on the RaCCoon dataset.
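The scoring head and fine-tuning objective can be sketched as follows; the `vilbert_backbone` interface is a hypothetical stand-in for the public ViLBERT implementation, and the 1024-dimensional pooled output is an assumption rather than a detail from the paper.

```python
# Minimal sketch of the scoring head and fine-tuning loss (Equation 2); the
# backbone interface below is hypothetical and stands in for the public ViLBERT
# implementation, which returns multimodal representations.
import torch
import torch.nn as nn

class CosmicViLBERTHead(nn.Module):
    def __init__(self, vilbert_backbone, hidden_dim=1024):
        super().__init__()
        self.vilbert = vilbert_backbone          # pre-trained ViLBERT, fine-tuned end-to-end
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, image_regions, token_ids):
        # token_ids encode ([CLS], r', [SEP], g', [SEP], r'_c, [SEP], g'_c)
        pooled = self.vilbert(image_regions, token_ids)     # assumed pooled text output
        return torch.sigmoid(self.score_head(pooled)).squeeze(-1)

def training_step(model, optimizer, image_regions, token_ids, y):
    # Regress onto the normalized RaCCoon rating y with mean-squared error.
    optimizer.zero_grad()
    s = model(image_regions, token_ids)
    loss = nn.functional.mse_loss(s, y)
    loss.backward()
    optimizer.step()
    return loss.item()
```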

4.2 COSMic Vanilla

The COSMic ViLBERT approach above takes advantage of multimodal pre-training on the Conceptual Captions dataset to embed the image and text inputs. As a simpler baseline, we now present COSMic Vanilla, which independently embeds the input image and text, to be combined later for score computation with no end-to-end training.

To extract image features, we use a ResNet50v2 (He et al., 2015) model pre-trained on ImageNet (Deng et al., 2009) and linearly transform the global image representation into a 512-dimensional space:

e_I = Linear_1(AveragePool(ResNet(I)))    (3)

In our textual feature extraction module, we embed g and r independently with a pre-trained BERT-Large-512 model. We use the [CLS] token embedding as a 1024-dimensional caption-level representation in each case and transform them into a 512-dimensional space:

e_g = Linear_2(BERT_CLS(g))
e_r = Linear_2(BERT_CLS(r))    (4)

In our coherence label embedding module, g_c and r_c are each represented as one-hot vectors whose dimensions correspond to the labels Meta, Visible, Subjective, and Story. Each is embedded into a 512-dimensional space:

e_{g_c} = Linear_3(g_c)
e_{r_c} = Linear_3(r_c)    (5)

We thus obtain 5 vectors (each in R^512), each representing one of the inputs of Equation 1. We concatenate them and use a feed-forward network with progressively smaller hidden layers of sizes [512, 256, 128, 64, 32, 16, 8], each with ReLU (Agarap, 2018) activation. The output score s is computed by a final linear layer on top of this network:

e = concat([e_I, e_g, e_r, e_{g_c}, e_{r_c}])
s = Linear_4(MLP_1(e))    (6)

where e ∈ R^2560 and s ∈ R.

To understand the role of each component of this implementation, we further deconstruct each module in the ablation experiments described in Table 2.
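A sketch of the Vanilla scorer under these equations is shown below, assuming the 2048-d ResNet50v2 global feature and the 1024-d BERT-Large [CLS] embeddings are pre-computed (as described in Section 5.1); all names are illustrative.

```python
# Sketch of the COSMic Vanilla scorer (Equations 3-6), assuming pre-computed
# ResNet50v2 global features (2048-d) and BERT-Large [CLS] embeddings (1024-d).
import torch
import torch.nn as nn

LABELS = ["Meta", "Visible", "Subjective", "Story"]

class CosmicVanilla(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(2048, 512)    # Linear_1 in Eq. 3
        self.txt_proj = nn.Linear(1024, 512)    # Linear_2 in Eq. 4 (shared for g and r)
        self.coh_proj = nn.Linear(4, 512)       # Linear_3 in Eq. 5 (one-hot labels)
        dims = [5 * 512, 512, 256, 128, 64, 32, 16, 8]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.mlp = nn.Sequential(*layers)        # MLP_1 in Eq. 6
        self.out = nn.Linear(8, 1)               # Linear_4 in Eq. 6

    def forward(self, resnet_feat, bert_g, bert_r, gc_onehot, rc_onehot):
        e = torch.cat([self.img_proj(resnet_feat),
                       self.txt_proj(bert_g),
                       self.txt_proj(bert_r),
                       self.coh_proj(gc_onehot),
                       self.coh_proj(rc_onehot)], dim=-1)   # e in R^2560
        return self.out(self.mlp(e)).squeeze(-1)            # score s

def one_hot(label: str) -> torch.Tensor:
    v = torch.zeros(len(LABELS))
    v[LABELS.index(label)] = 1.0
    return v
```

With these layer sizes the module has 3,062,913 trainable parameters, which matches the count reported in Section 5.1; the shared Linear_2 for g and r is our reading of Equation 4.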

4.3 Coherence-aware Captioning Systems

In order to experiment with COSMic, we generate our own captions. In this section, we describe the coherence-aware captioning systems used to generate these image captions for the training and testing of COSMic.

For our base captioning system, we use the state-of-the-art coherence-aware captioning system introduced by Alikhani et al. (2020). It uses a Transformer-based (Vaswani et al., 2017) encoder-decoder architecture, where the encoder inputs are (1) global image features, (2) image labels, and (3) a coherence label. The coherence label also serves as the first input token for the decoder, which generates the output captions. We set the coherence label to the ground-truth relation at training time and to the desired relation at inference time. We use the Conceptual Captions dataset (Sharma et al., 2018) with machine-generated coherence labels for training this captioning system.

Model  Coh. Label            Avg. Human   B1    B2    M     RL    C     S     BR      BS-F   CosV   CosViL  CosV+  CosViL+
BUTD   Visible               2.191        .163  .077  .049  .160  .092  .030  -.877   .863   .706   .796    .522   .641
Base   Visible               3.532        .050  .025  .019  .066  .020  .002  -1.114  .862   .696   .777    .516   .614
Base   Meta                  3.213        .041  .000  .012  .063  .012  .000  -1.059  .863   .548   .727    .505   .602
Base   Subj                  2.830        .033  .012  .011  .057  .017  .000  -1.197  .849   .323   .421    .358   .403
Base   Story                 2.915        .029  .000  .017  .058  .013  .000  -1.304  .842   .533   .629    .482   .527
Lite   Visible               3.298        .028  .011  .013  .053  .011  .000  -1.101  .863   .684   .784    .515   .604
Lite   Meta                  2.830        .026  .010  .008  .055  .015  .000  -1.084  .859   .548   .748    .511   .565
Lite   Subj                  2.298        .039  .012  .019  .066  .024  .003  -1.217  .849   .364   .451    .379   .419
Lite   Story                 2.426        .036  .000  .018  .062  .021  .000  -1.362  .842   .568   .666    .499   .519
Kendall's Correlation (τ)    1.000        .071  .154  .036  -.036 -.571 -.052  .286   .445   .571   .546    .667   .764

Table 1: System-level scores for 9 different image captioning systems, as evaluated by human annotators and various captioning metrics. Bottom-Up Top-Down (BUTD) is trained on COCO, while the others are trained on the Conceptual Captions (CC) dataset. The evaluation, however, is conducted on the COIN dataset, which is out-of-domain for both COCO and CC. This domain shift causes the n-gram-based metrics (e.g., BLEU, ROUGE, CIDEr) to assign very low scores to otherwise correct captions (see Table 4), whereas embedding-based metrics (e.g., BLEURT, BERTScore, and COSMic) do not suffer from this limitation. Since all metrics have different scales, instead of absolute scores we use Kendall rank correlation to measure agreement with human scores. Column names are abbreviated as follows: B1: Bleu-1; B2: Bleu-2; M: METEOR; RL: ROUGE-L; C: CIDEr; S: SPICE; BR: BLEURT; BS-F: BERTScore F1; CosV: COSMic Vanilla; CosViL: COSMic ViLBERT. COSMic models with '+' denote the application of data augmentation to remove training-data bias. More metrics and detailed results can be found in the code repository.

To obtain the coherence labels above, we closely follow Alikhani et al. (2020) and train a coherence classifier on the Clue dataset (Alikhani et al., 2020), which provides around 4K human-annotated (image, caption, relation) triplets. We present two families of caption-generation systems in this section.

Base-systems family. A family of 4 captioning systems is created by setting the coherence label to Meta, Visible, Subjective, or Story in the base captioning model described above. These are considered different captioning systems because the information content and discourse goals, as controlled by the coherence label, are different.

Lite-systems family. We remove the global image features from the base model's input to obtain a smaller, light-weight (lite) model. As with the base model, we obtain a family of 4 captioning systems by changing the coherence label.

In Section 5, we study the order in which several image captioning metrics rank these 8 systems. The goal is to identify the metric that agrees the most with the ground-truth rankings based on human assessments.

4.4 COCO-trained Captioning System

COSMic's training data, RaCCoon, is based on Conceptual Captions, and it is coherence-aware. To test the model's generalization capability, we use a captioning system trained on MS COCO (Chen et al., 2015). Since COSMic expects an input coherence label and COCO captions are Visible-style by design, we set the label to Visible. Specifically, we use the Bottom-Up Top-Down (BUTD) Attention model (Anderson et al., 2018). This helps study how well COSMic generalizes to other captioning datasets and to coherence-agnostic captioning systems.

5 Experiments

Here we describe the experimental setup used to compare COSMic with other metrics. As outlined in Sections 3 and 4, we use the RaCCoon data to train our models and COIN to test COSMic and the other metrics. The baseline metrics we compare against can be found in Table 1.

5.1 Model Training Setup

We implement COSMic, as described in Section 4, with PyTorch (Paszke et al., 2019) and train on a GTX 1080 GPU. We pre-compute BERT3 and ResNet4 features using their TensorFlow (Abadi et al., 2015) implementations. We use the public ViLBERT5 implementation.

3 https://github.com/google-research/bert

4 https://www.tensorflow.org/api_docs/python/tf/keras/applications/ResNet50V2

We use a batch size of 4 and a learning rate of 2 × 10^-6 for fine-tuning ViLBERT with the RAdam optimizer, and stop training when the validation score does not change for 3 epochs. For COSMic Vanilla, we train with a batch size of 10 and the Adam optimizer (Kingma and Ba, 2017), with a base learning rate of 10^-3 that decays by a factor of 10^-2 every 10 epochs. We observe that the Vanilla model converges in approximately 100 epochs and ViLBERT converges in 9 epochs. ViLBERT has 250 million parameters; COSMic Vanilla includes 3,062,913 trainable parameters; pre-trained BERT-Large and ResNet50V2 have an additional 350 million parameters. The setup for the coherence-aware captioning models used to obtain machine-generated captions for our study is the same as in Alikhani et al. (2020).
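The optimizer settings above can be expressed as the following sketch; the model objects are placeholders rather than the actual COSMic implementations, and the RAdam used here is the variant that ships with recent PyTorch releases, which may differ from the authors' implementation.

```python
# Minimal sketch of the optimizer/scheduler settings described above
# (the two nn.Linear modules are stand-ins so the snippet runs on its own).
import torch
import torch.nn as nn

cosmic_vilbert = nn.Linear(1024, 1)   # placeholder for the ViLBERT-based scorer
cosmic_vanilla = nn.Linear(2560, 1)   # placeholder for the Vanilla scorer

# COSMic ViLBERT: RAdam with lr = 2e-6 (torch.optim.RAdam is available in PyTorch >= 1.10).
vilbert_opt = torch.optim.RAdam(cosmic_vilbert.parameters(), lr=2e-6)

# COSMic Vanilla: Adam with base lr = 1e-3, decayed by a factor of 1e-2 every 10 epochs.
vanilla_opt = torch.optim.Adam(cosmic_vanilla.parameters(), lr=1e-3)
vanilla_sched = torch.optim.lr_scheduler.StepLR(vanilla_opt, step_size=10, gamma=1e-2)

for epoch in range(100):
    # ... one training epoch over RaCCoon would go here ...
    vanilla_sched.step()   # applies the decay every 10 epochs
```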

5.2 Baseline Captioning Metrics

To benchmark COSMic, we compare it with other learned metrics. In this section, we describe the various metrics traditionally used for measuring image captioning systems. None of these metrics were designed to support the coherence relations of the reference or generated captions. These serve as baselines for COSMic.

N-gram based. The most popular image captioning metrics are based on precision and recall of n-grams from generated and reference captions. We compare with Bleu-1, Bleu-2, Bleu-3, Bleu-4 (Guo and Hu, 2019), ROUGE-L (Lin, 2004), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016b). We compute these using their popular open-source implementation6.

BLEURT. We use a pre-trained BLEURT model7 as a baseline for our work. Unlike n-gram-based approaches, BLEURT uses BERT-based word embeddings, which are robust to variations in surface word realizations between the reference and generated captions. We do not do any fine-tuning for this baseline.

BERTScore. BERTScore8 uses a pre-trained BERT model to embed the reference and generated captions. Text-level similarity scores are then computed by matching the tokens' output embeddings.

5 https://github.com/facebookresearch/vilbert-multi-task

6 https://github.com/tylin/coco-caption
7 https://github.com/google-research/bleurt
8 https://github.com/Tiiiger/bert_score

Note that for both BERT-based baselines above (BLEURT and BERTScore), we use the BERT-Large-512 model.
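For readers who want to reproduce these baselines, both metrics have public packages (pip install bert-score; BLEURT is installed from its GitHub repository). The sketch below shows a typical invocation, not necessarily the authors' exact one; the BLEURT checkpoint name is a placeholder and the example captions are taken from Figure 4.

```python
# Typical use of the public BERTScore and BLEURT packages on a caption pair;
# checkpoint path and model_type are illustrative choices, not the paper's settings.
from bert_score import score as bertscore
from bleurt.score import BleurtScorer

references = ["a foggy forest"]
candidates = ["light shining through the trees"]

# BERTScore precision/recall/F1 between generated and reference captions.
P, R, F1 = bertscore(candidates, references, lang="en", model_type="bert-large-uncased")
print("BERTScore F1:", F1.mean().item())

# BLEURT with a pre-trained checkpoint and no fine-tuning.
scorer = BleurtScorer("BLEURT-20")   # placeholder checkpoint name
print("BLEURT:", scorer.score(references=references, candidates=candidates))
```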

5.3 COIN-based Evaluation Setup

We use each baseline metric and COSMic to score the 8 different image captioning systems described in Section 4 on the same set of test images with reference captions. Note that the range and scale of each metric is different; however, they are all monotonically increasing functions of model quality. So, in our study, we do not analyze the absolute scores assigned by these metrics, but only their ranks. We also ask human annotators to rank these 8 captioning systems on the same set of test images. The ranks assigned by a higher-performing metric will align better with the ranks from the human annotators.

Since the captioning systems above are trained on Conceptual Captions or COCO, we use image–caption pairs from COIN for an out-of-domain evaluation. A subset of 50 random images is used to rank the captioning systems as described above, resulting in 400 machine-generated captions in total for the 8 captioning systems. These were then evaluated by human annotators using the process described in Section 3. The human-scored system-level performance for each captioning system on this test set is reported in Table 1 as "Average Human Rating".

We measure the alignment between metric-assigned and human-assigned scores using the Kendall (Kendall, 1938) correlation coefficient. In order to calculate it, we first aggregate all the sample scores and average them. We then calculate the Kendall tau score using the SciPy 1.7.1 implementation. The score is calculated between two vectors: the first is the average human ratings for the 8 models, and the second is the investigated metric's scores for the 8 models, in the following order: [Base-Visible, Base-Meta, Base-Subjective, Base-Story, Lite-Visible, Lite-Meta, Lite-Subjective, Lite-Story]. Due to the small sample size, Kendall correlation is the most suitable correlation measure.
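A minimal version of this computation with SciPy is shown below, using the average human ratings and the COSMic ViLBERT+ column of Table 1 in the system order given above.

```python
# System-level Kendall tau, as described above; the vectors follow the order
# [Base-Visible, ..., Lite-Story] and use the Table 1 values for the average
# human rating and the COSMic ViLBERT+ metric.
from scipy.stats import kendalltau

avg_human           = [3.532, 3.213, 2.830, 2.915, 3.298, 2.830, 2.298, 2.426]
cosmic_vilbert_plus = [0.614, 0.602, 0.403, 0.527, 0.604, 0.565, 0.419, 0.519]

tau, p_value = kendalltau(avg_human, cosmic_vilbert_plus)
print(f"Kendall tau = {tau:.3f}")   # should reproduce the .764 in the last row of Table 1
```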

A key measure of the success of an automatic evaluation metric is whether it makes the same decision about which system is better in a head-to-head evaluation as we would get from a human-subjects evaluation. If each system is evaluated based on its average score, then success comes when the averaged metric correlates closely with the average human ranking. In particular, we measure the alignment between metric-assigned and human-assigned scores using the Kendall score, following the work of Sellam et al. (2020).

6 Results

Table 1 presents the results of the COIN-based study. The last row reports the Kendall correlation coefficient between the scores assigned by each metric and those assigned by humans.

All n-gram-based metrics, such as BLEU and CIDEr, fail to adapt to the out-of-domain ground-truth captions from COIN. This results in a relatively flat distribution of system-level scores concentrated close to 0, and hence low correlation coefficients. CIDEr has a highly negative Kendall's τ, which denotes a strong negative association with human judgements. This is partly due to low (∼.01) and hence noisy CIDEr scores (Figure 4 provides example cases that illustrate this argument).

Embedding-based methods BLEURT and BERTScore do not suffer from this limitation, resulting in more meaningful scoring of systems and hence higher correlation with human scores. However, by design, both of these metrics are agnostic to coherence labels and to the input image. COSMic, which is coherence-aware, obtains the highest correlation with human scores. COSMic ViLBERT has the highest Kendall's correlation among all of our models; COSMic Vanilla performs second best among our models and still performs better than the remaining metrics in terms of Kendall's correlation.

Data Augmentation. The raw RaCCoon training data has a coherence-level bias, as demonstrated by the average COSMic score for each class: Visible (0.622), Meta (0.459), Subjective (0.236), and Story (0.397). This reflects the human annotators' bias towards liking Visible captions the most and Subjective captions the least, which is expected. However, training COSMic on this data injects the same coherence bias into the model, which is undesirable. As presented in Table 1, both flavors of COSMic (without the '+') assign high scores to Visible captioning systems.

To mitigate this issue, we algorithmically augment the training data to bring the average scores for each coherence class to comparable values. We achieve this by pairing images with random captions from the coherence class and assigning them a score of 0. This is a valid training sample because the randomly sampled caption does not describe the paired image and thus serves as a negative sample. With these operations, the class bias is significantly reduced: Visible (0.459), Meta (0.439), Subjective (0.328), and Story (0.425). The COSMic columns in Table 1 marked with '+' show that this data augmentation approach improves the ranking of captioning systems, leading to better alignment with human judgements.
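The augmentation can be sketched as follows, assuming RaCCoon examples are stored as dictionaries with image, caption, coherence, and score fields; the field names and the number of negatives per class are our own choices, not details from the paper.

```python
# Minimal sketch of the negative-sample augmentation described above.
# Example dicts and the per-class negative count are hypothetical.
import random

def augment_with_negatives(examples, num_negatives_per_class=100, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for ex in examples:
        by_class.setdefault(ex["coherence"], []).append(ex)

    negatives = []
    for coherence, pool in by_class.items():
        for _ in range(num_negatives_per_class):
            anchor = rng.choice(examples)       # image to pair with a mismatched caption
            distractor = rng.choice(pool)       # random caption from this coherence class
            if distractor["image"] == anchor["image"]:
                continue                        # skip accidental true pairs
            negatives.append({
                "image": anchor["image"],
                "caption": distractor["caption"],   # does not describe anchor's image
                "coherence": coherence,
                "score": 0.0,                       # negative sample gets a score of 0
            })
    return examples + negatives
```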

Ablation Study. Table 2 reports the performance of COSMic Vanilla without the coherence labels and/or the image as model inputs. We find that removing the image features affects COSMic's performance, showing the important contribution of images. The performance deteriorates significantly when the coherence labels are removed from the model ("No r_c, g_c" column in Table 2). This demonstrates that COSMic successfully integrates coherence relations into the caption scoring process.

Reference: two men in scrubs performing surgery | Generated: surgeons operating on a patient
Reference: mountains in front of a clear blue sky | Generated: mountain range as seen from the trail
Reference: large brick building next to a green lawn and big trees | Generated: the front of the house
Reference: a foggy forest | Generated: light shining through the trees

Figure 4: Illustration of COIN reference captions and the corresponding outputs of the Base-Visible model. Though the generated captions are correct, an n-gram-based metric such as CIDEr assigns them a very low score due to the variations in surface word realizations. See Table 1 for average scores over the test set. (Photo credits, from left to right: US Army Africa, Gabriel Fr, James Bradley, Rosmarie Voegtli)

Model  Coh. Label          Full    No I    No c    No I & c
Base   Visible             .516    .447    .434    .442
Base   Meta                .505    .439    .442    .453
Base   Subj                .356    .347    .438    .453
Base   Story               .505    .433    .436    .445
Lite   Visible             .515    .444    .434    .433
Lite   Meta                .511    .434    .447    .464
Lite   Subj                .379    .367    .440    .459
Lite   Story               .499    .440    .433    .442
Kendall's Corr (τ)         .667    .546    -.222   -.415

Table 2: Ablation experiment results. "No I" represents COSMic Vanilla without image features; "No c" ("No r_c, g_c") represents COSMic Vanilla without coherence label embeddings; finally, "No I & c" represents COSMic Vanilla without coherence label embeddings and without image features.

7 Conclusion

Our work is the first step towards designing generation metrics that respect the information goal of the generated text. We observe that a small set of examples annotated with coherence relations can provide what is needed for learning a discourse-aware generation metric. Our findings have implications for designing context-aware multimodal metrics with criteria that are closer to human ratings for evaluating machine-generated multimodal content.

We have called attention to the challenge of learning robust generation metrics that can evaluate the output of generation models while considering their information goals. Our findings suggest that by fine-tuning ViLBERT, originally trained with millions of images, on a smaller sample of coherence relations and expert-annotated scores, automated metrics can score generated captions closer to human ratings. The presented dataset provides the opportunity for future research in the areas of image description generation, discourse-aware metric design, and multimodal content evaluation. We hope that coherence-aware text generation metrics can be used for learning better generation models (such as abstractive summarization or story generation) and can be deployed directly in machine learning pipelines to help optimize hyper-parameters. Ultimately, the intention is to build a generalizable model that can use a labeling mechanism, not restricted to coherence labels, to improve the applicability of generation metrics to different tasks.

8 Ethics

This paper describes a research prototype. We do not work with sensitive or personal data. Our protocol was approved by our ethics board. Human subjects participated voluntarily, undertook minimal risk, and were compensated fairly for their time. The dataset we produced is fully anonymized. Subjects consented to the distribution of their data as part of their participation in the research. Technologists should think carefully before deploying our ideas in production. Our work depends on pretrained models such as word and image embeddings. These models are known to reproduce and even magnify societal bias present in training data. Moreover, like many ML/NLP methods, our methods are likely to perform better for content that is better represented in training, leading to further bias against marginalized groups. We can hope that general methods to mitigate harms from ML bias can address these issues.

A distinctive complication of our work is the fact that many image–text presentations involve writers expressing subjective opinions. By its nature, our evaluation metric assesses such subjective texts based on averages and trends across many users, which may be problematic. Although such judgments are ultimately matters of personal taste, they are nevertheless often grounds by which hierarchies of differences are culturally encoded and enforced. Thus, a deployed subjective-caption generation system could well be unfair to users, especially if those users are not confident in their own taste or critical towards the system's responses. Our evaluation metric is not sensitive to such harms.

Acknowledgements

The authors affiliated with Rutgers University were partly supported by NSF Award CCF-19349243. Thanks to Pitt Cyber for supporting this project and the authors from the University of Pittsburgh. We also acknowledge the Center for Research Computing at the University of Pittsburgh for providing the required computational resources for carrying out the experiments at the University of Pittsburgh.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Abien Fred Agarap. 2018. Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375.

Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, and Matthew Stone. 2020. Cross-modal coherence modeling for caption generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6525–6535, Online. Association for Computational Linguistics.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016a. SPICE: Semantic propositional image caption evaluation. CoRR, abs/1607.08822.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016b. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server.

Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5804–5812.

Samuel Cumming, Gabriel Greenberg, and Rory Kelly. 2017. Conventions of viewpoint coherence in film. Philosophers' Imprint, 17(1):1–29.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304.

Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven C.H. Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449, Online. Association for Computational Linguistics.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR, abs/1512.03385.

Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, and John A. Bateman. 2021. AI2D-RST: a multimodal corpus of 1000 primary school science diagrams. Language Resources and Evaluation, 55(3):661–688.

Jerry R. Hobbs. 1985. On the coherence and structure of discourse.

Xinyue Huang and Adriana Kovashka. 2016. Inferring visual persuasion via body language, setting, and deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 73–79.

M. G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting language and vision using crowdsourced dense image annotations.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019a. Integrating text and image: Determining multimodal document intent in Instagram posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4622–4632, Hong Kong, China. Association for Computational Linguistics.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019b. Integrating text and image: Determining multimodal document intent in Instagram posts. arXiv preprint arXiv:1904.09073.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020a. The Open Images Dataset V4. International Journal of Computer Vision, 128(7):1956–1981.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020b. The Open Images Dataset V4. International Journal of Computer Vision, pages 1–26.

Alex Lascarides and Matthew Stone. 2009. A formal semantic analysis of gesture. Journal of Semantics, 26(4):393–449.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6539–6550, Florence, Italy. Association for Computational Linguistics.

William C. Mann and Sandra A. Thompson. 1987. Rhetorical structure theory: A theory of text organization. University of Southern California, Information Sciences Institute, Los Angeles.

Scott McCloud. 1993. Understanding Comics: The Invisible Art. William Morrow.

Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. 2020. Understanding guided image captioning performance across domains. arXiv preprint arXiv:2012.02339.

Christian Otto, Matthias Springstein, Avishek Anand, and Ralph Ewerth. 2019. Understanding, categorizing and predicting semantic image-text relations. In Proceedings of the 2019 International Conference on Multimedia Retrieval, pages 168–176. ACM.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. Pages 311–318.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008a. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008b. The Penn Discourse TreeBank 2.0. In LREC. Citeseer.

J. Pustejovsky and N. Krishnaswamy. 2020. Situated meaning in multimodal dialogue: human-robot and human-computer interactions.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks.

Deborah Schiffrin. 1980. Meta-talk: Organizational and evaluative brackets in discourse. Sociological Inquiry, 50(3-4):199–236.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2014. CIDEr: Consensus-based image description evaluation. CoRR, abs/1411.5726.

Anthony Viera and Joanne Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family Medicine, 37:360–363.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Discourse-aware neural extractive text summarization. arXiv preprint arXiv:1910.14142.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Page 3: arXiv:2109.05281v1 [cs.CL] 11 Sep 2021

Facade of a glass building A pink flower bush in a gar-den

The underside of the Arc deTriomphe

Close-up of a fly sitting on adaisy

Man sitting by his artworklooking at a large statue ofa man on a horse in a royalcourtyard

Woman with an umbrellareading a book sitting in thegrass in front of a city sky-line

Cowboy on a horse and cow-boy on the ground workingtogether to lasso a calf in apen

Black and white artworkpainted on a blue wall

Figure 2 Examples of the ground truth captions that we collected for the COIN dataset (Photo credits from left toright top to bottom Sharron Mollerus Northfielder George M Groutas davebloggs007 Tim Adams BrisbaneCity Council Colin Brown Guilhem Vellut)

ratings of generated text BERTScore howevercomputes the similarity score as the average ofcosine similarities between predicted tokens andtheir top matching reference tokens These metricshowever do not respect the information goal andthe purpose for which the model has generated thetext We address this problem by introducing thefirst coherence-aware generation metric Similarto SPICE (Anderson et al 2016b) and VIFIDEL(Madhyastha et al 2019) we use the informationencoded in images We further propose the addi-tion of coherence relations that facilitate learningwith fewer samples by a multimodal metric usingpre-trained BERT and ViLBERT

3 Data Collection

We collect two datasets human judgments forimage captions that are generated by coherence-aware captioning systems using Conceptual Cap-tions dataset and ground-truth labels for the OpenImages dataset With Conceptual Captions cor-pora we fine-tune ViLBERT with ratings and showthat addition of coherence relations can make au-tomated scoring closer to human scoring We useOpenImages corpora to reinforce that multimodal-ity and coherence relations have significant contri-butions to scoring out-of-domain datasets as well

Protocol We hired two expert linguists for dataannotation and designed an annotation website tofacilitate the annotation procedure They are na-tive English speakers who identify themselves as

of White and Latino ethnicity The code 1 of theannotation website and the details of the protocolis publicly available The study has been approvedby our institutionrsquos human subject board

Conceptual Captions Score Annotation Wehave collected ratings on the quality of different im-age descriptions with coherence labels for a subsetof 1000 images from the Conceptual Captions (CC)training dataset (Ng et al 2020) With this paperwe are publishing this dataset as a benchmark forevaluation metrics that are coherence-aware Theset-up of the data collection is as follows CCimages are input into a caption-generation modelcreated by Alikhani et al (2020) This modelgenerates coherence-aware descriptions for inputimages in 4 different coherence classes of MetaVisible Subjective Story These 4000imagecaption pairs are then presented to humanannotators who are asked to select the correctcoherence label for each pair

bull Meta the caption talks about when whereand how the picture is taken Meta-talk inSchiffrin (1980)

bull Visible the caption is true just by looking atthe picture Restatement relation in Prasadet al (2008a)

bull Subjective the captions is the matter of opin-ion Evaluation relation in Hobbs (1985)

bull Story text and image work like story and il-lustration Occasion relation in Hobbs (1985)

1httpsgithubcomMertermCOSMic

Figure 3 An illustration of different flavors of COSMic that outputs a score for the generated caption given theimage reference caption and the coherence-labels for both the captions (a) COSMic Vanilla uses only globaltextual and visual features while (b) COSMic ViLBERT uses combined visio-linguistic features with both localand global focus This model takes into account the information goals (determined by coherence-labels) for boththe captions when comparing the generated caption to the reference for evaluation

After the annotator selects a specific coherencelabel from the above we ask them to rate the qualityof the captions given the label on a scale of 1 to5 We use these annotations as training data for ourcoherence-aware captioning metric COSMic Wecall this data we annotated RaCCoon (Ratings forConceptual Caption)

To calculate the Cohenrsquos κ agreement measurewe selected 150 images randomly and assignedthem to two annotators The Kappa coefficient isκ = 089 which indicates a substantial agreement(Viera and Garrett 2005)

OpenImages Ground Truth Captions To cre-ate an out of domain test set we asked our anno-tators to write Visible captions for 1000 images2

from the OpenImages dataset (Kuznetsova et al2020a) We call this dataset COIN (Corpus ofOpenImages with Natural descriptions) A sampleof these ground truth captions written by our expertlinguists are presented in Figure 2 We use thisdataset to test COSMic and other learned metricsin Section 5 and present our benchmark results inTable 1

4 Method

The goal of a coherence-aware image captioningmetric is to predict a score for the generated cap-tion given the image reference caption and coher-ence relations of one generated caption and one

2The same subset named T2 was used for theCVPR-2019 Workshop on Conceptual Captionswwwconceptualcaptionscom

reference caption This metric function M can beformalized as predicting a score s as follows

s =M(I g r gc rc θ) (1)

where the metric is defined by parameters θ andwhere the model inputs are defined as I being theimage being captioned g and r the generated andreference captions respectively gc and rc are thecoherence relations for g r respectively

We now describe the architecture of ourcoherence-aware image captioning metric COS-Mic (COherence-Sensitive Metric of imagecaptions) It has two flavors mdash a ViLBERT-basedmodel pre-trained on large multimodal data and abaseline Vanilla version as illustrated in Figure 3Both are trained on RaCCoon training data (Sec-tion 3) with normalized human annotated rating toobtain the modelrsquos target score

41 COSMic ViLBERT

ViLBERT (Lu et al 2019) is a multimodal featurelearning model pre-trained on 33 million Concep-tual Captions image and captions data It is trainedfor masked multi-modal learning and multi-modalalignment prediction and demonstrates strong per-formance on several downstream multimodal taskssuch as VQA VCR grounding and image retrievalFor this reason we use a pre-trained ViLBERT toembed our multimodal inputs shown in Equation 1with changes to incorporate both the captions andcoherence relations

For input image (I) we use the same processas ViLBERT We use a Faster R-CNN (Ren et al2016) model pre-trained on Visual Genome (Kr-ishna et al 2016) to detect objects regions and ex-tract features The sequence of these image featuresis denoted as I prime with 100 bounding box featureswhere each element is R2048 Similar to ViLBERTwe use the special token [IMG] to denote the be-ginning of the bounding box features list

For input captions (g r) and coherence labels(gc gr) the sequence begins with the special token[CLS] followed by input text embeddings Each ofour text inputs are tokenized and embedded usingViLBERTrsquos input text pre-processing and denotedas gprime rprime gprimec g

primer for g r gc and gr respectively

Note that the coherence labels are processed as textinputs such as ldquoVisiblerdquo and ldquoStoryrdquo which allowsthe model to use its pre-trained representations ofthese concepts Each of these input sequences areseparated by the special token [SEP] to form ourinput sequence

Hence our input to ViLBERT is of formv = ([IMG] I prime [CLS] rprime [SEP] gprime [SEP] rprimec [SEP] gprimec)

We use a linear layer with sigmoid activationon ViLBERTrsquos output text logits to compute COS-Micrsquos output metric score (s)

s = Linear(ViLBERT(v)) (2)

During training we fine-tune ViLBERT and theoutput linear layer in an end-to-end fashion by mini-mizing the Mean-Squared error between the outputscore s and the corresponding reference score yon the RaCCoon dataset

42 COSMic VanillaThe COSMic ViLBERT approach above takes ad-vantage of multimodal pre-training on the Concep-tual Captions dataset to embed the image and textinputs As a simpler baseline we now presentCOSMic Vanilla which independently embeds theinput image and text to be later combined for scorecomputation with no end-to-end training

To extract image features we use a ResNet50v2(He et al 2015) model pre-trained on ImageNet(Deng et al 2009) and linearly transform the globalimage representation to 512-dimensional space

eI = Linear1(AveragePool(ResNet(I))) (3)

In our textual feature extraction module weembed g and r independently with a pre-trained

BERT-Large-512 model We use the [CLS] to-ken embedding as 1024 dimensional caption-levelrepresentation in each case and transform them to512-dimensional space

eg = Linear2(BERTCLS(g))

er = Linear2(BERTCLS(r))(4)

In our coherence label embedding module gcand rc are each represented as one-hot vectors suchthat the dimensions correspond to labels Meta Vis-ible Subjective and Story Each is embedded intoa 512-dimensional space

egc = Linear3(gc)

erc = Linear3(rc)(5)

We thus obtain the 5 vectors (each R512)representing one of the inputs of Equation 1We concatenate and use a feed-forward net-work with progressively smaller hidden layersof sizes [512 256 128 64 32 16 8] each withReLU (Agarap 2018) activation The output scores is computed by a final linear layer on top of theabove network

e = concat([eI eg er egc erc ]))

s = Linear4(MLP1(e))(6)

where e isin R2560 and s isin RTo understand the role of each component of this

implementation we further deconstruct each mod-ule in ablation experiments described in Table 2

43 Coherence-aware Captioning SystemsIn order to experiment with COSMic we generateour own captions In this section we describe thecoherence-aware captioning systems used to gener-ate these image captions for the training and testingof COSMic

For our base captioning system we use the state-of-the-art coherence-aware captioning system in-troduced by (Alikhani et al 2020) It uses aTransformer-based (Vaswani et al 2017) encoder-decoder architecture where the encoder inputs are(1) global image features (2) image labels and (3)coherence label The coherence-label also servesas the first input token for the decoder which gen-erates the output captions We set the coherencelabel to the groundtruth relation at training timeand the desired relation at inference time We usethe Conceptual Captions dataset (Sharma et al2018) with machine-generated coherence labels for

System AvgHum

Rating

Metrics

Model CohLabel B1 B2 M RL C S BR BS-

FCOSMicVanilla

COSMicViL-BERT

COSMicVanilla+

COSMicViL-BERT+

BUTD Visible 2191 163 077 049 160 092 030 -877 863 706 796 522 641

Base

Visible 3532 050 025 019 066 020 002 -1114 862 696 777 516 614Meta 3213 041 000 012 063 012 000 -1059 863 548 727 505 602Subj 2830 033 012 011 057 017 000 -1197 849 323 421 358 403Story 2915 029 000 017 058 013 000 -1304 842 533 629 482 527

Lite

Visible 3298 028 011 013 053 011 000 -1101 863 684 784 515 604Meta 2830 026 010 008 055 015 000 -1084 859 548 748 511 565Subj 2298 039 012 019 066 024 003 -1217 849 364 451 379 419Story 2426 036 000 018 062 021 000 -1362 842 568 666 499 519

KendallrsquosCorrelation (τ ) 1000 071 154 036 -036 -571 -052 286 445 571 546 667 764

Table 1 System-level scores for 9 different image captioning systems as evaluated by human annotators andvarious captioning metrics Bottom-Up Top-Down (BUTD) is trained on COCO while others are trained on theConceptual Captions (CC) dataset The evaluation however is conducted on COIN dataset which is out-of-domainfor both COCO and CC This domain shift causes the n-gram based metrics (eg BLEU ROUGE CIDEr) to assignvery low scores to otherwise correct captions (See Table 4) Whereas embedding based metrics (eg BLEURTBERTScore and COSMic) do not suffer from this limitation Since all metrics have different scales instead ofabsolute scores we use Kendall Rank Correlation to measure agreement with human scores Model names areabbreviated as follows B1 Bleu1 B2 Bleu2 M METEOR RL ROUGEL C CIDEr S SPICE BR BLEURTBS-F BERTScore F1 COSMic models with rsquo+rsquo denote application of data augmentation to remove training databias More metrics and detailed results can be found on the code repository

training this captioning system To obtain the co-herence labels above we closely follow (Alikhaniet al 2020) to train a coherence classifier on theClue dataset (Alikhani et al 2020) that providesaround 4K human annotated (image caption rela-tion) triplets We present two caption-generationsystems in this section

Base-systems family A family of 4 captioningsystems is created by setting the coherence-labelto Meta Visible Subjective or Story in the basecaptioning model described above These are con-sidered different captioning systems because theinformation content and discourse goals as con-trolled by the coherence label are different

Lite-systems family We remove the global im-age features from the base modelrsquos input to obtaina smaller light-weight (lite) model Similar to thebase model we obtain a family of 4 captioningsystems by changing the coherence-label

In Section 5 we study the order in which sev-eral image captioning metrics rank these 8 systemsThe goal is to identify the metric that agrees themost with the groundtruth rankings based on hu-man assessments

44 COCO-trained Captioning SystemCOSMicrsquos training data RaCCoon is based onConceptual Captions and it is coherence-aware Totest the modelrsquos generalization capability we use

a captioning system trained on MS COCO (Chenet al 2015) Since COSMic expects an input co-herence label and COCO captions are Visible styleby design we set the label to Visible Specificallywe use the Bottom-Up Top-Down (BUTD) Atten-tion model (Anderson et al 2018) This helpsstudy how well COSMic generalizes to other cap-tioning datasets and coherence-agnostic captioningsystems

5 Experiments

Here we describe the experimental setup to com-pare COSMic with other metrics As outlined inSection 3 and 4 we use the RaCCoon data to trainour models and COIN to test COSMic and othermetrics We have several baseline metrics that wecompare to which can be found on Table 1

51 Model Training Setup

We implement COSMicmdashas described in Sec-tion 4mdashwith PyTorch (Paszke et al 2019) and trainon a GTX1080 GPU We pre-compute BERT3 andResNet4 features using their TensorFlow (Abadiet al 2015) implementations We use the pub-

3httpsgithubcomgoogle-researchbert

4httpswwwtensorfloworgapi_docspythontfkerasapplicationsResNet50V2

lic ViLBERT5 implementation We use a batchsize of 4 and a learning rate of 2times 10minus6 for fine-tuning ViLBERT and use RAdam optimizer andstop the training when the validation score doesnot change for 3 epochs For COSMic Vanillawe train with a batch-size of 10 Adam optimizer(Kingma and Ba 2017) with a base learning rateof 10minus3 that decays by a factor of 10minus2 every 10epochs We observe that the Vanilla convergesin approximately 100 epochs and ViLBERT con-verges in 9 epochs ViLBERT has 250 millionparameters COSMic Vanilla includes 3062913trainable parameters Pre-trained BERT-Large andResNet50V2 have an additional 350 million param-eters The setup for coherence-aware captioningmodels to obtain machine-generated captions forour study is the same as (Alikhani et al 2020)

52 Baseline Captioning Metrics

To benchmark COSMic we compare it with otherlearned metrics In this section we describe thesevarious metrics traditionally used for measuringimage captioning systems None of these metricswere designed to support the coherence relationsof the reference or generated captions These serveas baselines for COSMic

N-gram based The most popular image caption-ing metrics are based on precision and recall of n-grams from generated and reference captions Wecompare with Bleu1 Bleu2 Bleu3 Bleu4 (Guo andHu 2019) ROUGEL (Lin 2004) CIDEr (Vedan-tam et al 2015) and SPICE (Anderson et al2016b) We compute these using their popularopen-source implementation6

BLEURT We use a pre-trained BLEURT model7

as a baseline for our work Unlike N-gram basedapproaches BLEURT uses BERT-based word em-beddings which are robust to variations in surfaceword realizations between the reference and gen-erated captions We do not do any fine-tuning forthis baseline

BERTScore BERTScore8 uses a pre-trainedBERT model to embed the reference and gener-ated captions Text-level similarity scores are then

5httpsgithubcomfacebookresearchvilbert-multi-task

6httpsgithubcomtylincoco-caption7httpsgithubcomgoogle-research

bleurt8httpsgithubcomTiiigerbert_score

computed by matching the tokensrsquo output embed-dings

Please note that for both BERT-based baselinesabove (BLEURT BERTScore) we use the BERT-Large-512 size model

53 COIN-based Evaluation Setup

We use each baseline metric and COSMic to scorethe 8 different image captioning systems describedin Section 4 on the same set of test images withreference captions Note that the range and scaleof each metric is different however they are allmonotonously increasing functions of model qual-ity So in our study we do not analyze the abso-lute score assigned by these metrics but only theirranks We also ask human annotators to rank these8 captioning systems on the same set of test im-ages The ranks assigned by a higher performingmetric will align better with the ranks from humanannotators

Since the captioning systems above are trainedon Conceptual Captions or COCO we use im-agecaption pairs from COIN for an out-of-domainevaluation A subset of 50 random images is usedto rank the captioning systems as described aboveresulting in 400 machine-generated captions totalfor the 8 captioning systems These were thenevaluated by human annotators using the processdescribed in Section 3 The human-scored systemlevel performance for each captioning system onthis test set is reported in Table 1 in ldquoAverage Hu-man Ratingrdquo

We measure the alignment between metric-assigned and human-assigned scores using theKendall (Kendall 1938) correlation coefficient Inorder to calculate the score we first aggregate allthe sample scores and average them Then wecalculate the Kendall tau score using the SciPy171 implementation The score is calculatedbetween two vectors first of which is the aver-age human ratings for 8 models and the secondbeing the investigated metric scores for 8 mod-els in the following order[BaseV isible BaseMetaBaseSubjective BaseStory LiteV isible LiteMetaLiteSubjective LiteStory] Due to the small sam-ple size Kendall correlation is the most suitablecorrelation measure

A key measure of the success of an automaticevaluation metric is whether it makes the same deci-sion about which system is better in a head-to-headevaluation as we would get from a human-subjects

evaluation If each system is evaluated based onits average score then success comes when the av-erage computed metric correlates closely with theaverage human-ranking In particular we measurethe alignment between metric assigned and humanassigned scores using the Kendall score followingthe work of (Sellam et al 2020)

6 Results

Table 1 presents the results of the COIN-basedstudy The last row reports the Kendall correla-tion coefficient between the scores assigned by themetric and humans

All N-gram based metrics such as BLEU andCIDEr fail to adapt to the out-of-domain ground-truth captions from COIN This results in a rela-tively flat distribution of system-level scores con-centrated close to 0 and hence low correlation co-efficients CIDEr has a highly negative Kendallrsquosτ which denotes a strong negative associationwith human judgements This is partly due to low(sim001) and hence noisy CIDEr scores (Figure 4provides example cases that illustrate this argu-ment)

Embedding-based methods BLEURT andBERTScore do not suffer from this limitation re-sulting in more meaningful scoring of systems andhence higher correlation with human scores How-ever by design both these metrics are agnostic tocoherence-labels and the input image COSMicwhich is coherence-aware obtains the highest cor-relation with human scores COSMic ViLBERThas the highest Kendallrsquos correlation among all ofour models COSMic Vanilla performs the sec-ond best among our models and it performs betterthan the rest of the models in terms of Kendallrsquoscorrelation

Data Augmentation The raw RaCCoon trainingdata has a coherence-level bias as demonstrated bythe average COSMic score for each class mdash Visi-ble (0622) Meta (0459) Subjective (0236) andStory (0397) This reflects the human annotatorsrsquobias towards liking Visible captions the most andSubjective captions the least which is expectedHowever training COSMic on this data injects thesame coherence-bias into the model which is un-desirable As presented in Table 1 both flavors ofCOSMic (without the lsquo+rsquo) assign high scores toVisible captioning systems

To mitigate this issue we algorithmically aug-ment the training data to bring the average scoresfor each coherence class to comparable values Weachieve this by pairing images with random cap-tions from the coherence class and assigning thema score of 0 This is a valid training sample becausethe randomly sampled caption does not describe thesaid image and serves as a negative sample Withthese operations the class bias is significantly re-duced mdash Visible (0459) Meta (0439) Subjective(0328) and Story (0425) The COSMic columnsin Table 1 with lsquo+rsquo denote that this data augmen-tation approach improves ranking of captioningsystems leading to better alignment with humanjudgements

Ablation Study Table 2 reports the perfor-mance of COSMic Vanilla without coherence-labels andor the image as model inputs We findthat removal of image features affects COSMicrsquosperformance showing the important contributionof images The performance deteriorates signifi-cantly when the coherence-labels are removed fromthe model (No rc gc column in Table 2) Thisdemonstrates that COSMic successfully integratescoherence-relations in the caption scoring process

Figure 4: Illustration of COIN reference captions and corresponding outputs of the Base-Visible model. Though the generated captions are correct, an n-gram based metric such as CIDEr assigns them a very low score due to the variations in surface word realizations. See Table 1 for average scores over the test set. Reference captions (left to right): "two men in scrubs performing surgery"; "mountains in front of a clear blue sky"; "large brick building next to a green lawn and big trees"; "a foggy forest". Corresponding generated captions: "surgeons operating on a patient"; "mountain range as seen from the trail"; "the front of the house"; "light shining through the trees". (Photo credits, left to right: US Army Africa, Gabriel Fr, James Bradley, Rosmarie Voegtli)

System-level COSMic scores under ablation:

Model   Coh. Label    Full    No I    No c    No I & c
Base    Visible       .516    .447    .434    .442
Base    Meta          .505    .439    .442    .453
Base    Subjective    .356    .347    .438    .453
Base    Story         .505    .433    .436    .445
Lite    Visible       .515    .444    .434    .433
Lite    Meta          .511    .434    .447    .464
Lite    Subjective    .379    .367    .440    .459
Lite    Story         .499    .440    .433    .442

Kendall's Corr. (τ)   .667    .546    -.222   -.415

Table 2: Ablation experiment results. "No I" represents COSMic Vanilla without image features; "No c" (no rc, gc) represents COSMic Vanilla without coherence-label embeddings; "No I & c" represents COSMic Vanilla without both coherence-label embeddings and image features.

7 Conclusion

Our work is the first step towards designing generation metrics that respect the information goal of the generated text. We observe that a small set of examples annotated with coherence relations can provide what is needed for learning a discourse-aware generation metric. Our findings have implications for designing context-aware multimodal metrics with criteria that are closer to human ratings for evaluating machine-generated multimodal content.

We have called attention to the challenge of learning robust generation metrics that can evaluate the output of generation models while taking their information goals into account. Our findings suggest that by fine-tuning ViLBERT, originally trained with millions of images, on a smaller sample of coherence relations and expert-annotated scores, automated metrics can score generated captions closer to human ratings. The presented dataset provides the opportunity for future research in image description generation, the design of discourse-aware metrics, and multimodal content evaluation. We hope that coherence-aware text generation metrics can be used to learn better generation models (such as abstractive summarization or story generation) and can be deployed directly in machine learning pipelines to help optimize hyper-parameters. Ultimately, we intend to build a generalizable model that can use a labeling mechanism, not restricted to coherence labels, to improve the applicability of generation metrics across tasks.

8 Ethics

This paper describes a research prototype. We do not work with sensitive or personal data. Our protocol was approved by our ethics board. Human subjects participated voluntarily, undertook minimal risk, and were compensated fairly for their time. The dataset we produced is fully anonymized. Subjects consented to the distribution of their data as part of their participation in the research. Technologists should think carefully before deploying our ideas in production. Our work depends on pretrained models such as word and image embeddings. These models are known to reproduce and even magnify societal bias present in training data. Moreover, like many ML/NLP methods, our methods are likely to perform better for content that is better represented in training, leading to further bias against marginalized groups. We can hope that general methods to mitigate harms from ML bias can address these issues.

A distinctive complication of our work is the fact that many image–text presentations involve writers expressing subjective opinions. By its nature, our evaluation metric assesses such subjective texts based on averages and trends across many users, which may be problematic. Although such judgments are ultimately matters of personal taste, they are nevertheless often grounds by which hierarchies of differences are culturally encoded and enforced. Thus a deployed subjective-caption generation system could well be unfair to users, especially if those users are not confident in their own taste or critical towards the system's responses. Our evaluation metric is not sensitive to such harms.

Acknowledgements

The authors affiliated with Rutgers University were partly supported by NSF Award CCF-1934924. Thanks to Pitt Cyber for supporting this project and the authors from the University of Pittsburgh. We also acknowledge the Center for Research Computing at the University of Pittsburgh for providing the computational resources required to carry out the experiments at the University of Pittsburgh.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
Abien Fred Agarap. 2018. Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375.
Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, and Matthew Stone. 2020. Cross-modal coherence modeling for caption generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6525–6535, Online. Association for Computational Linguistics.
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016a. SPICE: Semantic propositional image caption evaluation. CoRR, abs/1607.08822.
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016b. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server.
Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5804–5812.
Samuel Cumming, Gabriel Greenberg, and Rory Kelly. 2017. Conventions of viewpoint coherence in film. Philosophers' Imprint, 17(1):1–29.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.
Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304.
Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven C.H. Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449, Online. Association for Computational Linguistics.
Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR, abs/1512.03385.
Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, and John A. Bateman. 2021. AI2D-RST: A multimodal corpus of 1000 primary school science diagrams. Lang. Resour. Evaluation, 55(3):661–688.
Jerry R. Hobbs. 1985. On the coherence and structure of discourse.
Xinyue Huang and Adriana Kovashka. 2016. Inferring visual persuasion via body language, setting, and deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 73–79.
M. G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.
Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting language and vision using crowdsourced dense image annotations.
Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019a. Integrating text and image: Determining multimodal document intent in Instagram posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4622–4632, Hong Kong, China. Association for Computational Linguistics.
Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019b. Integrating text and image: Determining multimodal document intent in Instagram posts. arXiv preprint arXiv:1904.09073.
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020a. The Open Images Dataset V4. International Journal of Computer Vision, 128(7):1956–1981.
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020b. The Open Images Dataset V4. International Journal of Computer Vision, pages 1–26.
Alex Lascarides and Matthew Stone. 2009. A formal semantic analysis of gesture. Journal of Semantics, 26(4):393–449.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6539–6550, Florence, Italy. Association for Computational Linguistics.
William C. Mann and Sandra A. Thompson. 1987. Rhetorical structure theory: A theory of text organization. University of Southern California, Information Sciences Institute, Los Angeles.
Scott McCloud. 1993. Understanding Comics: The Invisible Art. William Morrow.
Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. 2020. Understanding guided image captioning performance across domains. arXiv preprint arXiv:2012.02339.
Christian Otto, Matthias Springstein, Avishek Anand, and Ralph Ewerth. 2019. Understanding, categorizing and predicting semantic image-text relations. In Proceedings of the 2019 International Conference on Multimedia Retrieval, pages 168–176. ACM.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. Pages 311–318.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008a. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008b. The Penn Discourse TreeBank 2.0. In LREC. Citeseer.
J. Pustejovsky and N. Krishnaswamy. 2020. Situated meaning in multimodal dialogue: Human-robot and human-computer interactions.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks.
Deborah Schiffrin. 1980. Meta-talk: Organizational and evaluative brackets in discourse. Sociological Inquiry, 50(3-4):199–236.
Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2014. CIDEr: Consensus-based image description evaluation. CoRR, abs/1411.5726.
Anthony Viera and Joanne Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family Medicine, 37:360–363.
Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Discourse-aware neural extractive text summarization. arXiv preprint arXiv:1910.14142.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.


A key measure of the success of an automaticevaluation metric is whether it makes the same deci-sion about which system is better in a head-to-headevaluation as we would get from a human-subjects

evaluation If each system is evaluated based onits average score then success comes when the av-erage computed metric correlates closely with theaverage human-ranking In particular we measurethe alignment between metric assigned and humanassigned scores using the Kendall score followingthe work of (Sellam et al 2020)

6 Results

Table 1 presents the results of the COIN-basedstudy The last row reports the Kendall correla-tion coefficient between the scores assigned by themetric and humans

All N-gram based metrics such as BLEU andCIDEr fail to adapt to the out-of-domain ground-truth captions from COIN This results in a rela-tively flat distribution of system-level scores con-centrated close to 0 and hence low correlation co-efficients CIDEr has a highly negative Kendallrsquosτ which denotes a strong negative associationwith human judgements This is partly due to low(sim001) and hence noisy CIDEr scores (Figure 4provides example cases that illustrate this argu-ment)

Embedding-based methods BLEURT andBERTScore do not suffer from this limitation re-sulting in more meaningful scoring of systems andhence higher correlation with human scores How-ever by design both these metrics are agnostic tocoherence-labels and the input image COSMicwhich is coherence-aware obtains the highest cor-relation with human scores COSMic ViLBERThas the highest Kendallrsquos correlation among all ofour models COSMic Vanilla performs the sec-ond best among our models and it performs betterthan the rest of the models in terms of Kendallrsquoscorrelation

Data Augmentation The raw RaCCoon trainingdata has a coherence-level bias as demonstrated bythe average COSMic score for each class mdash Visi-ble (0622) Meta (0459) Subjective (0236) andStory (0397) This reflects the human annotatorsrsquobias towards liking Visible captions the most andSubjective captions the least which is expectedHowever training COSMic on this data injects thesame coherence-bias into the model which is un-desirable As presented in Table 1 both flavors ofCOSMic (without the lsquo+rsquo) assign high scores toVisible captioning systems

To mitigate this issue we algorithmically aug-ment the training data to bring the average scoresfor each coherence class to comparable values Weachieve this by pairing images with random cap-tions from the coherence class and assigning thema score of 0 This is a valid training sample becausethe randomly sampled caption does not describe thesaid image and serves as a negative sample Withthese operations the class bias is significantly re-duced mdash Visible (0459) Meta (0439) Subjective(0328) and Story (0425) The COSMic columnsin Table 1 with lsquo+rsquo denote that this data augmen-tation approach improves ranking of captioningsystems leading to better alignment with humanjudgements

Ablation Study Table 2 reports the perfor-mance of COSMic Vanilla without coherence-labels andor the image as model inputs We findthat removal of image features affects COSMicrsquosperformance showing the important contributionof images The performance deteriorates signifi-cantly when the coherence-labels are removed fromthe model (No rc gc column in Table 2) Thisdemonstrates that COSMic successfully integratescoherence-relations in the caption scoring process

Reference two men in scrubs per-forming surgery

mountains in front of aclear blue sky

large brick building next toa green lawn and big trees

a foggy forest

Generated surgeons operating on apatient

mountain range as seenfrom the trail

the front of the house light shining throughthe trees

Figure 4 Illustration of COIN reference captions and corresponding outputs of the Base-Visible model Thoughthe generated captions are correct an n-gram based metric such as CIDEr assigns them a very low score due to thevariations in surface word realizations See Table 1 for average scores over the test set (Photo credits from left toright US Army Africa Gabriel Fr James Bradley Rosmarie Voegtli)

System COSMic

Model CohLabel

Full No I No c No I amp c

Base

Visible 516 447 434 442Meta 505 439 442 453Subj 356 347 438 453Story 505 433 436 445

Lite

Visible 515 444 434 433Meta 511 434 447 464Subj 379 367 440 459Story 499 440 433 442

KendallrsquosCorr (τ ) 667 546 -222 -415

Table 2 Ablation experiment results No I repre-sents COSMic Vanilla without image features Norc gc represents COSMic Vanilla without coherencelabel embeddings finally No I amp No rc gc repre-sents COSMic Vanilla without coherence label embed-dings and without image features

7 Conclusion

Our work is the first step towards designing genera-tion metrics that respect the information goal of thegenerated text We observe that a small set of ex-amples annotated with coherence relations can pro-vide what is needed for learning a discourse-awaregeneration metric Our findings have implicationsfor designing context-aware multimodal metricswith criteria that are closer to human ratings forevaluating machine-generated multimodal content

We have called attention to the challenge oflearning robust generation metrics that can eval-uate the output of the generation models consid-ering the information goals Our findings sug-gest that fine-tuning ViLBERTmdashoriginally trainedwith millions of imagesmdashwith a smaller sample ofcoherence relations and expert-annotated scoringautomated metrics can score generated captionscloser to a human rating The presented datasetprovides the opportunity for future research in thearea of image description generation designingdiscourse-aware metrics and multimodal contentevaluation We hope that coherence-aware text gen-eration metrics could be used for learning bettergeneration models (such as abstractive summariza-tion or story generation) and could be deployeddirectly in machine learning pipelines to help inoptimizing hyper-parameters Ultimately it is in-tended to have a generalizable model that can usea labeling mechanismmdashnot restricted to coherencelabelsmdash to improve applicability of generation met-rics in different tasks

8 Ethics

This paper describes a research prototype We donot work with sensitive or personal data Our pro-tocol was approved by our ethics board Humansubjects participated voluntarily undertook min-imal risk and were compensated fairly for theirtime The dataset we produced is fully anonymizedSubjects consented to the distribution of their dataas part of their participation in the research Tech-nologists should think carefully before deployingour ideas in production Our work depends onpretrained models such as word and image embed-dings These models are known to reproduce andeven magnify societal bias present in training dataMoreover like many ML NLP methods our meth-ods are likely to perform better for content thatis better represented in training leading to furtherbias against marginalized groups We can hope thatgeneral methods to mitigate harms from ML biascan address these issues

A distinctive complication of our work is the factthat many imagendashtext presentations involve writ-ers expressing subjective opinions By its natureour evaluation metric assesses such subjective textsbased on averages and trends across many userswhich may be problematic Although such judg-ments are ultimately matters of personal taste theyare nevertheless often grounds by which hierarchiesof differences are culturally encoded and enforcedThus a deployed subjective-caption generation sys-tem could well be unfair to users especially if thoseusers are not confident in their own taste or criticaltowards the systemrsquos responses Our evaluationmetric is not sensitive to such harms

Acknowledgements

The authors affiliated with Rutgers University werepartly supported by NSF Award CCF-19349243Thanks to Pitt Cyber for supporting this project andthe authors from the University of Pittsburgh Wealso acknowledge the Center for Research Comput-ing at the University of Pittsburgh for providing therequired computational resources for carrying outexperiments at the University of Pittsburgh

ReferencesMartiacuten Abadi Ashish Agarwal Paul Barham Eugene

Brevdo Zhifeng Chen Craig Citro Greg S CorradoAndy Davis Jeffrey Dean Matthieu Devin SanjayGhemawat Ian Goodfellow Andrew Harp Geoffrey

Irving Michael Isard Yangqing Jia Rafal Jozefow-icz Lukasz Kaiser Manjunath Kudlur Josh Leven-berg Dandelion Maneacute Rajat Monga Sherry MooreDerek Murray Chris Olah Mike Schuster JonathonShlens Benoit Steiner Ilya Sutskever Kunal TalwarPaul Tucker Vincent Vanhoucke Vijay VasudevanFernanda Vieacutegas Oriol Vinyals Pete Warden Mar-tin Wattenberg Martin Wicke Yuan Yu and Xiao-qiang Zheng 2015 TensorFlow Large-scale ma-chine learning on heterogeneous systems Softwareavailable from tensorfloworg

Abien Fred Agarap 2018 Deep learning using recti-fied linear units (relu) CoRR abs180308375

Malihe Alikhani Piyush Sharma Shengjie Li RaduSoricut and Matthew Stone 2020 Cross-modal co-herence modeling for caption generation In Pro-ceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics pages 6525ndash6535 Online Association for Computational Lin-guistics

Peter Anderson Basura Fernando Mark Johnsonand Stephen Gould 2016a SPICE semanticpropositional image caption evaluation CoRRabs160708822

Peter Anderson Basura Fernando Mark Johnson andStephen Gould 2016b Spice Semantic propo-sitional image caption evaluation In EuropeanConference on Computer Vision pages 382ndash398Springer

Peter Anderson Xiaodong He Chris Buehler DamienTeney Mark Johnson Stephen Gould and LeiZhang 2018 Bottom-up and top-down attention forimage captioning and visual question answering InProceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR)

Xinlei Chen Hao Fang Tsung-Yi Lin Ramakr-ishna Vedantam Saurabh Gupta Piotr Dollar andC Lawrence Zitnick 2015 Microsoft coco cap-tions Data collection and evaluation server

Yin Cui Guandao Yang Andreas Veit Xun Huangand Serge Belongie 2018 Learning to evaluate im-age captioning In Proceedings of the IEEE con-ference on computer vision and pattern recognitionpages 5804ndash5812

Samuel Cumming Gabriel Greenberg and Rory Kelly2017 Conventions of viewpoint coherence in filmPhilosophersrsquo Imprint 17(1)1ndash29

J Deng W Dong R Socher L-J Li K Li and L Fei-Fei 2009 ImageNet A Large-Scale HierarchicalImage Database In CVPR09

Michael Denkowski and Alon Lavie 2014 Meteor uni-versal Language specific translation evaluation forany target language In Proceedings of the EACL2014 Workshop on Statistical Machine Translation

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Haoyuan Gao Junhua Mao Jie Zhou Zhiheng HuangLei Wang and Wei Xu 2015 Are you talking to amachine dataset and methods for multilingual im-age question In Advances in Neural InformationProcessing Systems pages 2296ndash2304

Yifan Gao Chien-Sheng Wu Jingjing Li Shafiq JotySteven CH Hoi Caiming Xiong Irwin King andMichael Lyu 2020 Discern Discourse-aware en-tailment reasoning network for conversational ma-chine reading In Proceedings of the 2020 Confer-ence on Empirical Methods in Natural LanguageProcessing (EMNLP) pages 2439ndash2449 Online As-sociation for Computational Linguistics

Yinuo Guo and Junfeng Hu 2019 Meteor++ 20Adopt syntactic level paraphrase knowledge into ma-chine translation evaluation In Proceedings of theFourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 501ndash506 Flo-rence Italy Association for Computational Linguis-tics

Kaiming He Xiangyu Zhang Shaoqing Ren and JianSun 2015 Deep residual learning for image recog-nition CoRR abs151203385

Tuomo Hiippala Malihe Alikhani Jonas HaverinenTimo Kalliokoski Evanfiya Logacheva SerafinaOrekhova Aino Tuomainen Matthew Stone andJohn A Bateman 2021 AI2D-RST a multimodalcorpus of 1000 primary school science diagramsLang Resour Evaluation 55(3)661ndash688

Jerry R Hobbs 1985 On the coherence and structureof discourse

Xinyue Huang and Adriana Kovashka 2016 Inferringvisual persuasion via body language setting anddeep features In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern RecognitionWorkshops pages 73ndash79

M G Kendall 1938 A new measure of rank correla-tion Biometrika 30(12)81ndash93

Diederik P Kingma and Jimmy Ba 2017 Adam Amethod for stochastic optimization

Ranjay Krishna Yuke Zhu Oliver Groth Justin John-son Kenji Hata Joshua Kravitz Stephanie ChenYannis Kalantidis Li-Jia Li David A ShammaMichael Bernstein and Li Fei-Fei 2016 Visualgenome Connecting language and vision usingcrowdsourced dense image annotations

Julia Kruk Jonah Lubin Karan Sikka Xiao Lin DanJurafsky and Ajay Divakaran 2019a Integratingtext and image Determining multimodal documentintent in Instagram posts In Proceedings of the2019 Conference on Empirical Methods in Natu-ral Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 4622ndash4632 Hong KongChina Association for Computational Linguistics

Julia Kruk Jonah Lubin Karan Sikka Xiao Lin DanJurafsky and Ajay Divakaran 2019b Integrat-ing text and image Determining multimodal doc-ument intent in instagram posts arXiv preprintarXiv190409073

Alina Kuznetsova Hassan Rom Neil Alldrin JasperUijlings Ivan Krasin Jordi Pont-Tuset ShahabKamali Stefan Popov Matteo Malloci AlexanderKolesnikov and et al 2020a The open imagesdataset v4 International Journal of Computer Vi-sion 128(7)1956ndash1981

Alina Kuznetsova Hassan Rom Neil Alldrin JasperUijlings Ivan Krasin Jordi Pont-Tuset ShahabKamali Stefan Popov Matteo Malloci AlexanderKolesnikov et al 2020b The open images datasetv4 International Journal of Computer Vision pages1ndash26

Alex Lascarides and Matthew Stone 2009 A formalsemantic analysis of gesture Journal of Semantics26(4)393ndash449

Chin-Yew Lin 2004 ROUGE A package for auto-matic evaluation of summaries In Text Summariza-tion Branches Out pages 74ndash81 Barcelona SpainAssociation for Computational Linguistics

Jiasen Lu Dhruv Batra Devi Parikh and StefanLee 2019 Vilbert Pretraining task-agnostic visi-olinguistic representations for vision-and-languagetasks In Advances in Neural Information Process-ing Systems volume 32 Curran Associates Inc

Pranava Madhyastha Josiah Wang and Lucia Specia2019 VIFIDEL Evaluating the visual fidelity ofimage descriptions In Proceedings of the 57th An-nual Meeting of the Association for ComputationalLinguistics pages 6539ndash6550 Florence Italy Asso-ciation for Computational Linguistics

William C Mann and Sandra A Thompson 1987Rhetorical structure theory A theory of text orga-nization University of Southern California Infor-mation Sciences Institute Los Angeles

Scott McCloud 1993 Understanding comics The in-visible art William Morrow

Edwin G Ng Bo Pang Piyush Sharma and RaduSoricut 2020 Understanding guided image cap-tioning performance across domains arXiv preprintarXiv201202339

Christian Otto Matthias Springstein Avishek Anandand Ralph Ewerth 2019 Understanding catego-rizing and predicting semantic image-text relationsIn Proceedings of the 2019 on International Con-ference on Multimedia Retrieval pages 168ndash176ACM

Kishore Papineni Salim Roukos Todd Ward and Weijing Zhu 2002 Bleu a method for automatic evalu-ation of machine translation pages 311ndash318

Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Advances in Neural Information Pro-cessing Systems 32 pages 8024ndash8035 Curran Asso-ciates Inc

Rashmi Prasad Nikhil Dinesh Alan Lee Eleni Milt-sakaki Livio Robaldo Aravind Joshi and Bon-nie Webber 2008a The Penn Discourse Tree-Bank 20 In Proceedings of the Sixth Interna-tional Conference on Language Resources and Eval-uation (LRECrsquo08) Marrakech Morocco EuropeanLanguage Resources Association (ELRA)

Rashmi Prasad Nikhil Dinesh Alan Lee Eleni Milt-sakaki Livio Robaldo Aravind K Joshi and Bon-nie L Webber 2008b The Penn discourse treebank20 In LREC Citeseer

J Pustejovsky and N Krishnaswamy 2020 Situatedmeaning in multimodal dialogue human-robot andhuman-computer interactions

Shaoqing Ren Kaiming He Ross Girshick and JianSun 2016 Faster r-cnn Towards real-time objectdetection with region proposal networks

Deborah Schiffrin 1980 Meta-talk Organizationaland evaluative brackets in discourse SociologicalInquiry 50(3-4)199ndash236

Thibault Sellam Dipanjan Das and Ankur Parikh2020 BLEURT Learning robust metrics for textgeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 7881ndash7892 Online Association for Computa-tional Linguistics

Piyush Sharma Nan Ding Sebastian Goodman andRadu Soricut 2018 Conceptual captions Acleaned hypernymed image alt-text dataset for au-tomatic image captioning In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages2556ndash2565 Melbourne Australia Association forComputational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez Ł ukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems volume 30 pages 5998ndash6008 Cur-ran Associates Inc

Ramakrishna Vedantam C Lawrence Zitnick and DeviParikh 2015 CIDEr Consensus-based image de-scription evaluation In Proceedings of the IEEEconference on computer vision and pattern recogni-tion pages 4566ndash4575

Ramakrishna Vedantam C Lawrence Zitnick andDevi Parikh 2014 Cider Consensus-based imagedescription evaluation CoRR abs14115726

Anthony Viera and Joanne Garrett 2005 Understand-ing interobserver agreement The kappa statisticFamily medicine 37360ndash3

Jiacheng Xu Zhe Gan Yu Cheng and Jingjing Liu2019 Discourse-aware neural extractive text sum-marization arXiv preprint arXiv191014142

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2020 Bertscore Eval-uating text generation with bert In InternationalConference on Learning Representations

Page 5: arXiv:2109.05281v1 [cs.CL] 11 Sep 2021

For input image (I) we use the same processas ViLBERT We use a Faster R-CNN (Ren et al2016) model pre-trained on Visual Genome (Kr-ishna et al 2016) to detect objects regions and ex-tract features The sequence of these image featuresis denoted as I prime with 100 bounding box featureswhere each element is R2048 Similar to ViLBERTwe use the special token [IMG] to denote the be-ginning of the bounding box features list

For input captions (g r) and coherence labels(gc gr) the sequence begins with the special token[CLS] followed by input text embeddings Each ofour text inputs are tokenized and embedded usingViLBERTrsquos input text pre-processing and denotedas gprime rprime gprimec g

primer for g r gc and gr respectively

Note that the coherence labels are processed as textinputs such as ldquoVisiblerdquo and ldquoStoryrdquo which allowsthe model to use its pre-trained representations ofthese concepts Each of these input sequences areseparated by the special token [SEP] to form ourinput sequence

Hence our input to ViLBERT is of formv = ([IMG] I prime [CLS] rprime [SEP] gprime [SEP] rprimec [SEP] gprimec)

We use a linear layer with sigmoid activationon ViLBERTrsquos output text logits to compute COS-Micrsquos output metric score (s)

s = Linear(ViLBERT(v)) (2)

During training we fine-tune ViLBERT and theoutput linear layer in an end-to-end fashion by mini-mizing the Mean-Squared error between the outputscore s and the corresponding reference score yon the RaCCoon dataset

42 COSMic VanillaThe COSMic ViLBERT approach above takes ad-vantage of multimodal pre-training on the Concep-tual Captions dataset to embed the image and textinputs As a simpler baseline we now presentCOSMic Vanilla which independently embeds theinput image and text to be later combined for scorecomputation with no end-to-end training

To extract image features we use a ResNet50v2(He et al 2015) model pre-trained on ImageNet(Deng et al 2009) and linearly transform the globalimage representation to 512-dimensional space

eI = Linear1(AveragePool(ResNet(I))) (3)

In our textual feature extraction module weembed g and r independently with a pre-trained

BERT-Large-512 model We use the [CLS] to-ken embedding as 1024 dimensional caption-levelrepresentation in each case and transform them to512-dimensional space

eg = Linear2(BERTCLS(g))

er = Linear2(BERTCLS(r))(4)

In our coherence label embedding module gcand rc are each represented as one-hot vectors suchthat the dimensions correspond to labels Meta Vis-ible Subjective and Story Each is embedded intoa 512-dimensional space

egc = Linear3(gc)

erc = Linear3(rc)(5)

We thus obtain the 5 vectors (each R512)representing one of the inputs of Equation 1We concatenate and use a feed-forward net-work with progressively smaller hidden layersof sizes [512 256 128 64 32 16 8] each withReLU (Agarap 2018) activation The output scores is computed by a final linear layer on top of theabove network

e = concat([eI eg er egc erc ]))

s = Linear4(MLP1(e))(6)

where e isin R2560 and s isin RTo understand the role of each component of this

implementation we further deconstruct each mod-ule in ablation experiments described in Table 2

43 Coherence-aware Captioning SystemsIn order to experiment with COSMic we generateour own captions In this section we describe thecoherence-aware captioning systems used to gener-ate these image captions for the training and testingof COSMic

For our base captioning system we use the state-of-the-art coherence-aware captioning system in-troduced by (Alikhani et al 2020) It uses aTransformer-based (Vaswani et al 2017) encoder-decoder architecture where the encoder inputs are(1) global image features (2) image labels and (3)coherence label The coherence-label also servesas the first input token for the decoder which gen-erates the output captions We set the coherencelabel to the groundtruth relation at training timeand the desired relation at inference time We usethe Conceptual Captions dataset (Sharma et al2018) with machine-generated coherence labels for

System AvgHum

Rating

Metrics

Model CohLabel B1 B2 M RL C S BR BS-

FCOSMicVanilla

COSMicViL-BERT

COSMicVanilla+

COSMicViL-BERT+

BUTD Visible 2191 163 077 049 160 092 030 -877 863 706 796 522 641

Base

Visible 3532 050 025 019 066 020 002 -1114 862 696 777 516 614Meta 3213 041 000 012 063 012 000 -1059 863 548 727 505 602Subj 2830 033 012 011 057 017 000 -1197 849 323 421 358 403Story 2915 029 000 017 058 013 000 -1304 842 533 629 482 527

Lite

Visible 3298 028 011 013 053 011 000 -1101 863 684 784 515 604Meta 2830 026 010 008 055 015 000 -1084 859 548 748 511 565Subj 2298 039 012 019 066 024 003 -1217 849 364 451 379 419Story 2426 036 000 018 062 021 000 -1362 842 568 666 499 519

KendallrsquosCorrelation (τ ) 1000 071 154 036 -036 -571 -052 286 445 571 546 667 764

Table 1 System-level scores for 9 different image captioning systems as evaluated by human annotators andvarious captioning metrics Bottom-Up Top-Down (BUTD) is trained on COCO while others are trained on theConceptual Captions (CC) dataset The evaluation however is conducted on COIN dataset which is out-of-domainfor both COCO and CC This domain shift causes the n-gram based metrics (eg BLEU ROUGE CIDEr) to assignvery low scores to otherwise correct captions (See Table 4) Whereas embedding based metrics (eg BLEURTBERTScore and COSMic) do not suffer from this limitation Since all metrics have different scales instead ofabsolute scores we use Kendall Rank Correlation to measure agreement with human scores Model names areabbreviated as follows B1 Bleu1 B2 Bleu2 M METEOR RL ROUGEL C CIDEr S SPICE BR BLEURTBS-F BERTScore F1 COSMic models with rsquo+rsquo denote application of data augmentation to remove training databias More metrics and detailed results can be found on the code repository

training this captioning system To obtain the co-herence labels above we closely follow (Alikhaniet al 2020) to train a coherence classifier on theClue dataset (Alikhani et al 2020) that providesaround 4K human annotated (image caption rela-tion) triplets We present two caption-generationsystems in this section

Base-systems family A family of 4 captioningsystems is created by setting the coherence-labelto Meta Visible Subjective or Story in the basecaptioning model described above These are con-sidered different captioning systems because theinformation content and discourse goals as con-trolled by the coherence label are different

Lite-systems family We remove the global im-age features from the base modelrsquos input to obtaina smaller light-weight (lite) model Similar to thebase model we obtain a family of 4 captioningsystems by changing the coherence-label

In Section 5 we study the order in which sev-eral image captioning metrics rank these 8 systemsThe goal is to identify the metric that agrees themost with the groundtruth rankings based on hu-man assessments

44 COCO-trained Captioning SystemCOSMicrsquos training data RaCCoon is based onConceptual Captions and it is coherence-aware Totest the modelrsquos generalization capability we use

a captioning system trained on MS COCO (Chenet al 2015) Since COSMic expects an input co-herence label and COCO captions are Visible styleby design we set the label to Visible Specificallywe use the Bottom-Up Top-Down (BUTD) Atten-tion model (Anderson et al 2018) This helpsstudy how well COSMic generalizes to other cap-tioning datasets and coherence-agnostic captioningsystems

5 Experiments

Here we describe the experimental setup to com-pare COSMic with other metrics As outlined inSection 3 and 4 we use the RaCCoon data to trainour models and COIN to test COSMic and othermetrics We have several baseline metrics that wecompare to which can be found on Table 1

5.1 Model Training Setup

We implement COSMic, as described in Section 4, with PyTorch (Paszke et al., 2019) and train on a GTX 1080 GPU. We pre-compute BERT[3] and ResNet[4] features using their TensorFlow (Abadi et al., 2015) implementations. We use the public ViLBERT[5] implementation. We use a batch size of 4 and a learning rate of 2 × 10^-6 for fine-tuning ViLBERT, use the RAdam optimizer, and stop the training when the validation score does not change for 3 epochs. For COSMic Vanilla, we train with a batch size of 10 and the Adam optimizer (Kingma and Ba, 2017) with a base learning rate of 10^-3 that decays by a factor of 10^-2 every 10 epochs. We observe that the Vanilla model converges in approximately 100 epochs and ViLBERT converges in 9 epochs. ViLBERT has 250 million parameters; COSMic Vanilla includes 3,062,913 trainable parameters. Pre-trained BERT-Large and ResNet50V2 have an additional 350 million parameters. The setup for the coherence-aware captioning models used to obtain machine-generated captions for our study is the same as in Alikhani et al. (2020).

[3] https://github.com/google-research/bert
[4] https://www.tensorflow.org/api_docs/python/tf/keras/applications/ResNet50V2
[5] https://github.com/facebookresearch/vilbert-multi-task
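The optimization schedule for COSMic Vanilla described above can be written in a few lines of PyTorch. The snippet below is a minimal sketch under assumed feature dimensions and a stand-in MLP scorer; it mirrors the reported batch size, optimizer, learning-rate decay, and epoch budget rather than reproducing the released training code.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Stand-in scorer: a small MLP over pre-computed (image + text + label) features.
# The feature size and head are assumptions, not the paper's exact architecture.
class COSMicVanilla(nn.Module):
    def __init__(self, feat_dim=3076):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, x):
        return self.mlp(x).squeeze(-1)

# Dummy data standing in for pre-computed features and human scores.
features = torch.randn(100, 3076)
scores = torch.rand(100)
loader = DataLoader(TensorDataset(features, scores), batch_size=10, shuffle=True)

model = COSMicVanilla()
criterion = nn.MSELoss()                        # regress toward human ratings
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate by a factor of 1e-2 every 10 epochs (Section 5.1).
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=1e-2)

for _ in range(100):                            # Vanilla converges in roughly 100 epochs
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```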

5.2 Baseline Captioning Metrics

To benchmark COSMic, we compare it against existing captioning metrics, both n-gram based and learned. In this section we describe these metrics, which are traditionally used for measuring image captioning systems. None of them were designed to take into account the coherence relations of the reference or generated captions. They serve as baselines for COSMic.

N-gram based. The most popular image captioning metrics are based on precision and recall of n-grams from the generated and reference captions. We compare with Bleu1, Bleu2, Bleu3, Bleu4 (Guo and Hu, 2019), ROUGE-L (Lin, 2004), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016b). We compute these using their popular open-source implementation[6].
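For reference, these scores can be computed with the coco-caption code linked in the footnote. The sketch below assumes the pycocoevalcap packaging of that code and pre-tokenized captions; the example pair is taken from Figure 4, and the image id is a placeholder.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map an image id to a list of (already tokenized) captions.
refs = {"coin_0001": ["two men in scrubs performing surgery"]}
hyps = {"coin_0001": ["surgeons operating on a patient"]}

bleu, _ = Bleu(4).compute_score(refs, hyps)      # [Bleu1, Bleu2, Bleu3, Bleu4]
rouge_l, _ = Rouge().compute_score(refs, hyps)
cider, _ = Cider().compute_score(refs, hyps)     # CIDEr's IDF statistics need a larger corpus to be meaningful
print(bleu, rouge_l, cider)
```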

BLEURT. We use a pre-trained BLEURT model[7] as a baseline for our work. Unlike n-gram based approaches, BLEURT uses BERT-based word embeddings, which are robust to variations in surface word realizations between the reference and generated captions. We do not do any fine-tuning for this baseline.

BERTScore. BERTScore[8] uses a pre-trained BERT model to embed the reference and generated captions. Text-level similarity scores are then computed by matching the tokens' output embeddings.

[6] https://github.com/tylin/coco-caption
[7] https://github.com/google-research/bleurt
[8] https://github.com/Tiiiger/bert_score

Please note that for both BERT-based baselines above (BLEURT, BERTScore), we use the BERT-Large-512 size model.
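Both baselines can be scored with their public implementations (footnotes 7 and 8). The following is a sketch only: the BLEURT checkpoint path is a placeholder for whichever pre-trained checkpoint is downloaded, and no fine-tuning is performed, matching the setup above.

```python
from bert_score import score as bert_score
from bleurt import score as bleurt_score

refs = ["two men in scrubs performing surgery"]
cands = ["surgeons operating on a patient"]

# BERTScore with a BERT-Large model, as used for this baseline.
P, R, F1 = bert_score(cands, refs, model_type="bert-large-uncased")

# BLEURT with a downloaded pre-trained checkpoint (path is a placeholder).
scorer = bleurt_score.BleurtScorer("/path/to/bleurt/checkpoint")
bleurt_scores = scorer.score(references=refs, candidates=cands)

print(F1.mean().item(), bleurt_scores[0])
```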

5.3 COIN-based Evaluation Setup

We use each baseline metric and COSMic to score the 8 different image captioning systems described in Section 4 on the same set of test images with reference captions. Note that the range and scale of each metric is different; however, they are all monotonically increasing functions of model quality. So in our study we do not analyze the absolute scores assigned by these metrics but only their ranks. We also ask human annotators to rank these 8 captioning systems on the same set of test images. The ranks assigned by a higher-performing metric will align better with the ranks from human annotators.

Since the captioning systems above are trained on Conceptual Captions or COCO, we use image-caption pairs from COIN for an out-of-domain evaluation. A subset of 50 random images is used to rank the captioning systems as described above, resulting in 400 machine-generated captions in total for the 8 captioning systems. These were then evaluated by human annotators using the process described in Section 3. The human-scored system-level performance for each captioning system on this test set is reported in Table 1 under "Average Human Rating".

We measure the alignment between metric-assigned and human-assigned scores using the Kendall (Kendall, 1938) correlation coefficient. To calculate it, we first aggregate all the sample scores and average them. We then compute the Kendall tau score using the SciPy 1.7.1 implementation. The score is calculated between two vectors: the first contains the average human ratings for the 8 models and the second the investigated metric's scores for the 8 models, in the following order: [Base-Visible, Base-Meta, Base-Subjective, Base-Story, Lite-Visible, Lite-Meta, Lite-Subjective, Lite-Story]. Due to the small sample size, Kendall correlation is the most suitable correlation measure.
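As a concrete illustration, the system-level correlation reported in the last row of Table 1 can be computed with scipy.stats.kendalltau. The human vector below follows the "Average Human Rating" column of Table 1 for the 8 CC-trained systems, in the order listed above; the metric vector is a placeholder for whichever metric is being investigated.

```python
from scipy.stats import kendalltau

# Average human ratings for the 8 systems, in the order
# [Base-Visible, Base-Meta, Base-Subjective, Base-Story,
#  Lite-Visible, Lite-Meta, Lite-Subjective, Lite-Story].
human = [3.532, 3.213, 2.830, 2.915, 3.298, 2.830, 2.298, 2.426]

# Scores assigned by the metric under investigation (placeholder values).
metric = [0.70, 0.55, 0.32, 0.53, 0.68, 0.55, 0.36, 0.57]

tau, p_value = kendalltau(human, metric)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```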

A key measure of the success of an automatic evaluation metric is whether it makes the same decision about which system is better in a head-to-head evaluation as we would get from a human-subjects evaluation. If each system is evaluated based on its average score, then success comes when the average computed metric correlates closely with the average human ranking. In particular, we measure the alignment between metric-assigned and human-assigned scores using the Kendall score, following the work of Sellam et al. (2020).

6 Results

Table 1 presents the results of the COIN-based study. The last row reports the Kendall correlation coefficient between the scores assigned by each metric and the scores assigned by humans.

All n-gram based metrics, such as BLEU and CIDEr, fail to adapt to the out-of-domain ground-truth captions from COIN. This results in a relatively flat distribution of system-level scores concentrated close to 0, and hence low correlation coefficients. CIDEr has a highly negative Kendall's τ, which denotes a strong negative association with human judgements. This is partly due to low (∼0.01) and hence noisy CIDEr scores (Figure 4 provides example cases that illustrate this argument).

Embedding-based methods BLEURT and BERTScore do not suffer from this limitation, resulting in more meaningful scoring of systems and hence higher correlation with human scores. However, by design, both of these metrics are agnostic to coherence labels and to the input image. COSMic, which is coherence-aware, obtains the highest correlation with human scores. COSMic ViLBERT has the highest Kendall's correlation among all of our models; COSMic Vanilla performs second best among our models and still outperforms all of the baseline metrics in terms of Kendall's correlation.

Data Augmentation. The raw RaCCoon training data has a coherence-level bias, as demonstrated by the average COSMic score for each class: Visible (0.622), Meta (0.459), Subjective (0.236), and Story (0.397). This reflects the human annotators' bias towards liking Visible captions the most and Subjective captions the least, which is expected. However, training COSMic on this data injects the same coherence bias into the model, which is undesirable. As presented in Table 1, both flavors of COSMic (without the '+') assign high scores to Visible captioning systems.

To mitigate this issue, we algorithmically augment the training data to bring the average scores for each coherence class to comparable values. We achieve this by pairing images with random captions from the same coherence class and assigning them a score of 0. This is a valid training sample because the randomly sampled caption does not describe the image in question and thus serves as a negative sample. With these operations, the class bias is significantly reduced: Visible (0.459), Meta (0.439), Subjective (0.328), and Story (0.425). The COSMic columns in Table 1 marked with '+' show that this data augmentation approach improves the ranking of captioning systems, leading to better alignment with human judgements.
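A minimal sketch of this augmentation is given below, assuming training records with hypothetical fields (image_id, caption, label, score). How many zero-scored negatives to add per class (n_per_class) would be chosen so that the class-wise average scores end up comparable; the paper does not spell out the exact counts.

```python
import random
from collections import defaultdict

def add_negative_samples(samples, n_per_class, seed=0):
    """Pair images with mismatched captions of the same coherence class and
    score them 0, so that class-wise average scores become comparable."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s in samples:
        by_label[s["label"]].append(s)

    negatives = []
    for label, count in n_per_class.items():
        group = by_label[label]
        for _ in range(count):
            anchor, donor = rng.sample(group, 2)   # two distinct records, same class
            negatives.append({
                "image_id": anchor["image_id"],     # keep the anchor image
                "caption": donor["caption"],        # caption from another image
                "label": label,
                "score": 0.0,                       # negative sample by construction
            })
    return samples + negatives
```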

Ablation Study. Table 2 reports the performance of COSMic Vanilla without coherence labels and/or the image as model inputs. We find that removal of image features affects COSMic's performance, showing the important contribution of images. The performance deteriorates significantly when the coherence labels are removed from the model (the "No c" column in Table 2). This demonstrates that COSMic successfully integrates coherence relations into the caption-scoring process.

[Figure 4: four COIN images with their reference captions and the Base-Visible model's outputs.]
Reference: "two men in scrubs performing surgery" / Generated: "surgeons operating on a patient"
Reference: "mountains in front of a clear blue sky" / Generated: "mountain range as seen from the trail"
Reference: "large brick building next to a green lawn and big trees" / Generated: "the front of the house"
Reference: "a foggy forest" / Generated: "light shining through the trees"

Figure 4: Illustration of COIN reference captions and corresponding outputs of the Base-Visible model. Though the generated captions are correct, an n-gram based metric such as CIDEr assigns them a very low score due to the variations in surface word realizations. See Table 1 for average scores over the test set. (Photo credits, from left to right: US Army Africa, Gabriel Fr, James Bradley, Rosmarie Voegtli)

Model | Coh. Label | COSMic (Full) | COSMic (No I) | COSMic (No c) | COSMic (No I & c)
Base | Visible | 0.516 | 0.447 | 0.434 | 0.442
Base | Meta | 0.505 | 0.439 | 0.442 | 0.453
Base | Subj | 0.356 | 0.347 | 0.438 | 0.453
Base | Story | 0.505 | 0.433 | 0.436 | 0.445
Lite | Visible | 0.515 | 0.444 | 0.434 | 0.433
Lite | Meta | 0.511 | 0.434 | 0.447 | 0.464
Lite | Subj | 0.379 | 0.367 | 0.440 | 0.459
Lite | Story | 0.499 | 0.440 | 0.433 | 0.442
Kendall's Corr. (τ) | | 0.667 | 0.546 | -0.222 | -0.415

Table 2: Ablation experiment results. "No I" is COSMic Vanilla without image features, "No c" is COSMic Vanilla without coherence-label embeddings, and "No I & c" is COSMic Vanilla without both image features and coherence-label embeddings.

7 Conclusion

Our work is the first step towards designing generation metrics that respect the information goals of the generated text. We observe that a small set of examples annotated with coherence relations can provide what is needed for learning a discourse-aware generation metric. Our findings have implications for designing context-aware multimodal metrics with criteria that are closer to human ratings for evaluating machine-generated multimodal content.

We have called attention to the challenge of learning robust generation metrics that can evaluate the output of generation models in light of their information goals. Our findings suggest that by fine-tuning ViLBERT, originally trained with millions of images, on a smaller sample of coherence relations and expert-annotated scores, automated metrics can score generated captions closer to a human rating. The presented dataset provides the opportunity for future research in the areas of image description generation, designing discourse-aware metrics, and multimodal content evaluation. We hope that coherence-aware text generation metrics can be used for learning better generation models (such as abstractive summarization or story generation) and can be deployed directly in machine learning pipelines to help optimize hyper-parameters. Ultimately, the goal is a generalizable model that can use a labeling mechanism, not restricted to coherence labels, to improve the applicability of generation metrics across different tasks.

8 Ethics

This paper describes a research prototype. We do not work with sensitive or personal data. Our protocol was approved by our ethics board. Human subjects participated voluntarily, undertook minimal risk, and were compensated fairly for their time. The dataset we produced is fully anonymized. Subjects consented to the distribution of their data as part of their participation in the research. Technologists should think carefully before deploying our ideas in production. Our work depends on pretrained models such as word and image embeddings. These models are known to reproduce and even magnify societal bias present in training data. Moreover, like many ML/NLP methods, our methods are likely to perform better for content that is better represented in training, leading to further bias against marginalized groups. We can hope that general methods to mitigate harms from ML bias can address these issues.

A distinctive complication of our work is the fact that many image-text presentations involve writers expressing subjective opinions. By its nature, our evaluation metric assesses such subjective texts based on averages and trends across many users, which may be problematic. Although such judgments are ultimately matters of personal taste, they are nevertheless often grounds by which hierarchies of difference are culturally encoded and enforced. Thus a deployed subjective-caption generation system could well be unfair to users, especially if those users are not confident in their own taste or critical towards the system's responses. Our evaluation metric is not sensitive to such harms.

Acknowledgements

The authors affiliated with Rutgers University were partly supported by NSF Award CCF-19349243. Thanks to Pitt Cyber for supporting this project and the authors from the University of Pittsburgh. We also acknowledge the Center for Research Computing at the University of Pittsburgh for providing the required computational resources for carrying out experiments at the University of Pittsburgh.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Abien Fred Agarap. 2018. Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375.

Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, and Matthew Stone. 2020. Cross-modal coherence modeling for caption generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6525–6535, Online. Association for Computational Linguistics.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016a. SPICE: Semantic propositional image caption evaluation. CoRR, abs/1607.08822.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016b. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server.

Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5804–5812.

Samuel Cumming, Gabriel Greenberg, and Rory Kelly. 2017. Conventions of viewpoint coherence in film. Philosophers' Imprint, 17(1):1–29.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304.

Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven C.H. Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449, Online. Association for Computational Linguistics.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR, abs/1512.03385.

Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, and John A. Bateman. 2021. AI2D-RST: A multimodal corpus of 1000 primary school science diagrams. Lang. Resour. Evaluation, 55(3):661–688.

Jerry R. Hobbs. 1985. On the coherence and structure of discourse.

Xinyue Huang and Adriana Kovashka. 2016. Inferring visual persuasion via body language, setting, and deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 73–79.

M. G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting language and vision using crowdsourced dense image annotations.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019a. Integrating text and image: Determining multimodal document intent in Instagram posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4622–4632, Hong Kong, China. Association for Computational Linguistics.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019b. Integrating text and image: Determining multimodal document intent in Instagram posts. arXiv preprint arXiv:1904.09073.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020a. The Open Images Dataset V4. International Journal of Computer Vision, 128(7):1956–1981.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020b. The Open Images Dataset V4. International Journal of Computer Vision, pages 1–26.

Alex Lascarides and Matthew Stone. 2009. A formal semantic analysis of gesture. Journal of Semantics, 26(4):393–449.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6539–6550, Florence, Italy. Association for Computational Linguistics.

William C. Mann and Sandra A. Thompson. 1987. Rhetorical structure theory: A theory of text organization. University of Southern California, Information Sciences Institute, Los Angeles.

Scott McCloud. 1993. Understanding Comics: The Invisible Art. William Morrow.

Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. 2020. Understanding guided image captioning performance across domains. arXiv preprint arXiv:2012.02339.

Christian Otto, Matthias Springstein, Avishek Anand, and Ralph Ewerth. 2019. Understanding, categorizing and predicting semantic image-text relations. In Proceedings of the 2019 International Conference on Multimedia Retrieval, pages 168–176. ACM.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation, pages 311–318.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008a. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008b. The Penn Discourse TreeBank 2.0. In LREC. Citeseer.

J. Pustejovsky and N. Krishnaswamy. 2020. Situated meaning in multimodal dialogue: Human-robot and human-computer interactions.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks.

Deborah Schiffrin. 1980. Meta-talk: Organizational and evaluative brackets in discourse. Sociological Inquiry, 50(3-4):199–236.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2014. CIDEr: Consensus-based image description evaluation. CoRR, abs/1411.5726.

Anthony Viera and Joanne Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family Medicine, 37:360–3.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Discourse-aware neural extractive text summarization. arXiv preprint arXiv:1910.14142.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Page 6: arXiv:2109.05281v1 [cs.CL] 11 Sep 2021

System AvgHum

Rating

Metrics

Model CohLabel B1 B2 M RL C S BR BS-

FCOSMicVanilla

COSMicViL-BERT

COSMicVanilla+

COSMicViL-BERT+

BUTD Visible 2191 163 077 049 160 092 030 -877 863 706 796 522 641

Base

Visible 3532 050 025 019 066 020 002 -1114 862 696 777 516 614Meta 3213 041 000 012 063 012 000 -1059 863 548 727 505 602Subj 2830 033 012 011 057 017 000 -1197 849 323 421 358 403Story 2915 029 000 017 058 013 000 -1304 842 533 629 482 527

Lite

Visible 3298 028 011 013 053 011 000 -1101 863 684 784 515 604Meta 2830 026 010 008 055 015 000 -1084 859 548 748 511 565Subj 2298 039 012 019 066 024 003 -1217 849 364 451 379 419Story 2426 036 000 018 062 021 000 -1362 842 568 666 499 519

KendallrsquosCorrelation (τ ) 1000 071 154 036 -036 -571 -052 286 445 571 546 667 764

Table 1 System-level scores for 9 different image captioning systems as evaluated by human annotators andvarious captioning metrics Bottom-Up Top-Down (BUTD) is trained on COCO while others are trained on theConceptual Captions (CC) dataset The evaluation however is conducted on COIN dataset which is out-of-domainfor both COCO and CC This domain shift causes the n-gram based metrics (eg BLEU ROUGE CIDEr) to assignvery low scores to otherwise correct captions (See Table 4) Whereas embedding based metrics (eg BLEURTBERTScore and COSMic) do not suffer from this limitation Since all metrics have different scales instead ofabsolute scores we use Kendall Rank Correlation to measure agreement with human scores Model names areabbreviated as follows B1 Bleu1 B2 Bleu2 M METEOR RL ROUGEL C CIDEr S SPICE BR BLEURTBS-F BERTScore F1 COSMic models with rsquo+rsquo denote application of data augmentation to remove training databias More metrics and detailed results can be found on the code repository

training this captioning system To obtain the co-herence labels above we closely follow (Alikhaniet al 2020) to train a coherence classifier on theClue dataset (Alikhani et al 2020) that providesaround 4K human annotated (image caption rela-tion) triplets We present two caption-generationsystems in this section

Base-systems family A family of 4 captioningsystems is created by setting the coherence-labelto Meta Visible Subjective or Story in the basecaptioning model described above These are con-sidered different captioning systems because theinformation content and discourse goals as con-trolled by the coherence label are different

Lite-systems family We remove the global im-age features from the base modelrsquos input to obtaina smaller light-weight (lite) model Similar to thebase model we obtain a family of 4 captioningsystems by changing the coherence-label

In Section 5 we study the order in which sev-eral image captioning metrics rank these 8 systemsThe goal is to identify the metric that agrees themost with the groundtruth rankings based on hu-man assessments

44 COCO-trained Captioning SystemCOSMicrsquos training data RaCCoon is based onConceptual Captions and it is coherence-aware Totest the modelrsquos generalization capability we use

a captioning system trained on MS COCO (Chenet al 2015) Since COSMic expects an input co-herence label and COCO captions are Visible styleby design we set the label to Visible Specificallywe use the Bottom-Up Top-Down (BUTD) Atten-tion model (Anderson et al 2018) This helpsstudy how well COSMic generalizes to other cap-tioning datasets and coherence-agnostic captioningsystems

5 Experiments

Here we describe the experimental setup to com-pare COSMic with other metrics As outlined inSection 3 and 4 we use the RaCCoon data to trainour models and COIN to test COSMic and othermetrics We have several baseline metrics that wecompare to which can be found on Table 1

51 Model Training Setup

We implement COSMicmdashas described in Sec-tion 4mdashwith PyTorch (Paszke et al 2019) and trainon a GTX1080 GPU We pre-compute BERT3 andResNet4 features using their TensorFlow (Abadiet al 2015) implementations We use the pub-

3httpsgithubcomgoogle-researchbert

4httpswwwtensorfloworgapi_docspythontfkerasapplicationsResNet50V2

lic ViLBERT5 implementation We use a batchsize of 4 and a learning rate of 2times 10minus6 for fine-tuning ViLBERT and use RAdam optimizer andstop the training when the validation score doesnot change for 3 epochs For COSMic Vanillawe train with a batch-size of 10 Adam optimizer(Kingma and Ba 2017) with a base learning rateof 10minus3 that decays by a factor of 10minus2 every 10epochs We observe that the Vanilla convergesin approximately 100 epochs and ViLBERT con-verges in 9 epochs ViLBERT has 250 millionparameters COSMic Vanilla includes 3062913trainable parameters Pre-trained BERT-Large andResNet50V2 have an additional 350 million param-eters The setup for coherence-aware captioningmodels to obtain machine-generated captions forour study is the same as (Alikhani et al 2020)

52 Baseline Captioning Metrics

To benchmark COSMic we compare it with otherlearned metrics In this section we describe thesevarious metrics traditionally used for measuringimage captioning systems None of these metricswere designed to support the coherence relationsof the reference or generated captions These serveas baselines for COSMic

N-gram based The most popular image caption-ing metrics are based on precision and recall of n-grams from generated and reference captions Wecompare with Bleu1 Bleu2 Bleu3 Bleu4 (Guo andHu 2019) ROUGEL (Lin 2004) CIDEr (Vedan-tam et al 2015) and SPICE (Anderson et al2016b) We compute these using their popularopen-source implementation6

BLEURT We use a pre-trained BLEURT model7

as a baseline for our work Unlike N-gram basedapproaches BLEURT uses BERT-based word em-beddings which are robust to variations in surfaceword realizations between the reference and gen-erated captions We do not do any fine-tuning forthis baseline

BERTScore BERTScore8 uses a pre-trainedBERT model to embed the reference and gener-ated captions Text-level similarity scores are then

5httpsgithubcomfacebookresearchvilbert-multi-task

6httpsgithubcomtylincoco-caption7httpsgithubcomgoogle-research

bleurt8httpsgithubcomTiiigerbert_score

computed by matching the tokensrsquo output embed-dings

Please note that for both BERT-based baselinesabove (BLEURT BERTScore) we use the BERT-Large-512 size model

53 COIN-based Evaluation Setup

We use each baseline metric and COSMic to scorethe 8 different image captioning systems describedin Section 4 on the same set of test images withreference captions Note that the range and scaleof each metric is different however they are allmonotonously increasing functions of model qual-ity So in our study we do not analyze the abso-lute score assigned by these metrics but only theirranks We also ask human annotators to rank these8 captioning systems on the same set of test im-ages The ranks assigned by a higher performingmetric will align better with the ranks from humanannotators

Since the captioning systems above are trainedon Conceptual Captions or COCO we use im-agecaption pairs from COIN for an out-of-domainevaluation A subset of 50 random images is usedto rank the captioning systems as described aboveresulting in 400 machine-generated captions totalfor the 8 captioning systems These were thenevaluated by human annotators using the processdescribed in Section 3 The human-scored systemlevel performance for each captioning system onthis test set is reported in Table 1 in ldquoAverage Hu-man Ratingrdquo

We measure the alignment between metric-assigned and human-assigned scores using theKendall (Kendall 1938) correlation coefficient Inorder to calculate the score we first aggregate allthe sample scores and average them Then wecalculate the Kendall tau score using the SciPy171 implementation The score is calculatedbetween two vectors first of which is the aver-age human ratings for 8 models and the secondbeing the investigated metric scores for 8 mod-els in the following order[BaseV isible BaseMetaBaseSubjective BaseStory LiteV isible LiteMetaLiteSubjective LiteStory] Due to the small sam-ple size Kendall correlation is the most suitablecorrelation measure

A key measure of the success of an automaticevaluation metric is whether it makes the same deci-sion about which system is better in a head-to-headevaluation as we would get from a human-subjects

evaluation If each system is evaluated based onits average score then success comes when the av-erage computed metric correlates closely with theaverage human-ranking In particular we measurethe alignment between metric assigned and humanassigned scores using the Kendall score followingthe work of (Sellam et al 2020)

6 Results

Table 1 presents the results of the COIN-basedstudy The last row reports the Kendall correla-tion coefficient between the scores assigned by themetric and humans

All N-gram based metrics such as BLEU andCIDEr fail to adapt to the out-of-domain ground-truth captions from COIN This results in a rela-tively flat distribution of system-level scores con-centrated close to 0 and hence low correlation co-efficients CIDEr has a highly negative Kendallrsquosτ which denotes a strong negative associationwith human judgements This is partly due to low(sim001) and hence noisy CIDEr scores (Figure 4provides example cases that illustrate this argu-ment)

Embedding-based methods BLEURT andBERTScore do not suffer from this limitation re-sulting in more meaningful scoring of systems andhence higher correlation with human scores How-ever by design both these metrics are agnostic tocoherence-labels and the input image COSMicwhich is coherence-aware obtains the highest cor-relation with human scores COSMic ViLBERThas the highest Kendallrsquos correlation among all ofour models COSMic Vanilla performs the sec-ond best among our models and it performs betterthan the rest of the models in terms of Kendallrsquoscorrelation

Data Augmentation The raw RaCCoon trainingdata has a coherence-level bias as demonstrated bythe average COSMic score for each class mdash Visi-ble (0622) Meta (0459) Subjective (0236) andStory (0397) This reflects the human annotatorsrsquobias towards liking Visible captions the most andSubjective captions the least which is expectedHowever training COSMic on this data injects thesame coherence-bias into the model which is un-desirable As presented in Table 1 both flavors ofCOSMic (without the lsquo+rsquo) assign high scores toVisible captioning systems

To mitigate this issue we algorithmically aug-ment the training data to bring the average scoresfor each coherence class to comparable values Weachieve this by pairing images with random cap-tions from the coherence class and assigning thema score of 0 This is a valid training sample becausethe randomly sampled caption does not describe thesaid image and serves as a negative sample Withthese operations the class bias is significantly re-duced mdash Visible (0459) Meta (0439) Subjective(0328) and Story (0425) The COSMic columnsin Table 1 with lsquo+rsquo denote that this data augmen-tation approach improves ranking of captioningsystems leading to better alignment with humanjudgements

Ablation Study Table 2 reports the perfor-mance of COSMic Vanilla without coherence-labels andor the image as model inputs We findthat removal of image features affects COSMicrsquosperformance showing the important contributionof images The performance deteriorates signifi-cantly when the coherence-labels are removed fromthe model (No rc gc column in Table 2) Thisdemonstrates that COSMic successfully integratescoherence-relations in the caption scoring process

Reference two men in scrubs per-forming surgery

mountains in front of aclear blue sky

large brick building next toa green lawn and big trees

a foggy forest

Generated surgeons operating on apatient

mountain range as seenfrom the trail

the front of the house light shining throughthe trees

Figure 4 Illustration of COIN reference captions and corresponding outputs of the Base-Visible model Thoughthe generated captions are correct an n-gram based metric such as CIDEr assigns them a very low score due to thevariations in surface word realizations See Table 1 for average scores over the test set (Photo credits from left toright US Army Africa Gabriel Fr James Bradley Rosmarie Voegtli)

System COSMic

Model CohLabel

Full No I No c No I amp c

Base

Visible 516 447 434 442Meta 505 439 442 453Subj 356 347 438 453Story 505 433 436 445

Lite

Visible 515 444 434 433Meta 511 434 447 464Subj 379 367 440 459Story 499 440 433 442

KendallrsquosCorr (τ ) 667 546 -222 -415

Table 2 Ablation experiment results No I repre-sents COSMic Vanilla without image features Norc gc represents COSMic Vanilla without coherencelabel embeddings finally No I amp No rc gc repre-sents COSMic Vanilla without coherence label embed-dings and without image features

7 Conclusion

Our work is the first step towards designing genera-tion metrics that respect the information goal of thegenerated text We observe that a small set of ex-amples annotated with coherence relations can pro-vide what is needed for learning a discourse-awaregeneration metric Our findings have implicationsfor designing context-aware multimodal metricswith criteria that are closer to human ratings forevaluating machine-generated multimodal content

We have called attention to the challenge oflearning robust generation metrics that can eval-uate the output of the generation models consid-ering the information goals Our findings sug-gest that fine-tuning ViLBERTmdashoriginally trainedwith millions of imagesmdashwith a smaller sample ofcoherence relations and expert-annotated scoringautomated metrics can score generated captionscloser to a human rating The presented datasetprovides the opportunity for future research in thearea of image description generation designingdiscourse-aware metrics and multimodal contentevaluation We hope that coherence-aware text gen-eration metrics could be used for learning bettergeneration models (such as abstractive summariza-tion or story generation) and could be deployeddirectly in machine learning pipelines to help inoptimizing hyper-parameters Ultimately it is in-tended to have a generalizable model that can usea labeling mechanismmdashnot restricted to coherencelabelsmdash to improve applicability of generation met-rics in different tasks

8 Ethics

This paper describes a research prototype We donot work with sensitive or personal data Our pro-tocol was approved by our ethics board Humansubjects participated voluntarily undertook min-imal risk and were compensated fairly for theirtime The dataset we produced is fully anonymizedSubjects consented to the distribution of their dataas part of their participation in the research Tech-nologists should think carefully before deployingour ideas in production Our work depends onpretrained models such as word and image embed-dings These models are known to reproduce andeven magnify societal bias present in training dataMoreover like many ML NLP methods our meth-ods are likely to perform better for content thatis better represented in training leading to furtherbias against marginalized groups We can hope thatgeneral methods to mitigate harms from ML biascan address these issues

A distinctive complication of our work is the factthat many imagendashtext presentations involve writ-ers expressing subjective opinions By its natureour evaluation metric assesses such subjective textsbased on averages and trends across many userswhich may be problematic Although such judg-ments are ultimately matters of personal taste theyare nevertheless often grounds by which hierarchiesof differences are culturally encoded and enforcedThus a deployed subjective-caption generation sys-tem could well be unfair to users especially if thoseusers are not confident in their own taste or criticaltowards the systemrsquos responses Our evaluationmetric is not sensitive to such harms

Acknowledgements

The authors affiliated with Rutgers University werepartly supported by NSF Award CCF-19349243Thanks to Pitt Cyber for supporting this project andthe authors from the University of Pittsburgh Wealso acknowledge the Center for Research Comput-ing at the University of Pittsburgh for providing therequired computational resources for carrying outexperiments at the University of Pittsburgh

ReferencesMartiacuten Abadi Ashish Agarwal Paul Barham Eugene

Brevdo Zhifeng Chen Craig Citro Greg S CorradoAndy Davis Jeffrey Dean Matthieu Devin SanjayGhemawat Ian Goodfellow Andrew Harp Geoffrey

Irving Michael Isard Yangqing Jia Rafal Jozefow-icz Lukasz Kaiser Manjunath Kudlur Josh Leven-berg Dandelion Maneacute Rajat Monga Sherry MooreDerek Murray Chris Olah Mike Schuster JonathonShlens Benoit Steiner Ilya Sutskever Kunal TalwarPaul Tucker Vincent Vanhoucke Vijay VasudevanFernanda Vieacutegas Oriol Vinyals Pete Warden Mar-tin Wattenberg Martin Wicke Yuan Yu and Xiao-qiang Zheng 2015 TensorFlow Large-scale ma-chine learning on heterogeneous systems Softwareavailable from tensorfloworg

Abien Fred Agarap 2018 Deep learning using recti-fied linear units (relu) CoRR abs180308375

Malihe Alikhani Piyush Sharma Shengjie Li RaduSoricut and Matthew Stone 2020 Cross-modal co-herence modeling for caption generation In Pro-ceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics pages 6525ndash6535 Online Association for Computational Lin-guistics

Peter Anderson Basura Fernando Mark Johnsonand Stephen Gould 2016a SPICE semanticpropositional image caption evaluation CoRRabs160708822

Peter Anderson Basura Fernando Mark Johnson andStephen Gould 2016b Spice Semantic propo-sitional image caption evaluation In EuropeanConference on Computer Vision pages 382ndash398Springer

Peter Anderson Xiaodong He Chris Buehler DamienTeney Mark Johnson Stephen Gould and LeiZhang 2018 Bottom-up and top-down attention forimage captioning and visual question answering InProceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR)

Xinlei Chen Hao Fang Tsung-Yi Lin Ramakr-ishna Vedantam Saurabh Gupta Piotr Dollar andC Lawrence Zitnick 2015 Microsoft coco cap-tions Data collection and evaluation server

Yin Cui Guandao Yang Andreas Veit Xun Huangand Serge Belongie 2018 Learning to evaluate im-age captioning In Proceedings of the IEEE con-ference on computer vision and pattern recognitionpages 5804ndash5812

Samuel Cumming Gabriel Greenberg and Rory Kelly2017 Conventions of viewpoint coherence in filmPhilosophersrsquo Imprint 17(1)1ndash29

J Deng W Dong R Socher L-J Li K Li and L Fei-Fei 2009 ImageNet A Large-Scale HierarchicalImage Database In CVPR09

Michael Denkowski and Alon Lavie 2014 Meteor uni-versal Language specific translation evaluation forany target language In Proceedings of the EACL2014 Workshop on Statistical Machine Translation

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Haoyuan Gao Junhua Mao Jie Zhou Zhiheng HuangLei Wang and Wei Xu 2015 Are you talking to amachine dataset and methods for multilingual im-age question In Advances in Neural InformationProcessing Systems pages 2296ndash2304

Yifan Gao Chien-Sheng Wu Jingjing Li Shafiq JotySteven CH Hoi Caiming Xiong Irwin King andMichael Lyu 2020 Discern Discourse-aware en-tailment reasoning network for conversational ma-chine reading In Proceedings of the 2020 Confer-ence on Empirical Methods in Natural LanguageProcessing (EMNLP) pages 2439ndash2449 Online As-sociation for Computational Linguistics

Yinuo Guo and Junfeng Hu 2019 Meteor++ 20Adopt syntactic level paraphrase knowledge into ma-chine translation evaluation In Proceedings of theFourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 501ndash506 Flo-rence Italy Association for Computational Linguis-tics

Kaiming He Xiangyu Zhang Shaoqing Ren and JianSun 2015 Deep residual learning for image recog-nition CoRR abs151203385

Tuomo Hiippala Malihe Alikhani Jonas HaverinenTimo Kalliokoski Evanfiya Logacheva SerafinaOrekhova Aino Tuomainen Matthew Stone andJohn A Bateman 2021 AI2D-RST a multimodalcorpus of 1000 primary school science diagramsLang Resour Evaluation 55(3)661ndash688

Jerry R Hobbs 1985 On the coherence and structureof discourse

Xinyue Huang and Adriana Kovashka 2016 Inferringvisual persuasion via body language setting anddeep features In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern RecognitionWorkshops pages 73ndash79

M G Kendall 1938 A new measure of rank correla-tion Biometrika 30(12)81ndash93

Diederik P Kingma and Jimmy Ba 2017 Adam Amethod for stochastic optimization

Ranjay Krishna Yuke Zhu Oliver Groth Justin John-son Kenji Hata Joshua Kravitz Stephanie ChenYannis Kalantidis Li-Jia Li David A ShammaMichael Bernstein and Li Fei-Fei 2016 Visualgenome Connecting language and vision usingcrowdsourced dense image annotations

Julia Kruk Jonah Lubin Karan Sikka Xiao Lin DanJurafsky and Ajay Divakaran 2019a Integratingtext and image Determining multimodal documentintent in Instagram posts In Proceedings of the2019 Conference on Empirical Methods in Natu-ral Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 4622ndash4632 Hong KongChina Association for Computational Linguistics

Julia Kruk Jonah Lubin Karan Sikka Xiao Lin DanJurafsky and Ajay Divakaran 2019b Integrat-ing text and image Determining multimodal doc-ument intent in instagram posts arXiv preprintarXiv190409073

Alina Kuznetsova Hassan Rom Neil Alldrin JasperUijlings Ivan Krasin Jordi Pont-Tuset ShahabKamali Stefan Popov Matteo Malloci AlexanderKolesnikov and et al 2020a The open imagesdataset v4 International Journal of Computer Vi-sion 128(7)1956ndash1981

Alina Kuznetsova Hassan Rom Neil Alldrin JasperUijlings Ivan Krasin Jordi Pont-Tuset ShahabKamali Stefan Popov Matteo Malloci AlexanderKolesnikov et al 2020b The open images datasetv4 International Journal of Computer Vision pages1ndash26

Alex Lascarides and Matthew Stone 2009 A formalsemantic analysis of gesture Journal of Semantics26(4)393ndash449

Chin-Yew Lin 2004 ROUGE A package for auto-matic evaluation of summaries In Text Summariza-tion Branches Out pages 74ndash81 Barcelona SpainAssociation for Computational Linguistics

Jiasen Lu Dhruv Batra Devi Parikh and StefanLee 2019 Vilbert Pretraining task-agnostic visi-olinguistic representations for vision-and-languagetasks In Advances in Neural Information Process-ing Systems volume 32 Curran Associates Inc

Pranava Madhyastha Josiah Wang and Lucia Specia2019 VIFIDEL Evaluating the visual fidelity ofimage descriptions In Proceedings of the 57th An-nual Meeting of the Association for ComputationalLinguistics pages 6539ndash6550 Florence Italy Asso-ciation for Computational Linguistics

William C Mann and Sandra A Thompson 1987Rhetorical structure theory A theory of text orga-nization University of Southern California Infor-mation Sciences Institute Los Angeles

Scott McCloud 1993 Understanding comics The in-visible art William Morrow

Edwin G Ng Bo Pang Piyush Sharma and RaduSoricut 2020 Understanding guided image cap-tioning performance across domains arXiv preprintarXiv201202339

Christian Otto Matthias Springstein Avishek Anandand Ralph Ewerth 2019 Understanding catego-rizing and predicting semantic image-text relationsIn Proceedings of the 2019 on International Con-ference on Multimedia Retrieval pages 168ndash176ACM

Kishore Papineni Salim Roukos Todd Ward and Weijing Zhu 2002 Bleu a method for automatic evalu-ation of machine translation pages 311ndash318

Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Advances in Neural Information Pro-cessing Systems 32 pages 8024ndash8035 Curran Asso-ciates Inc

Rashmi Prasad Nikhil Dinesh Alan Lee Eleni Milt-sakaki Livio Robaldo Aravind Joshi and Bon-nie Webber 2008a The Penn Discourse Tree-Bank 20 In Proceedings of the Sixth Interna-tional Conference on Language Resources and Eval-uation (LRECrsquo08) Marrakech Morocco EuropeanLanguage Resources Association (ELRA)

Rashmi Prasad Nikhil Dinesh Alan Lee Eleni Milt-sakaki Livio Robaldo Aravind K Joshi and Bon-nie L Webber 2008b The Penn discourse treebank20 In LREC Citeseer

J Pustejovsky and N Krishnaswamy 2020 Situatedmeaning in multimodal dialogue human-robot andhuman-computer interactions

Shaoqing Ren Kaiming He Ross Girshick and JianSun 2016 Faster r-cnn Towards real-time objectdetection with region proposal networks

Deborah Schiffrin 1980 Meta-talk Organizationaland evaluative brackets in discourse SociologicalInquiry 50(3-4)199ndash236

Thibault Sellam Dipanjan Das and Ankur Parikh2020 BLEURT Learning robust metrics for textgeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 7881ndash7892 Online Association for Computa-tional Linguistics

Piyush Sharma Nan Ding Sebastian Goodman andRadu Soricut 2018 Conceptual captions Acleaned hypernymed image alt-text dataset for au-tomatic image captioning In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages2556ndash2565 Melbourne Australia Association forComputational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez Ł ukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems volume 30 pages 5998ndash6008 Cur-ran Associates Inc

Ramakrishna Vedantam C Lawrence Zitnick and DeviParikh 2015 CIDEr Consensus-based image de-scription evaluation In Proceedings of the IEEEconference on computer vision and pattern recogni-tion pages 4566ndash4575

Ramakrishna Vedantam C Lawrence Zitnick andDevi Parikh 2014 Cider Consensus-based imagedescription evaluation CoRR abs14115726

Anthony Viera and Joanne Garrett 2005 Understand-ing interobserver agreement The kappa statisticFamily medicine 37360ndash3

Jiacheng Xu Zhe Gan Yu Cheng and Jingjing Liu2019 Discourse-aware neural extractive text sum-marization arXiv preprint arXiv191014142

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2020 Bertscore Eval-uating text generation with bert In InternationalConference on Learning Representations

Page 7: arXiv:2109.05281v1 [cs.CL] 11 Sep 2021

lic ViLBERT5 implementation We use a batchsize of 4 and a learning rate of 2times 10minus6 for fine-tuning ViLBERT and use RAdam optimizer andstop the training when the validation score doesnot change for 3 epochs For COSMic Vanillawe train with a batch-size of 10 Adam optimizer(Kingma and Ba 2017) with a base learning rateof 10minus3 that decays by a factor of 10minus2 every 10epochs We observe that the Vanilla convergesin approximately 100 epochs and ViLBERT con-verges in 9 epochs ViLBERT has 250 millionparameters COSMic Vanilla includes 3062913trainable parameters Pre-trained BERT-Large andResNet50V2 have an additional 350 million param-eters The setup for coherence-aware captioningmodels to obtain machine-generated captions forour study is the same as (Alikhani et al 2020)

52 Baseline Captioning Metrics

To benchmark COSMic we compare it with otherlearned metrics In this section we describe thesevarious metrics traditionally used for measuringimage captioning systems None of these metricswere designed to support the coherence relationsof the reference or generated captions These serveas baselines for COSMic

N-gram based The most popular image caption-ing metrics are based on precision and recall of n-grams from generated and reference captions Wecompare with Bleu1 Bleu2 Bleu3 Bleu4 (Guo andHu 2019) ROUGEL (Lin 2004) CIDEr (Vedan-tam et al 2015) and SPICE (Anderson et al2016b) We compute these using their popularopen-source implementation6

BLEURT We use a pre-trained BLEURT model7

as a baseline for our work Unlike N-gram basedapproaches BLEURT uses BERT-based word em-beddings which are robust to variations in surfaceword realizations between the reference and gen-erated captions We do not do any fine-tuning forthis baseline

BERTScore BERTScore8 uses a pre-trainedBERT model to embed the reference and gener-ated captions Text-level similarity scores are then

5httpsgithubcomfacebookresearchvilbert-multi-task

6httpsgithubcomtylincoco-caption7httpsgithubcomgoogle-research

bleurt8httpsgithubcomTiiigerbert_score

computed by matching the tokensrsquo output embed-dings

Please note that for both BERT-based baselinesabove (BLEURT BERTScore) we use the BERT-Large-512 size model

53 COIN-based Evaluation Setup

We use each baseline metric and COSMic to scorethe 8 different image captioning systems describedin Section 4 on the same set of test images withreference captions Note that the range and scaleof each metric is different however they are allmonotonously increasing functions of model qual-ity So in our study we do not analyze the abso-lute score assigned by these metrics but only theirranks We also ask human annotators to rank these8 captioning systems on the same set of test im-ages The ranks assigned by a higher performingmetric will align better with the ranks from humanannotators

Since the captioning systems above are trainedon Conceptual Captions or COCO we use im-agecaption pairs from COIN for an out-of-domainevaluation A subset of 50 random images is usedto rank the captioning systems as described aboveresulting in 400 machine-generated captions totalfor the 8 captioning systems These were thenevaluated by human annotators using the processdescribed in Section 3 The human-scored systemlevel performance for each captioning system onthis test set is reported in Table 1 in ldquoAverage Hu-man Ratingrdquo

We measure the alignment between metric-assigned and human-assigned scores using theKendall (Kendall 1938) correlation coefficient Inorder to calculate the score we first aggregate allthe sample scores and average them Then wecalculate the Kendall tau score using the SciPy171 implementation The score is calculatedbetween two vectors first of which is the aver-age human ratings for 8 models and the secondbeing the investigated metric scores for 8 mod-els in the following order[BaseV isible BaseMetaBaseSubjective BaseStory LiteV isible LiteMetaLiteSubjective LiteStory] Due to the small sam-ple size Kendall correlation is the most suitablecorrelation measure

A key measure of the success of an automaticevaluation metric is whether it makes the same deci-sion about which system is better in a head-to-headevaluation as we would get from a human-subjects

evaluation If each system is evaluated based onits average score then success comes when the av-erage computed metric correlates closely with theaverage human-ranking In particular we measurethe alignment between metric assigned and humanassigned scores using the Kendall score followingthe work of (Sellam et al 2020)

6 Results

Table 1 presents the results of the COIN-basedstudy The last row reports the Kendall correla-tion coefficient between the scores assigned by themetric and humans

All N-gram based metrics such as BLEU andCIDEr fail to adapt to the out-of-domain ground-truth captions from COIN This results in a rela-tively flat distribution of system-level scores con-centrated close to 0 and hence low correlation co-efficients CIDEr has a highly negative Kendallrsquosτ which denotes a strong negative associationwith human judgements This is partly due to low(sim001) and hence noisy CIDEr scores (Figure 4provides example cases that illustrate this argu-ment)

Embedding-based methods BLEURT andBERTScore do not suffer from this limitation re-sulting in more meaningful scoring of systems andhence higher correlation with human scores How-ever by design both these metrics are agnostic tocoherence-labels and the input image COSMicwhich is coherence-aware obtains the highest cor-relation with human scores COSMic ViLBERThas the highest Kendallrsquos correlation among all ofour models COSMic Vanilla performs the sec-ond best among our models and it performs betterthan the rest of the models in terms of Kendallrsquoscorrelation

Data Augmentation The raw RaCCoon trainingdata has a coherence-level bias as demonstrated bythe average COSMic score for each class mdash Visi-ble (0622) Meta (0459) Subjective (0236) andStory (0397) This reflects the human annotatorsrsquobias towards liking Visible captions the most andSubjective captions the least which is expectedHowever training COSMic on this data injects thesame coherence-bias into the model which is un-desirable As presented in Table 1 both flavors ofCOSMic (without the lsquo+rsquo) assign high scores toVisible captioning systems

To mitigate this issue we algorithmically aug-ment the training data to bring the average scoresfor each coherence class to comparable values Weachieve this by pairing images with random cap-tions from the coherence class and assigning thema score of 0 This is a valid training sample becausethe randomly sampled caption does not describe thesaid image and serves as a negative sample Withthese operations the class bias is significantly re-duced mdash Visible (0459) Meta (0439) Subjective(0328) and Story (0425) The COSMic columnsin Table 1 with lsquo+rsquo denote that this data augmen-tation approach improves ranking of captioningsystems leading to better alignment with humanjudgements

Ablation Study. Table 2 reports the performance of COSMic Vanilla without coherence labels and/or the image as model inputs. We find that removing the image features degrades COSMic's performance, showing the important contribution of images. The performance deteriorates significantly when the coherence labels are removed from the model (the "No c" column in Table 2). This demonstrates that COSMic successfully integrates coherence relations in the caption scoring process.
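
For illustration only, the ablation variants can be thought of as dropping the corresponding inputs from a scorer that fuses text, image, and coherence-label features before a regression head. The sketch below is our own simplified stand-in, not the actual COSMic Vanilla architecture; the feature dimensions and concatenation-based fusion are assumptions, and in the reported experiments each ablation is trained without the removed inputs rather than simply zeroing them at test time.

import torch
import torch.nn as nn

class ToyCoherenceAwareScorer(nn.Module):
    """Simplified stand-in used only to illustrate the ablation columns."""
    def __init__(self, text_dim=768, image_dim=2048, coh_dim=32):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim + 2 * coh_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, text_feat, image_feat, ref_coh_emb, gen_coh_emb,
                use_image=True, use_coherence=True):
        # "No I" drops the image features; "No c" drops the coherence-label
        # embeddings (r_c, g_c); "No I & c" drops both.
        if not use_image:
            image_feat = torch.zeros_like(image_feat)
        if not use_coherence:
            ref_coh_emb = torch.zeros_like(ref_coh_emb)
            gen_coh_emb = torch.zeros_like(gen_coh_emb)
        fused = torch.cat([text_feat, image_feat, ref_coh_emb, gen_coh_emb], dim=-1)
        return self.head(fused).squeeze(-1)  # predicted caption quality score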

Reference: "two men in scrubs performing surgery" | "mountains in front of a clear blue sky" | "large brick building next to a green lawn and big trees" | "a foggy forest"

Generated: "surgeons operating on a patient" | "mountain range as seen from the trail" | "the front of the house" | "light shining through the trees"

Figure 4: Illustration of COIN reference captions and corresponding outputs of the Base-Visible model. Though the generated captions are correct, an n-gram based metric such as CIDEr assigns them a very low score due to the variations in surface word realizations. See Table 1 for average scores over the test set. (Photo credits, from left to right: US Army Africa, Gabriel Fr, James Bradley, Rosmarie Voegtli)

Model   Coh. Label          Full     No I     No c     No I & c
Base    Visible             0.516    0.447    0.434    0.442
Base    Meta                0.505    0.439    0.442    0.453
Base    Subj                0.356    0.347    0.438    0.453
Base    Story               0.505    0.433    0.436    0.445
Lite    Visible             0.515    0.444    0.434    0.433
Lite    Meta                0.511    0.434    0.447    0.464
Lite    Subj                0.379    0.367    0.440    0.459
Lite    Story               0.499    0.440    0.433    0.442
Kendall's Corr (τ)          0.667    0.546   -0.222   -0.415

Table 2: Ablation experiment results. "No I" is COSMic Vanilla without image features; "No c" is COSMic Vanilla without the coherence label embeddings (r_c, g_c); "No I & c" is COSMic Vanilla without both the coherence label embeddings and the image features.

7 Conclusion

Our work is the first step towards designing generation metrics that respect the information goal of the generated text. We observe that a small set of examples annotated with coherence relations can provide what is needed for learning a discourse-aware generation metric. Our findings have implications for designing context-aware multimodal metrics with criteria that are closer to human ratings for evaluating machine-generated multimodal content.

We have called attention to the challenge of learning robust generation metrics that can evaluate the output of generation models while considering their information goals. Our findings suggest that by fine-tuning ViLBERT, originally trained with millions of images, on a smaller sample of coherence relations and expert-annotated scores, automated metrics can score generated captions closer to human ratings. The presented dataset provides the opportunity for future research in the areas of image description generation, discourse-aware metric design, and multimodal content evaluation. We hope that coherence-aware text generation metrics can be used for learning better generation models (such as abstractive summarization or story generation) and can be deployed directly in machine learning pipelines to help optimize hyper-parameters. Ultimately, the goal is a generalizable model that can use a labeling mechanism, not restricted to coherence labels, to improve the applicability of generation metrics across different tasks.

8 Ethics

This paper describes a research prototype. We do not work with sensitive or personal data. Our protocol was approved by our ethics board. Human subjects participated voluntarily, undertook minimal risk, and were compensated fairly for their time. The dataset we produced is fully anonymized. Subjects consented to the distribution of their data as part of their participation in the research. Technologists should think carefully before deploying our ideas in production. Our work depends on pretrained models such as word and image embeddings. These models are known to reproduce and even magnify societal bias present in training data. Moreover, like many ML/NLP methods, our methods are likely to perform better for content that is better represented in training, leading to further bias against marginalized groups. We can hope that general methods to mitigate harms from ML bias can address these issues.

A distinctive complication of our work is the fact that many image–text presentations involve writers expressing subjective opinions. By its nature, our evaluation metric assesses such subjective texts based on averages and trends across many users, which may be problematic. Although such judgments are ultimately matters of personal taste, they are nevertheless often grounds by which hierarchies of differences are culturally encoded and enforced. Thus a deployed subjective-caption generation system could well be unfair to users, especially if those users are not confident in their own taste or critical towards the system's responses. Our evaluation metric is not sensitive to such harms.

Acknowledgements

The authors affiliated with Rutgers University were partly supported by NSF Award CCF-19349243. Thanks to Pitt Cyber for supporting this project and the authors from the University of Pittsburgh. We also acknowledge the Center for Research Computing at the University of Pittsburgh for providing the computational resources required for carrying out the experiments at the University of Pittsburgh.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Abien Fred Agarap. 2018. Deep learning using rectified linear units (ReLU). CoRR, abs/1803.08375.

Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, and Matthew Stone. 2020. Cross-modal coherence modeling for caption generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6525–6535, Online. Association for Computational Linguistics.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016a. SPICE: Semantic propositional image caption evaluation. CoRR, abs/1607.08822.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016b. SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server.

Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5804–5812.

Samuel Cumming, Gabriel Greenberg, and Rory Kelly. 2017. Conventions of viewpoint coherence in film. Philosophers' Imprint, 17(1):1–29.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR09.

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304.

Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven C.H. Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449, Online. Association for Computational Linguistics.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR, abs/1512.03385.

Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, and John A. Bateman. 2021. AI2D-RST: A multimodal corpus of 1000 primary school science diagrams. Lang. Resour. Evaluation, 55(3):661–688.

Jerry R. Hobbs. 1985. On the coherence and structure of discourse.

Xinyue Huang and Adriana Kovashka. 2016. Inferring visual persuasion via body language, setting, and deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 73–79.

M. G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting language and vision using crowdsourced dense image annotations.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019a. Integrating text and image: Determining multimodal document intent in Instagram posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4622–4632, Hong Kong, China. Association for Computational Linguistics.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019b. Integrating text and image: Determining multimodal document intent in Instagram posts. arXiv preprint arXiv:1904.09073.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020a. The Open Images Dataset V4. International Journal of Computer Vision, 128(7):1956–1981.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020b. The Open Images Dataset V4. International Journal of Computer Vision, pages 1–26.

Alex Lascarides and Matthew Stone. 2009. A formal semantic analysis of gesture. Journal of Semantics, 26(4):393–449.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6539–6550, Florence, Italy. Association for Computational Linguistics.

William C. Mann and Sandra A. Thompson. 1987. Rhetorical structure theory: A theory of text organization. University of Southern California, Information Sciences Institute, Los Angeles.

Scott McCloud. 1993. Understanding Comics: The Invisible Art. William Morrow.

Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. 2020. Understanding guided image captioning performance across domains. arXiv preprint arXiv:2012.02339.

Christian Otto, Matthias Springstein, Avishek Anand, and Ralph Ewerth. 2019. Understanding, categorizing and predicting semantic image-text relations. In Proceedings of the 2019 International Conference on Multimedia Retrieval, pages 168–176. ACM.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation, pages 311–318.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008a. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008b. The Penn Discourse TreeBank 2.0. In LREC. Citeseer.

J. Pustejovsky and N. Krishnaswamy. 2020. Situated meaning in multimodal dialogue: Human-robot and human-computer interactions.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks.

Deborah Schiffrin. 1980. Meta-talk: Organizational and evaluative brackets in discourse. Sociological Inquiry, 50(3-4):199–236.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2014. CIDEr: Consensus-based image description evaluation. CoRR, abs/1411.5726.

Anthony Viera and Joanne Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family Medicine, 37:360–3.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Discourse-aware neural extractive text summarization. arXiv preprint arXiv:1910.14142.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Page 8: arXiv:2109.05281v1 [cs.CL] 11 Sep 2021

evaluation If each system is evaluated based onits average score then success comes when the av-erage computed metric correlates closely with theaverage human-ranking In particular we measurethe alignment between metric assigned and humanassigned scores using the Kendall score followingthe work of (Sellam et al 2020)

6 Results

Table 1 presents the results of the COIN-basedstudy The last row reports the Kendall correla-tion coefficient between the scores assigned by themetric and humans

All N-gram based metrics such as BLEU andCIDEr fail to adapt to the out-of-domain ground-truth captions from COIN This results in a rela-tively flat distribution of system-level scores con-centrated close to 0 and hence low correlation co-efficients CIDEr has a highly negative Kendallrsquosτ which denotes a strong negative associationwith human judgements This is partly due to low(sim001) and hence noisy CIDEr scores (Figure 4provides example cases that illustrate this argu-ment)

Embedding-based methods BLEURT andBERTScore do not suffer from this limitation re-sulting in more meaningful scoring of systems andhence higher correlation with human scores How-ever by design both these metrics are agnostic tocoherence-labels and the input image COSMicwhich is coherence-aware obtains the highest cor-relation with human scores COSMic ViLBERThas the highest Kendallrsquos correlation among all ofour models COSMic Vanilla performs the sec-ond best among our models and it performs betterthan the rest of the models in terms of Kendallrsquoscorrelation

Data Augmentation The raw RaCCoon trainingdata has a coherence-level bias as demonstrated bythe average COSMic score for each class mdash Visi-ble (0622) Meta (0459) Subjective (0236) andStory (0397) This reflects the human annotatorsrsquobias towards liking Visible captions the most andSubjective captions the least which is expectedHowever training COSMic on this data injects thesame coherence-bias into the model which is un-desirable As presented in Table 1 both flavors ofCOSMic (without the lsquo+rsquo) assign high scores toVisible captioning systems

To mitigate this issue we algorithmically aug-ment the training data to bring the average scoresfor each coherence class to comparable values Weachieve this by pairing images with random cap-tions from the coherence class and assigning thema score of 0 This is a valid training sample becausethe randomly sampled caption does not describe thesaid image and serves as a negative sample Withthese operations the class bias is significantly re-duced mdash Visible (0459) Meta (0439) Subjective(0328) and Story (0425) The COSMic columnsin Table 1 with lsquo+rsquo denote that this data augmen-tation approach improves ranking of captioningsystems leading to better alignment with humanjudgements

Ablation Study Table 2 reports the perfor-mance of COSMic Vanilla without coherence-labels andor the image as model inputs We findthat removal of image features affects COSMicrsquosperformance showing the important contributionof images The performance deteriorates signifi-cantly when the coherence-labels are removed fromthe model (No rc gc column in Table 2) Thisdemonstrates that COSMic successfully integratescoherence-relations in the caption scoring process

Reference two men in scrubs per-forming surgery

mountains in front of aclear blue sky

large brick building next toa green lawn and big trees

a foggy forest

Generated surgeons operating on apatient

mountain range as seenfrom the trail

the front of the house light shining throughthe trees

Figure 4 Illustration of COIN reference captions and corresponding outputs of the Base-Visible model Thoughthe generated captions are correct an n-gram based metric such as CIDEr assigns them a very low score due to thevariations in surface word realizations See Table 1 for average scores over the test set (Photo credits from left toright US Army Africa Gabriel Fr James Bradley Rosmarie Voegtli)

System COSMic

Model CohLabel

Full No I No c No I amp c

Base

Visible 516 447 434 442Meta 505 439 442 453Subj 356 347 438 453Story 505 433 436 445

Lite

Visible 515 444 434 433Meta 511 434 447 464Subj 379 367 440 459Story 499 440 433 442

KendallrsquosCorr (τ ) 667 546 -222 -415

Table 2 Ablation experiment results No I repre-sents COSMic Vanilla without image features Norc gc represents COSMic Vanilla without coherencelabel embeddings finally No I amp No rc gc repre-sents COSMic Vanilla without coherence label embed-dings and without image features

7 Conclusion

Our work is the first step towards designing genera-tion metrics that respect the information goal of thegenerated text We observe that a small set of ex-amples annotated with coherence relations can pro-vide what is needed for learning a discourse-awaregeneration metric Our findings have implicationsfor designing context-aware multimodal metricswith criteria that are closer to human ratings forevaluating machine-generated multimodal content

We have called attention to the challenge oflearning robust generation metrics that can eval-uate the output of the generation models consid-ering the information goals Our findings sug-gest that fine-tuning ViLBERTmdashoriginally trainedwith millions of imagesmdashwith a smaller sample ofcoherence relations and expert-annotated scoringautomated metrics can score generated captionscloser to a human rating The presented datasetprovides the opportunity for future research in thearea of image description generation designingdiscourse-aware metrics and multimodal contentevaluation We hope that coherence-aware text gen-eration metrics could be used for learning bettergeneration models (such as abstractive summariza-tion or story generation) and could be deployeddirectly in machine learning pipelines to help inoptimizing hyper-parameters Ultimately it is in-tended to have a generalizable model that can usea labeling mechanismmdashnot restricted to coherencelabelsmdash to improve applicability of generation met-rics in different tasks

8 Ethics

This paper describes a research prototype We donot work with sensitive or personal data Our pro-tocol was approved by our ethics board Humansubjects participated voluntarily undertook min-imal risk and were compensated fairly for theirtime The dataset we produced is fully anonymizedSubjects consented to the distribution of their dataas part of their participation in the research Tech-nologists should think carefully before deployingour ideas in production Our work depends onpretrained models such as word and image embed-dings These models are known to reproduce andeven magnify societal bias present in training dataMoreover like many ML NLP methods our meth-ods are likely to perform better for content thatis better represented in training leading to furtherbias against marginalized groups We can hope thatgeneral methods to mitigate harms from ML biascan address these issues

A distinctive complication of our work is the factthat many imagendashtext presentations involve writ-ers expressing subjective opinions By its natureour evaluation metric assesses such subjective textsbased on averages and trends across many userswhich may be problematic Although such judg-ments are ultimately matters of personal taste theyare nevertheless often grounds by which hierarchiesof differences are culturally encoded and enforcedThus a deployed subjective-caption generation sys-tem could well be unfair to users especially if thoseusers are not confident in their own taste or criticaltowards the systemrsquos responses Our evaluationmetric is not sensitive to such harms

Acknowledgements

The authors affiliated with Rutgers University werepartly supported by NSF Award CCF-19349243Thanks to Pitt Cyber for supporting this project andthe authors from the University of Pittsburgh Wealso acknowledge the Center for Research Comput-ing at the University of Pittsburgh for providing therequired computational resources for carrying outexperiments at the University of Pittsburgh

ReferencesMartiacuten Abadi Ashish Agarwal Paul Barham Eugene

Brevdo Zhifeng Chen Craig Citro Greg S CorradoAndy Davis Jeffrey Dean Matthieu Devin SanjayGhemawat Ian Goodfellow Andrew Harp Geoffrey

Irving Michael Isard Yangqing Jia Rafal Jozefow-icz Lukasz Kaiser Manjunath Kudlur Josh Leven-berg Dandelion Maneacute Rajat Monga Sherry MooreDerek Murray Chris Olah Mike Schuster JonathonShlens Benoit Steiner Ilya Sutskever Kunal TalwarPaul Tucker Vincent Vanhoucke Vijay VasudevanFernanda Vieacutegas Oriol Vinyals Pete Warden Mar-tin Wattenberg Martin Wicke Yuan Yu and Xiao-qiang Zheng 2015 TensorFlow Large-scale ma-chine learning on heterogeneous systems Softwareavailable from tensorfloworg

Abien Fred Agarap 2018 Deep learning using recti-fied linear units (relu) CoRR abs180308375

Malihe Alikhani Piyush Sharma Shengjie Li RaduSoricut and Matthew Stone 2020 Cross-modal co-herence modeling for caption generation In Pro-ceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics pages 6525ndash6535 Online Association for Computational Lin-guistics

Peter Anderson Basura Fernando Mark Johnsonand Stephen Gould 2016a SPICE semanticpropositional image caption evaluation CoRRabs160708822

Peter Anderson Basura Fernando Mark Johnson andStephen Gould 2016b Spice Semantic propo-sitional image caption evaluation In EuropeanConference on Computer Vision pages 382ndash398Springer

Peter Anderson Xiaodong He Chris Buehler DamienTeney Mark Johnson Stephen Gould and LeiZhang 2018 Bottom-up and top-down attention forimage captioning and visual question answering InProceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR)

Xinlei Chen Hao Fang Tsung-Yi Lin Ramakr-ishna Vedantam Saurabh Gupta Piotr Dollar andC Lawrence Zitnick 2015 Microsoft coco cap-tions Data collection and evaluation server

Yin Cui Guandao Yang Andreas Veit Xun Huangand Serge Belongie 2018 Learning to evaluate im-age captioning In Proceedings of the IEEE con-ference on computer vision and pattern recognitionpages 5804ndash5812

Samuel Cumming Gabriel Greenberg and Rory Kelly2017 Conventions of viewpoint coherence in filmPhilosophersrsquo Imprint 17(1)1ndash29

J Deng W Dong R Socher L-J Li K Li and L Fei-Fei 2009 ImageNet A Large-Scale HierarchicalImage Database In CVPR09

Michael Denkowski and Alon Lavie 2014 Meteor uni-versal Language specific translation evaluation forany target language In Proceedings of the EACL2014 Workshop on Statistical Machine Translation

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Haoyuan Gao Junhua Mao Jie Zhou Zhiheng HuangLei Wang and Wei Xu 2015 Are you talking to amachine dataset and methods for multilingual im-age question In Advances in Neural InformationProcessing Systems pages 2296ndash2304

Yifan Gao Chien-Sheng Wu Jingjing Li Shafiq JotySteven CH Hoi Caiming Xiong Irwin King andMichael Lyu 2020 Discern Discourse-aware en-tailment reasoning network for conversational ma-chine reading In Proceedings of the 2020 Confer-ence on Empirical Methods in Natural LanguageProcessing (EMNLP) pages 2439ndash2449 Online As-sociation for Computational Linguistics

Yinuo Guo and Junfeng Hu 2019 Meteor++ 20Adopt syntactic level paraphrase knowledge into ma-chine translation evaluation In Proceedings of theFourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 501ndash506 Flo-rence Italy Association for Computational Linguis-tics

Kaiming He Xiangyu Zhang Shaoqing Ren and JianSun 2015 Deep residual learning for image recog-nition CoRR abs151203385

Tuomo Hiippala Malihe Alikhani Jonas HaverinenTimo Kalliokoski Evanfiya Logacheva SerafinaOrekhova Aino Tuomainen Matthew Stone andJohn A Bateman 2021 AI2D-RST a multimodalcorpus of 1000 primary school science diagramsLang Resour Evaluation 55(3)661ndash688

Jerry R Hobbs 1985 On the coherence and structureof discourse

Xinyue Huang and Adriana Kovashka 2016 Inferringvisual persuasion via body language setting anddeep features In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern RecognitionWorkshops pages 73ndash79

M G Kendall 1938 A new measure of rank correla-tion Biometrika 30(12)81ndash93

Diederik P Kingma and Jimmy Ba 2017 Adam Amethod for stochastic optimization

Ranjay Krishna Yuke Zhu Oliver Groth Justin John-son Kenji Hata Joshua Kravitz Stephanie ChenYannis Kalantidis Li-Jia Li David A ShammaMichael Bernstein and Li Fei-Fei 2016 Visualgenome Connecting language and vision usingcrowdsourced dense image annotations

Julia Kruk Jonah Lubin Karan Sikka Xiao Lin DanJurafsky and Ajay Divakaran 2019a Integratingtext and image Determining multimodal documentintent in Instagram posts In Proceedings of the2019 Conference on Empirical Methods in Natu-ral Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 4622ndash4632 Hong KongChina Association for Computational Linguistics

Julia Kruk Jonah Lubin Karan Sikka Xiao Lin DanJurafsky and Ajay Divakaran 2019b Integrat-ing text and image Determining multimodal doc-ument intent in instagram posts arXiv preprintarXiv190409073

Alina Kuznetsova Hassan Rom Neil Alldrin JasperUijlings Ivan Krasin Jordi Pont-Tuset ShahabKamali Stefan Popov Matteo Malloci AlexanderKolesnikov and et al 2020a The open imagesdataset v4 International Journal of Computer Vi-sion 128(7)1956ndash1981

Alina Kuznetsova Hassan Rom Neil Alldrin JasperUijlings Ivan Krasin Jordi Pont-Tuset ShahabKamali Stefan Popov Matteo Malloci AlexanderKolesnikov et al 2020b The open images datasetv4 International Journal of Computer Vision pages1ndash26

Alex Lascarides and Matthew Stone 2009 A formalsemantic analysis of gesture Journal of Semantics26(4)393ndash449

Chin-Yew Lin 2004 ROUGE A package for auto-matic evaluation of summaries In Text Summariza-tion Branches Out pages 74ndash81 Barcelona SpainAssociation for Computational Linguistics

Jiasen Lu Dhruv Batra Devi Parikh and StefanLee 2019 Vilbert Pretraining task-agnostic visi-olinguistic representations for vision-and-languagetasks In Advances in Neural Information Process-ing Systems volume 32 Curran Associates Inc

Pranava Madhyastha Josiah Wang and Lucia Specia2019 VIFIDEL Evaluating the visual fidelity ofimage descriptions In Proceedings of the 57th An-nual Meeting of the Association for ComputationalLinguistics pages 6539ndash6550 Florence Italy Asso-ciation for Computational Linguistics

William C Mann and Sandra A Thompson 1987Rhetorical structure theory A theory of text orga-nization University of Southern California Infor-mation Sciences Institute Los Angeles

Scott McCloud 1993 Understanding comics The in-visible art William Morrow

Edwin G Ng Bo Pang Piyush Sharma and RaduSoricut 2020 Understanding guided image cap-tioning performance across domains arXiv preprintarXiv201202339

Christian Otto Matthias Springstein Avishek Anandand Ralph Ewerth 2019 Understanding catego-rizing and predicting semantic image-text relationsIn Proceedings of the 2019 on International Con-ference on Multimedia Retrieval pages 168ndash176ACM

Kishore Papineni Salim Roukos Todd Ward and Weijing Zhu 2002 Bleu a method for automatic evalu-ation of machine translation pages 311ndash318

Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Advances in Neural Information Pro-cessing Systems 32 pages 8024ndash8035 Curran Asso-ciates Inc

Rashmi Prasad Nikhil Dinesh Alan Lee Eleni Milt-sakaki Livio Robaldo Aravind Joshi and Bon-nie Webber 2008a The Penn Discourse Tree-Bank 20 In Proceedings of the Sixth Interna-tional Conference on Language Resources and Eval-uation (LRECrsquo08) Marrakech Morocco EuropeanLanguage Resources Association (ELRA)

Rashmi Prasad Nikhil Dinesh Alan Lee Eleni Milt-sakaki Livio Robaldo Aravind K Joshi and Bon-nie L Webber 2008b The Penn discourse treebank20 In LREC Citeseer

J Pustejovsky and N Krishnaswamy 2020 Situatedmeaning in multimodal dialogue human-robot andhuman-computer interactions

Shaoqing Ren Kaiming He Ross Girshick and JianSun 2016 Faster r-cnn Towards real-time objectdetection with region proposal networks

Deborah Schiffrin 1980 Meta-talk Organizationaland evaluative brackets in discourse SociologicalInquiry 50(3-4)199ndash236

Thibault Sellam Dipanjan Das and Ankur Parikh2020 BLEURT Learning robust metrics for textgeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 7881ndash7892 Online Association for Computa-tional Linguistics

Piyush Sharma Nan Ding Sebastian Goodman andRadu Soricut 2018 Conceptual captions Acleaned hypernymed image alt-text dataset for au-tomatic image captioning In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages2556ndash2565 Melbourne Australia Association forComputational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez Ł ukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems volume 30 pages 5998ndash6008 Cur-ran Associates Inc

Ramakrishna Vedantam C Lawrence Zitnick and DeviParikh 2015 CIDEr Consensus-based image de-scription evaluation In Proceedings of the IEEEconference on computer vision and pattern recogni-tion pages 4566ndash4575

Ramakrishna Vedantam C Lawrence Zitnick andDevi Parikh 2014 Cider Consensus-based imagedescription evaluation CoRR abs14115726

Anthony Viera and Joanne Garrett 2005 Understand-ing interobserver agreement The kappa statisticFamily medicine 37360ndash3

Jiacheng Xu Zhe Gan Yu Cheng and Jingjing Liu2019 Discourse-aware neural extractive text sum-marization arXiv preprint arXiv191014142

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2020 Bertscore Eval-uating text generation with bert In InternationalConference on Learning Representations

Page 9: arXiv:2109.05281v1 [cs.CL] 11 Sep 2021

System COSMic

Model CohLabel

Full No I No c No I amp c

Base

Visible 516 447 434 442Meta 505 439 442 453Subj 356 347 438 453Story 505 433 436 445

Lite

Visible 515 444 434 433Meta 511 434 447 464Subj 379 367 440 459Story 499 440 433 442

KendallrsquosCorr (τ ) 667 546 -222 -415

Table 2 Ablation experiment results No I repre-sents COSMic Vanilla without image features Norc gc represents COSMic Vanilla without coherencelabel embeddings finally No I amp No rc gc repre-sents COSMic Vanilla without coherence label embed-dings and without image features

7 Conclusion

Our work is the first step towards designing genera-tion metrics that respect the information goal of thegenerated text We observe that a small set of ex-amples annotated with coherence relations can pro-vide what is needed for learning a discourse-awaregeneration metric Our findings have implicationsfor designing context-aware multimodal metricswith criteria that are closer to human ratings forevaluating machine-generated multimodal content

We have called attention to the challenge oflearning robust generation metrics that can eval-uate the output of the generation models consid-ering the information goals Our findings sug-gest that fine-tuning ViLBERTmdashoriginally trainedwith millions of imagesmdashwith a smaller sample ofcoherence relations and expert-annotated scoringautomated metrics can score generated captionscloser to a human rating The presented datasetprovides the opportunity for future research in thearea of image description generation designingdiscourse-aware metrics and multimodal contentevaluation We hope that coherence-aware text gen-eration metrics could be used for learning bettergeneration models (such as abstractive summariza-tion or story generation) and could be deployeddirectly in machine learning pipelines to help inoptimizing hyper-parameters Ultimately it is in-tended to have a generalizable model that can usea labeling mechanismmdashnot restricted to coherencelabelsmdash to improve applicability of generation met-rics in different tasks

8 Ethics

This paper describes a research prototype We donot work with sensitive or personal data Our pro-tocol was approved by our ethics board Humansubjects participated voluntarily undertook min-imal risk and were compensated fairly for theirtime The dataset we produced is fully anonymizedSubjects consented to the distribution of their dataas part of their participation in the research Tech-nologists should think carefully before deployingour ideas in production Our work depends onpretrained models such as word and image embed-dings These models are known to reproduce andeven magnify societal bias present in training dataMoreover like many ML NLP methods our meth-ods are likely to perform better for content thatis better represented in training leading to furtherbias against marginalized groups We can hope thatgeneral methods to mitigate harms from ML biascan address these issues

A distinctive complication of our work is the factthat many imagendashtext presentations involve writ-ers expressing subjective opinions By its natureour evaluation metric assesses such subjective textsbased on averages and trends across many userswhich may be problematic Although such judg-ments are ultimately matters of personal taste theyare nevertheless often grounds by which hierarchiesof differences are culturally encoded and enforcedThus a deployed subjective-caption generation sys-tem could well be unfair to users especially if thoseusers are not confident in their own taste or criticaltowards the systemrsquos responses Our evaluationmetric is not sensitive to such harms

Acknowledgements

The authors affiliated with Rutgers University werepartly supported by NSF Award CCF-19349243Thanks to Pitt Cyber for supporting this project andthe authors from the University of Pittsburgh Wealso acknowledge the Center for Research Comput-ing at the University of Pittsburgh for providing therequired computational resources for carrying outexperiments at the University of Pittsburgh

ReferencesMartiacuten Abadi Ashish Agarwal Paul Barham Eugene

Brevdo Zhifeng Chen Craig Citro Greg S CorradoAndy Davis Jeffrey Dean Matthieu Devin SanjayGhemawat Ian Goodfellow Andrew Harp Geoffrey

Irving Michael Isard Yangqing Jia Rafal Jozefow-icz Lukasz Kaiser Manjunath Kudlur Josh Leven-berg Dandelion Maneacute Rajat Monga Sherry MooreDerek Murray Chris Olah Mike Schuster JonathonShlens Benoit Steiner Ilya Sutskever Kunal TalwarPaul Tucker Vincent Vanhoucke Vijay VasudevanFernanda Vieacutegas Oriol Vinyals Pete Warden Mar-tin Wattenberg Martin Wicke Yuan Yu and Xiao-qiang Zheng 2015 TensorFlow Large-scale ma-chine learning on heterogeneous systems Softwareavailable from tensorfloworg

Abien Fred Agarap 2018 Deep learning using recti-fied linear units (relu) CoRR abs180308375

Malihe Alikhani Piyush Sharma Shengjie Li RaduSoricut and Matthew Stone 2020 Cross-modal co-herence modeling for caption generation In Pro-ceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics pages 6525ndash6535 Online Association for Computational Lin-guistics

Peter Anderson Basura Fernando Mark Johnsonand Stephen Gould 2016a SPICE semanticpropositional image caption evaluation CoRRabs160708822

Peter Anderson Basura Fernando Mark Johnson andStephen Gould 2016b Spice Semantic propo-sitional image caption evaluation In EuropeanConference on Computer Vision pages 382ndash398Springer

Peter Anderson Xiaodong He Chris Buehler DamienTeney Mark Johnson Stephen Gould and LeiZhang 2018 Bottom-up and top-down attention forimage captioning and visual question answering InProceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR)

Xinlei Chen Hao Fang Tsung-Yi Lin Ramakr-ishna Vedantam Saurabh Gupta Piotr Dollar andC Lawrence Zitnick 2015 Microsoft coco cap-tions Data collection and evaluation server

Yin Cui Guandao Yang Andreas Veit Xun Huangand Serge Belongie 2018 Learning to evaluate im-age captioning In Proceedings of the IEEE con-ference on computer vision and pattern recognitionpages 5804ndash5812

Samuel Cumming Gabriel Greenberg and Rory Kelly2017 Conventions of viewpoint coherence in filmPhilosophersrsquo Imprint 17(1)1ndash29

J Deng W Dong R Socher L-J Li K Li and L Fei-Fei 2009 ImageNet A Large-Scale HierarchicalImage Database In CVPR09

Michael Denkowski and Alon Lavie 2014 Meteor uni-versal Language specific translation evaluation forany target language In Proceedings of the EACL2014 Workshop on Statistical Machine Translation

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Haoyuan Gao Junhua Mao Jie Zhou Zhiheng HuangLei Wang and Wei Xu 2015 Are you talking to amachine dataset and methods for multilingual im-age question In Advances in Neural InformationProcessing Systems pages 2296ndash2304

Yifan Gao Chien-Sheng Wu Jingjing Li Shafiq JotySteven CH Hoi Caiming Xiong Irwin King andMichael Lyu 2020 Discern Discourse-aware en-tailment reasoning network for conversational ma-chine reading In Proceedings of the 2020 Confer-ence on Empirical Methods in Natural LanguageProcessing (EMNLP) pages 2439ndash2449 Online As-sociation for Computational Linguistics

Yinuo Guo and Junfeng Hu 2019 Meteor++ 20Adopt syntactic level paraphrase knowledge into ma-chine translation evaluation In Proceedings of theFourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 501ndash506 Flo-rence Italy Association for Computational Linguis-tics

Kaiming He Xiangyu Zhang Shaoqing Ren and JianSun 2015 Deep residual learning for image recog-nition CoRR abs151203385

Tuomo Hiippala Malihe Alikhani Jonas HaverinenTimo Kalliokoski Evanfiya Logacheva SerafinaOrekhova Aino Tuomainen Matthew Stone andJohn A Bateman 2021 AI2D-RST a multimodalcorpus of 1000 primary school science diagramsLang Resour Evaluation 55(3)661ndash688

Jerry R Hobbs 1985 On the coherence and structureof discourse

Xinyue Huang and Adriana Kovashka 2016 Inferringvisual persuasion via body language setting anddeep features In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern RecognitionWorkshops pages 73ndash79

M G Kendall 1938 A new measure of rank correla-tion Biometrika 30(12)81ndash93

Diederik P Kingma and Jimmy Ba 2017 Adam Amethod for stochastic optimization

Ranjay Krishna Yuke Zhu Oliver Groth Justin John-son Kenji Hata Joshua Kravitz Stephanie ChenYannis Kalantidis Li-Jia Li David A ShammaMichael Bernstein and Li Fei-Fei 2016 Visualgenome Connecting language and vision usingcrowdsourced dense image annotations

Julia Kruk Jonah Lubin Karan Sikka Xiao Lin DanJurafsky and Ajay Divakaran 2019a Integratingtext and image Determining multimodal documentintent in Instagram posts In Proceedings of the2019 Conference on Empirical Methods in Natu-ral Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) pages 4622ndash4632 Hong KongChina Association for Computational Linguistics

Julia Kruk Jonah Lubin Karan Sikka Xiao Lin DanJurafsky and Ajay Divakaran 2019b Integrat-ing text and image Determining multimodal doc-ument intent in instagram posts arXiv preprintarXiv190409073

Alina Kuznetsova Hassan Rom Neil Alldrin JasperUijlings Ivan Krasin Jordi Pont-Tuset ShahabKamali Stefan Popov Matteo Malloci AlexanderKolesnikov and et al 2020a The open imagesdataset v4 International Journal of Computer Vi-sion 128(7)1956ndash1981

Alina Kuznetsova Hassan Rom Neil Alldrin JasperUijlings Ivan Krasin Jordi Pont-Tuset ShahabKamali Stefan Popov Matteo Malloci AlexanderKolesnikov et al 2020b The open images datasetv4 International Journal of Computer Vision pages1ndash26

Alex Lascarides and Matthew Stone 2009 A formalsemantic analysis of gesture Journal of Semantics26(4)393ndash449

Chin-Yew Lin 2004 ROUGE A package for auto-matic evaluation of summaries In Text Summariza-tion Branches Out pages 74ndash81 Barcelona SpainAssociation for Computational Linguistics

Jiasen Lu Dhruv Batra Devi Parikh and StefanLee 2019 Vilbert Pretraining task-agnostic visi-olinguistic representations for vision-and-languagetasks In Advances in Neural Information Process-ing Systems volume 32 Curran Associates Inc

Pranava Madhyastha Josiah Wang and Lucia Specia2019 VIFIDEL Evaluating the visual fidelity ofimage descriptions In Proceedings of the 57th An-nual Meeting of the Association for ComputationalLinguistics pages 6539ndash6550 Florence Italy Asso-ciation for Computational Linguistics

William C Mann and Sandra A Thompson 1987Rhetorical structure theory A theory of text orga-nization University of Southern California Infor-mation Sciences Institute Los Angeles

Scott McCloud 1993 Understanding comics The in-visible art William Morrow

Edwin G Ng Bo Pang Piyush Sharma and RaduSoricut 2020 Understanding guided image cap-tioning performance across domains arXiv preprintarXiv201202339

Christian Otto Matthias Springstein Avishek Anandand Ralph Ewerth 2019 Understanding catego-rizing and predicting semantic image-text relationsIn Proceedings of the 2019 on International Con-ference on Multimedia Retrieval pages 168ndash176ACM

Kishore Papineni Salim Roukos Todd Ward and Weijing Zhu 2002 Bleu a method for automatic evalu-ation of machine translation pages 311ndash318

Adam Paszke Sam Gross Francisco Massa AdamLerer James Bradbury Gregory Chanan TrevorKilleen Zeming Lin Natalia Gimelshein LucaAntiga Alban Desmaison Andreas Kopf EdwardYang Zachary DeVito Martin Raison Alykhan Te-jani Sasank Chilamkurthy Benoit Steiner Lu FangJunjie Bai and Soumith Chintala 2019 PytorchAn imperative style high-performance deep learn-ing library In Advances in Neural Information Pro-cessing Systems 32 pages 8024ndash8035 Curran Asso-ciates Inc

Rashmi Prasad Nikhil Dinesh Alan Lee Eleni Milt-sakaki Livio Robaldo Aravind Joshi and Bon-nie Webber 2008a The Penn Discourse Tree-Bank 20 In Proceedings of the Sixth Interna-tional Conference on Language Resources and Eval-uation (LRECrsquo08) Marrakech Morocco EuropeanLanguage Resources Association (ELRA)

Rashmi Prasad Nikhil Dinesh Alan Lee Eleni Milt-sakaki Livio Robaldo Aravind K Joshi and Bon-nie L Webber 2008b The Penn discourse treebank20 In LREC Citeseer

J Pustejovsky and N Krishnaswamy 2020 Situatedmeaning in multimodal dialogue human-robot andhuman-computer interactions

Shaoqing Ren Kaiming He Ross Girshick and JianSun 2016 Faster r-cnn Towards real-time objectdetection with region proposal networks

Deborah Schiffrin 1980 Meta-talk Organizationaland evaluative brackets in discourse SociologicalInquiry 50(3-4)199ndash236

Thibault Sellam Dipanjan Das and Ankur Parikh2020 BLEURT Learning robust metrics for textgeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 7881ndash7892 Online Association for Computa-tional Linguistics

Piyush Sharma Nan Ding Sebastian Goodman andRadu Soricut 2018 Conceptual captions Acleaned hypernymed image alt-text dataset for au-tomatic image captioning In Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1 Long Papers) pages2556ndash2565 Melbourne Australia Association forComputational Linguistics

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez Ł ukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems volume 30 pages 5998ndash6008 Cur-ran Associates Inc

Ramakrishna Vedantam C Lawrence Zitnick and DeviParikh 2015 CIDEr Consensus-based image de-scription evaluation In Proceedings of the IEEEconference on computer vision and pattern recogni-tion pages 4566ndash4575

Ramakrishna Vedantam C Lawrence Zitnick andDevi Parikh 2014 Cider Consensus-based imagedescription evaluation CoRR abs14115726

Anthony Viera and Joanne Garrett 2005 Understand-ing interobserver agreement The kappa statisticFamily medicine 37360ndash3

Jiacheng Xu Zhe Gan Yu Cheng and Jingjing Liu2019 Discourse-aware neural extractive text sum-marization arXiv preprint arXiv191014142

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2020 Bertscore Eval-uating text generation with bert In InternationalConference on Learning Representations

Page 10: arXiv:2109.05281v1 [cs.CL] 11 Sep 2021

Irving Michael Isard Yangqing Jia Rafal Jozefow-icz Lukasz Kaiser Manjunath Kudlur Josh Leven-berg Dandelion Maneacute Rajat Monga Sherry MooreDerek Murray Chris Olah Mike Schuster JonathonShlens Benoit Steiner Ilya Sutskever Kunal TalwarPaul Tucker Vincent Vanhoucke Vijay VasudevanFernanda Vieacutegas Oriol Vinyals Pete Warden Mar-tin Wattenberg Martin Wicke Yuan Yu and Xiao-qiang Zheng 2015 TensorFlow Large-scale ma-chine learning on heterogeneous systems Softwareavailable from tensorfloworg

Abien Fred Agarap 2018 Deep learning using recti-fied linear units (relu) CoRR abs180308375

Malihe Alikhani Piyush Sharma Shengjie Li RaduSoricut and Matthew Stone 2020 Cross-modal co-herence modeling for caption generation In Pro-ceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics pages 6525ndash6535 Online Association for Computational Lin-guistics

Peter Anderson Basura Fernando Mark Johnsonand Stephen Gould 2016a SPICE semanticpropositional image caption evaluation CoRRabs160708822

Peter Anderson Basura Fernando Mark Johnson andStephen Gould 2016b Spice Semantic propo-sitional image caption evaluation In EuropeanConference on Computer Vision pages 382ndash398Springer

Peter Anderson Xiaodong He Chris Buehler DamienTeney Mark Johnson Stephen Gould and LeiZhang 2018 Bottom-up and top-down attention forimage captioning and visual question answering InProceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR)

Xinlei Chen Hao Fang Tsung-Yi Lin Ramakr-ishna Vedantam Saurabh Gupta Piotr Dollar andC Lawrence Zitnick 2015 Microsoft coco cap-tions Data collection and evaluation server

Yin Cui Guandao Yang Andreas Veit Xun Huangand Serge Belongie 2018 Learning to evaluate im-age captioning In Proceedings of the IEEE con-ference on computer vision and pattern recognitionpages 5804ndash5812

Samuel Cumming Gabriel Greenberg and Rory Kelly2017 Conventions of viewpoint coherence in filmPhilosophersrsquo Imprint 17(1)1ndash29

J Deng W Dong R Socher L-J Li K Li and L Fei-Fei 2009 ImageNet A Large-Scale HierarchicalImage Database In CVPR09

Michael Denkowski and Alon Lavie 2014 Meteor uni-versal Language specific translation evaluation forany target language In Proceedings of the EACL2014 Workshop on Statistical Machine Translation

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Haoyuan Gao Junhua Mao Jie Zhou Zhiheng HuangLei Wang and Wei Xu 2015 Are you talking to amachine dataset and methods for multilingual im-age question In Advances in Neural InformationProcessing Systems pages 2296ndash2304

Yifan Gao Chien-Sheng Wu Jingjing Li Shafiq JotySteven CH Hoi Caiming Xiong Irwin King andMichael Lyu 2020 Discern Discourse-aware en-tailment reasoning network for conversational ma-chine reading In Proceedings of the 2020 Confer-ence on Empirical Methods in Natural LanguageProcessing (EMNLP) pages 2439ndash2449 Online As-sociation for Computational Linguistics

Yinuo Guo and Junfeng Hu 2019 Meteor++ 20Adopt syntactic level paraphrase knowledge into ma-chine translation evaluation In Proceedings of theFourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 501ndash506 Flo-rence Italy Association for Computational Linguis-tics

Kaiming He Xiangyu Zhang Shaoqing Ren and JianSun 2015 Deep residual learning for image recog-nition CoRR abs151203385

Tuomo Hiippala Malihe Alikhani Jonas HaverinenTimo Kalliokoski Evanfiya Logacheva SerafinaOrekhova Aino Tuomainen Matthew Stone andJohn A Bateman 2021 AI2D-RST a multimodalcorpus of 1000 primary school science diagramsLang Resour Evaluation 55(3)661ndash688

Jerry R Hobbs 1985 On the coherence and structureof discourse

Xinyue Huang and Adriana Kovashka 2016 Inferringvisual persuasion via body language setting anddeep features In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern RecognitionWorkshops pages 73ndash79

M G Kendall 1938 A new measure of rank correla-tion Biometrika 30(12)81ndash93

Diederik P. Kingma and Jimmy Ba. 2017. Adam: A method for stochastic optimization.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting language and vision using crowdsourced dense image annotations.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019a. Integrating text and image: Determining multimodal document intent in Instagram posts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4622–4632, Hong Kong, China. Association for Computational Linguistics.

Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, and Ajay Divakaran. 2019b. Integrating text and image: Determining multimodal document intent in Instagram posts. arXiv preprint arXiv:1904.09073.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020a. The Open Images Dataset V4. International Journal of Computer Vision, 128(7):1956–1981.

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020b. The Open Images Dataset V4. International Journal of Computer Vision, pages 1–26.

Alex Lascarides and Matthew Stone. 2009. A formal semantic analysis of gesture. Journal of Semantics, 26(4):393–449.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6539–6550, Florence, Italy. Association for Computational Linguistics.

William C. Mann and Sandra A. Thompson. 1987. Rhetorical Structure Theory: A theory of text organization. University of Southern California, Information Sciences Institute, Los Angeles.

Scott McCloud. 1993. Understanding Comics: The Invisible Art. William Morrow.

Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. 2020. Understanding guided image captioning performance across domains. arXiv preprint arXiv:2012.02339.

Christian Otto, Matthias Springstein, Avishek Anand, and Ralph Ewerth. 2019. Understanding, categorizing and predicting semantic image-text relations. In Proceedings of the 2019 International Conference on Multimedia Retrieval, pages 168–176. ACM.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation, pages 311–318.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008a. The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008b. The Penn Discourse TreeBank 2.0. In LREC. Citeseer.

J. Pustejovsky and N. Krishnaswamy. 2020. Situated meaning in multimodal dialogue: Human-robot and human-computer interactions.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks.

Deborah Schiffrin. 1980. Meta-talk: Organizational and evaluative brackets in discourse. Sociological Inquiry, 50(3-4):199–236.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2014. CIDEr: Consensus-based image description evaluation. CoRR, abs/1411.5726.

Anthony Viera and Joanne Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family Medicine, 37:360–363.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Discourse-aware neural extractive text summarization. arXiv preprint arXiv:1910.14142.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
