

NAACL HLT 2018

Storytelling

Proceedings of the First Workshop

June 5, 2018
New Orleans, Louisiana


©2018 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: [email protected]

ISBN 978-1-948087-23-0



Introduction

Welcome to the first *ACL workshop on Storytelling!

Human storytelling has existed for as far back as we can trace, predating writing. Humans have used stories for entertainment, education, cultural preservation; to convey experiences, history, lessons, morals; and to share the human experience.

Part of grounding artificial intelligence work in human experience can involve the generation, understanding, and sharing of stories. This workshop highlights the diverse work being done in storytelling and AI across different fields.

Papers at this workshop are multi-disciplinary, including work on neural, pipeline, and linguistic approaches to understanding and creating stories.

We are also pleased to host a Visual Storytelling challenge, highlighting different methods for automatically generating stories given a set of images; and an invited talk from Nasrin Mostafazadeh on communicating about events through storytelling.

Enjoy the workshop!

Workshop website: http://www.visionandlanguage.net/workshop2018



Storytelling Challenge

The Storytelling challenge provides teams with the VIST dataset to generate stories from sequences of five images, and is hosted on EvalAI. Submissions for the challenge at NAACL 2018 were evaluated using both automatic metrics and human evaluation. The winner was chosen based on the best performance across the human evaluations for focus, structure and coherence, detail, how visually grounded the stories were, how shareable the stories were, and whether they sounded like they were written by a human. More details can be found on the workshop website.

Submissions came from the following teams from around the world.

DG-DLMX
Diana González-Rico, Gibran Fuentes-Pineda
Institute for Research in Applied Mathematics and Systems (IIMAS), Universidad Nacional Autónoma de México (UNAM)

NLPSA 501
Chao-Chun Hsu, Szu-Min Chen, Ming-Hsun Hsieh, Lun-Wei Ku
Academia Sinica, Taiwan

UCSB-NLP
Xin Wang, Wenhu Chen, Yuan-Fang Wang, and William Yang Wang
University of California, Santa Barbara, USA

SnuBiVtt
Min-Oh Heo, Taehyeong Kim, Kyung-Wha Park, Seonil Son, Byoung-Tak Zhang
Seoul National University

More details on the systems and winning team will be provided at the workshop, and made available on the website after the workshop!

Challenge website: http://www.visionandlanguage.net/workshop2018/#challenge



Organizers:

Margaret Mitchell, Google Research
Ting-Hao ‘Kenneth’ Huang, Carnegie Mellon University
Francis Ferraro, University of Maryland, Baltimore County
Ishan Misra, Carnegie Mellon University

Program Committee:

Elizabeth Clark
David Elson
Marjan Ghazvininejad
Andrew Gordon
Gunhee Kim
Boyang Li
Stephanie Lukin
Joao Magalhaes
Lara Martin
Saif Mohammad
Nasrin Mostafazadeh
Mark Riedl
Luowei Zhou



Invited Speaker:
Nasrin Mostafazadeh, Elemental Cognition

Event-centric Context Modeling: The Case of Story Comprehension and Story Generation

Building AI systems that can process natural language input, comprehend it, and generate an engaging and contextually relevant output in response has been one of the longest-running goals in AI. In human-human communications, a major trigger of our meaningful communications is “events” and how they cause/enable future events. We often communicate through telling stories in the form of related series of events.

In this talk, I present my work on language processing in terms of events and how they interact with each other in time. Mainly through the lens of storytelling, I will focus on story comprehension and collaborative story generation, with a major emphasis on commonsense reasoning and narrative knowledge as showcased in the Story Cloze Test framework. Through different use cases, I will highlight the importance of establishing a contentful context and modeling multimodal contexts (such as visual and textual) in various AI tasks.



Table of Contents

Learning to Listen: Critically Considering the Role of AI in Human Storytelling and Character Creation
Anna Kasunic and Geoff Kaufman . . . . . . . . . . . . 1

Linguistic Features of Helpfulness in Automated Support for Creative Writing
Melissa Roemmele and Andrew Gordon . . . . . . . . . . . . 14

A Pipeline for Creative Visual Storytelling
Stephanie Lukin, Reginald Hobbs and Clare Voss . . . . . . . . . . . . 20

Telling Stories with Soundtracks: An Empirical Analysis of Music in Film
Jon Gillick and David Bamman . . . . . . . . . . . . 33

Towards Controllable Story Generation
Nanyun Peng, Marjan Ghazvininejad, Jonathan May and Kevin Knight . . . . . . . . . . . . 43

An Encoder-decoder Approach to Predicting Causal Relations in Stories
Melissa Roemmele and Andrew Gordon . . . . . . . . . . . . 50

Neural Event Extraction from Movies Description
Alex Tozzo, Dejan Jovanovic and Mohamed Amer . . . . . . . . . . . . 60



Conference Program

Tuesday, June 5, 2018

10:00–10:10 Opening Remarks

10:15–11:00 Invited Talk by Nasrin Mostafazadeh

11:00–11:25 Learning to Listen: Critically Considering the Role of AI in Human Storytelling and Character Creation
Anna Kasunic and Geoff Kaufman

11:25–11:50 Linguistic Features of Helpfulness in Automated Support for Creative Writing
Melissa Roemmele and Andrew Gordon

11:50–13:45 Lunch

13:45–14:10 A Pipeline for Creative Visual Storytelling
Stephanie Lukin, Reginald Hobbs and Clare Voss

14:10–14:35 Telling Stories with Soundtracks: An Empirical Analysis of Music in Film
Jon Gillick and David Bamman

14:35–15:00 Towards Controllable Story Generation
Nanyun Peng, Marjan Ghazvininejad, Jonathan May and Kevin Knight

15:00–15:20 Break

15:20–16:00 Storytelling Challenge

16:00–16:25 An Encoder-decoder Approach to Predicting Causal Relations in Stories
Melissa Roemmele and Andrew Gordon

16:25–16:50 Neural Event Extraction from Movies Description
Alex Tozzo, Dejan Jovanovic and Mohamed Amer



Tuesday, June 5, 2018 (continued)

16:50–17:00 Closing Remarks



Proceedings of the First Workshop on Storytelling, pages 1–13, New Orleans, Louisiana, June 5, 2018. ©2018 Association for Computational Linguistics

Learning to Listen: Critically Considering the Role of AI in Human Storytelling and Character Creation

Anna Kasunic
Carnegie Mellon University
Human-Computer Interaction Institute
Pittsburgh, PA USA
[email protected]

Geoff Kaufman
Carnegie Mellon University
Human-Computer Interaction Institute
Pittsburgh, PA USA
[email protected]

Abstract

In this opinion piece, we argue that there is a need for alternative design directions to complement existing AI efforts in narrative and character generation and algorithm development. To make our argument, we a) outline the predominant roles and goals of AI research in storytelling; b) present existing discourse on the benefits and harms of narratives; and c) highlight the pain points in character creation revealed by semi-structured interviews we conducted with 14 individuals deeply involved in some form of character creation. We conclude by proffering several specific design avenues that we believe can seed fruitful research collaborations. In our vision, AI collaborates with humans during creative processes and narrative generation, helps amplify voices and perspectives that are currently marginalized or misrepresented, and engenders experiences of narrative that support spectatorship and listening roles.

1 Introduction

Somebody gets into trouble. Gets out of it again. People love that story! (Vonnegut, 1970)

Once upon a time, people decided it was not enough to share stories; as humans are as much a curious animal as we are a storytelling one, we sought to study them. Theories of narratology, or the study of narratives, define and break down narratives according to their distinct states of action, events, and elements (Prince, 1974; Bal, 2009), and (for the most part) narratologists agree that to constitute a narrative, a text must tell a story, exist in a world, be situated in time, include intelligent agents, and have some form of causal chain of events. Narratives also usually seek to convey something meaningful to an audience (Ryan et al., 2007). (Note: In this paper, we are somewhat relaxed with the terms “story” and “narrative,” and will often use the two interchangeably). With the advancement of artificial intelligence (AI), research related to narratives has taken on a whole new shape and meaning; not only are there new narrative forms ripe for the studying, including more examples of stories that are interactive and branching (Ryan and Rebreyend, 2013), but there is also a whole field of research devoted to improving AI storytelling capabilities.

Although research that teaches AI to tell better stories is challenging and intriguing on many levels, research and discourse in domains like psychology, literary fiction, and even social media suggest that a) humans benefit from telling stories, so AI might serve us better if it nurtures our storytelling predilections rather than act solely as storyteller, and that b) after all these years, we as humans still have a lot to learn and improve when it comes to telling narratives. Our stories brim with issues of representation, bias, and authenticity that can dilute or negate the beneficial powers of consuming and interacting with narratives, and our social structures promote certain stories while silencing others. If we’re not careful, AI storytellers will only inherit and exacerbate these problematic patterns. In what follows, we present an opinion piece that urges researchers at the intersections of AI, natural language processing (NLP), human-computer interaction (HCI), and storytelling to envision different futures for research related to AI and storytelling. In this paper, to be clear, we do not strive to present precise quantitative or qualitative conclusions or recommendations, nor are we exhaustive in our presentation of extant storytelling research, discourse, and innovations. Rather, we seek to spark dialogues, questions, and cross-discipline exchanges around the role of AI in storytelling. To this end, we discuss existing conversations and research from human-computer interaction, AI, and other domains concerned with storytelling; we provide anecdotal highlights from interviews we conducted with character creators; and we illuminate alternative, promising directions for AI storytelling research. We begin by outlining some of the predominant roles that AI research in storytelling has held in order to provide context for our discussion. As a framing note, readers should be aware that we will use the term AI to encompass different forms of computational approaches that may not fall within everyone’s defined scope of AI. Whatever their specific disciplines or perspectives, we humbly request that our readers relax their definitions of AI for the duration of this paper to include computational approaches, more broadly.

2 Predominant Roles of AI in Narratives

We can categorize AI work in storytelling in three ways: 1) teaching AI to generate and understand stories; 2) helping human storytellers as a co-creator; and 3) modeling story elements. Some of the earliest work in storytelling and AI focused on improving AI’s understanding of stories using scripts, or “boring little stories” in the words of the authors (Schank et al., 1975; Schank and Abelson, 1975). As technology advanced, more research attention shifted to using AI to generate stories. For example, the hypertext fiction model emerged, in which links to branching narratives allowed for branching stories and some level of direct agency over the story (Bolter and Joyce, 1987). Researchers have continued making advances in story generation, developing AI story, world, and character generators that are planning- or event-sequence-based (Fairclough and Cunningham, 2004; Lebowitz, 1987; Porteous and Cavazza, 2009; Riedl and Young, 2010; Young et al., 2004; Barber and Kudenko, 2007; Min et al., 2008). Generators may also be character-centric, in which players’ interactions with intelligent agents move the stories forward (Magerko, 2006; Swartjes and Theune, 2006; Cavazza et al., 2002). This work most commonly aims to create entertainment value, with an emphasis on learning and catering to player preferences (e.g., Thue et al., 2007) and is often framed in terms of applications to interactive narratives, even if the potential applications of the work could extend to narratives in general.

However, the goals of AI research in narratives are not exclusively focused on AI story generation; some work also strives to teach machines how to be more “human” through stories (Huang et al., 2016; Riedl and Harrison, 2016). Although much less common, some prior work has also positioned AI as co-creator, encouraging and guiding humans in creating their own stories (Ryokai and Cassell, 1999; Bers and Cassell, 1998; Van Broeckhoven et al., 2015). We have also seen work in the interactive drama space in which human-allowed actions are more open-ended, positioning humans to act more like actors on a stage than characters that must conform to a limited story world (Mateas and Stern, 2003). In addition, some research in modeling stories has focused not on story generation, but on understanding (and subsequently improving) human experiences of narratives. For example, work in identifying “emotional arcs” looks at mood shifts and audience engagement in experiencing narratives (Chu and Roy, 2017; Reagan, 2017), and other work has endeavored to identify turning points in stories (Ouyang and McKeown, 2015), model the shapes of stories (Mani, 2012; Elson, 2012), and understand the relationships of characters to narrative arcs (Bamman et al., 2014b,a). These works have identified what people enjoy about stories, have learned from our existing stories, and subsequently can augment story generation efforts.

In short, for the (many) story-lovers among us, it’s an exciting time to be working in AI, NLP and HCI. However, as we will discuss in the next section, there can be potential dangers in placing AI in a storytelling role, in aligning AI storytelling research too closely with interactive narratives, and in catering too much to the preferences of the humans engaging in stories and games. We now turn to a discussion of both the benefits and potential harms of storytelling as they relate to extant work on AI in narratives.

3 The Pleasures and Perils of Narratives

3.1 The Complex Pleasures of Narrative

Scholars in philosophy, psychology, anthropology, and related disciplines have characterized storytelling as fundamental to how we as humans grow, learn, develop, and process and experience the world (Jung, 1964; Dautenhahn and Nehaniv, 1998; Sutton-Smith, 1986, 2012; Paley, 2009; Cooper, 1993). As such, our desire to engage in stories and storytelling is not a “frivolous impulse, but a fundamental adaptive response” (Rose, 2012). As an experimental study from the 1940s has shown, we go so far as to ascribe narrative to situations where no narrative form exists (Heider and Simmel, 1944). Although AI research in emotional arcs recommends “happy” endings and seeks to maximize positive moods (Chu and Roy, 2017; Reagan, 2017), other research suggests that the relationship between story enjoyment and emotion is more complex. Research on benign masochism tells us that the human brain can derive pleasure from negative reactions and feelings such as sadness, fear, and disgust (Rozin et al., 2013). For example, sad films can be highly enjoyable, especially for certain groups such as female and younger viewers (Oliver, 1993, 2003; Mares et al., 2008). Similarly, research has found that mixed emotional experiences, such as experiencing both happiness and sadness rather than just one or the other, can be beneficial to one’s physical health (Hershfield et al., 2013). Thus, AI work that focuses on maximizing human enjoyment may overemphasize “sunny” experiences of narratives, and by focusing on pleasure rather than growth, may favor stories with narratives that fail to challenge and aid in human development.

Entertainment through narratives can be a valuable end goal in itself, but it can also have other, attendant advantages. According to transportation theory, we can achieve immersion in narrative worlds through identification with characters and perceptions of plausibility or the “suspension of disbelief,” in which we view narrative worlds and character actions as authentic (Green et al., 2004, 2003; Tesser et al., 2005). Not only does this transportation lead to enjoyment, but it can also enable perspective taking and belief change (Kaufman and Libby, 2012; Berns et al., 2013). It can positively transform us, e.g. leading us to personality growth and maturation (Djikic et al., 2009b), with potentially higher transformative effects on attitudes for those who are resistant to change, or have diminished emotionality (Dal Cin et al., 2004; Djikic et al., 2009a). Reading fiction has been shown to improve the ability to attribute mental states to oneself and others (known as Theory of Mind), an important cognitive foundation for complex social relationships (Kidd and Castano, 2013), and reading narratives can lead us to feel psychologically connected to groups of characters, increasing feelings of belongingness and subsequently leading to greater feelings of satisfaction and more positive mood (Gabriel and Young, 2011).

3.2 On “Listening” Versus “Agentic” Narrative Forms

The jury is still out, however, on which forms of media provoke higher levels of transportation, transformation, and enjoyment. Here, we find it useful to separate “listening” forms of narrative from “agentic” forms of narrative, and we will use these terms throughout the remainder of the paper. We characterize “listening” narratives as positioning the consumer of the narrative in a more passive role, listening or watching the story rather than making direct decisions that define or shape the characters, the plot, or the narrative world. “Listening” narratives would include traditional, typically non-branching narratives such as films, novels, and short stories. We define “agentic” narratives as stories in which the narrative consumer has some level of direct agency over characters, plot, or other story elements. In our definition, “agentic” narratives are akin to interactive narratives, and this aligns with other definitions of interactive narratives. For example, agency in the context of interactive narratives can be said to occur when the world “responds expressively and coherently to our engagement with it” (Murray, 2004). Accordingly, we prefer to stress the idea of agency over interaction because we do not think that “listening” narratives preclude interaction. Indeed, other researchers and designers ostensibly share our view of agency in narratives as nebulously interactive. For example, Persuasive Games has produced experiences that question the notion of how much traditional game definitions of agency truly allow players to interact with and engage in a narrative (Bogost, 2005, 2006), and other researchers have chosen to use the term “agency play” rather than agency to suggest that interactive narratives require more expansive notions of interactivity (Harrell and Zhu, 2009). As we consider the future of narrative expression and consumption, we can consider the possibility of listening narratives that allow narrative consumers to interact without directly making decisions about narrative or character arcs and shapes.

Indeed, studies have found that traditional “listening” forms such as books and movies may be equally or even more engaging and transformational than “agentic” narratives (Oh et al., 2014; Green et al., 2008; Jenkins, 2014). However, other studies have found that enjoyment and identification can be higher in agentic narratives (Hefner et al., 2007; Elson et al., 2014; Hand and Varan, 2008). It may also be that certain types of interaction may have varying costs and benefits. For example, a study allowing for character customization actually decreased narrative engagement and enjoyment, showing that the qualities of a narrative, not character agency, might be more important (Green et al., 2004).

We make no attempt to argue for “listening” over “agentic” forms of engagement with narratives, but we do believe that listening forms warrant more attention. For example, advancements in AI mechanisms for branching narratives needn’t apply to exclusively agentic forms of narrative; we can also think in terms of new listening experiences that are simultaneously branching and non-agentic (listening). A large body of work on branching narratives could be translated into new forms of listening rather than agentic digital media; for example, researchers have shown how their work in story generation techniques like plot graphs can be applied to branching (not necessarily agentic) narratives (Li et al., 2013; Guzdial et al., 2015). However, to our knowledge, there does not exist a wide range of practical applications of advancements in story generation and branching narrative work to new, listening forms of narrative.

Some analogies may be useful here. In describing his affinity for both traditional (listening) and interactive (agentic) narratives, storyteller and SIMS creator Will Wright likens traditional narratives to a roller coaster, and games and interactive narratives to a dirt bike (Rose, 2012). Neither experience is necessarily “better” than the other; whether the driver or the passenger, each has their unique benefits and affordances. Moreover, just as the invention of the roller coaster enabled new forms of “riding”, listening forms of narrative needn’t necessarily be “traditional.” Just as humans enjoy, learn, and develop through social interaction, we also have much to gain by spectatorship such as the popular pastime of “people-watching.” Just as there is value in active thinking, there is also value in meditating (watching our thoughts pass by without directly engaging). We argue that by focusing so much scholarly attention on agentic forms of narrative, we may be missing out on ways to use technology to engender new ways of listening. Technological advancements in branching narratives, for example, could be realized by means of listening narratives rather than agentic narratives; we can consider the potential benefits of narratives that allow a multitude of experiences and paths, without conceding choice or agency to the listener/spectator.

Lastly, although existing efforts in AI narrative generation would suggest that humans benefit primarily from the experience of receiving narratives, the telling of stories can be highly beneficial. They can help us develop resilience (East et al., 2010), provide therapeutic benefits (Block and Leseho, 2005; Carlick and Biley, 2004; Chelf et al., 2000; Pennebaker, 1997), and activate imaginative processes that are key to human growth and development (Harris, 2000). Thus, as with the narrowness of focus on interactive (agentic) rather than listening forms of narrative, prioritizing AI’s role as storyteller misses valuable opportunities for empowering humans as storytellers.

3.3 The Potential Harms of Existing Narratives

In addition, as we consider AI-enhanced storytelling experiences, we need to be mindful that our starting points—the story frames that we use to train AI—may unintentionally marginalize, misrepresent, and altogether exclude many groups. Just as recent work in natural language processing has criticized and sought ways to rectify the amplification of negative biases (e.g. gender biases) in NLP techniques (Dwork et al., 2012; Zhao et al., 2017; Bolukbasi et al., 2016; Voigt et al., 2017), AI storytelling has the potential to amplify and exacerbate issues of bias and diversity, which in turn excludes certain individuals from experiencing the potential benefits of story engagement. For example, AI story generators that learn from existing narrative corpora may learn that straight white male characters are best suited to be protagonists or figures of power, and that genderqueer and people of color should occupy sidekick roles. These stereotypes may persist regardless of the identification of the human author; for example, a study of online fan-fiction found that gendered stereotypes were highly common, and perpetuated by both male and female-identifying authors (Fast et al., 2016). Thus, machine learning models may also learn from patterns of speech and role characterizations that stereotype certain groups, decreasing character authenticity. This is especially worrisome because research demonstrates that narrative persuasion is less effective if people cannot identify with the characters (So and Nabi, 2013; Ritterfeld and Jin, 2006; Slater et al., 2006; Gillig and Murphy, 2016).

Many popular films fail the classic Bechdel test, which simply specifies that the movie must 1) feature at least two women, 2) that these women must talk to each other, and 3) that their conversation must concern something other than a man (Selisker, 2015), and fare even worse on new NLP techniques that assess power differentials between men and women in movies (Sap et al., 2017). Issues of representation in film highlighted by the 2014 and 2015 Oscars, in which all awardees were white, sparked a social media firestorm under the hashtag #OscarsSoWhite (Syed, 2016; Borum Chattoo, 2018), and brought to further light a history of under-representation in film, with only 6.4% of all awardees since 1929 (1,688) being non-white (Berman, 2016). Minority groups are often under-, mis-, or negatively represented in film and other forms of narrative (Okoye, 2016; Smith, 2009; Hooks et al., 2006). In writing communities, gendered violence under the dominance of a “straight male cisgender patriarchy” and exclusion of black and brown writers from major literary publications has spawned a wave of debate and protest about exclusion, marginalization, and the silencing of voices (Tsay et al., 2015; Groom, 2015). Such issues suggest that AI may be more useful to us as an aid that can help identify biases and stereotypes, and amplify muffled voices, rather than a generator that replicates and extends our existing, problematic narratives.
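As a purely illustrative aside (not part of any system described in this paper), the three Bechdel criteria can be made concrete with a toy check over a hypothetical scene representation; the data format, character names, and the scene-level approximation of “talking to each other” are our own simplifications.

```python
# Toy, illustrative sketch of the three Bechdel criteria; the scene format
# (lists of (speaker, gender, topic) turns) is a hypothetical simplification,
# and "talking to each other" is approximated by co-presence in a scene.

def passes_bechdel(scenes):
    """Return True if any scene has at least two women discussing something other than a man."""
    for scene in scenes:
        women = {speaker for speaker, gender, _ in scene if gender == "F"}
        if len(women) < 2:
            continue  # criterion 1 fails for this scene: fewer than two women
        for speaker, gender, topic in scene:
            # criteria 2 and 3: a woman, in conversation with another woman, on a non-man topic
            if gender == "F" and topic != "a man" and women - {speaker}:
                return True
    return False

# Hypothetical usage: the second scene has two women, but they only discuss a man.
movie = [
    [("Alice", "F", "work"), ("Bob", "M", "work")],
    [("Alice", "F", "a man"), ("Carol", "F", "a man")],
]
print(passes_bechdel(movie))  # -> False
```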

In the next section, we cull anecdotal excerpts from a series of interviews we conducted with individuals deeply involved with some form of character creation to reveal existing pain points in humans’ attempts to avoid and address issues of biasing, stereotyping, under-representation, and misrepresentation. Our interviewees’ discussions suggest concrete, specific ways in which AI can aid humans in improving some of the more problematic elements of our existing storytelling efforts.

4 An Exploration of Challenges in Character Creation

The one thing about being a dude and writing from a female perspective is that the baseline is, you suck.

As author Junot Díaz’s quote above (Rosenberg, 2012) points out, creating characters can be an intractable challenge. Unless we are in the rare case of writing a story that is only about the self, with no secondary characters, creating the “Other”—a character who is different from oneself along one or multiple dimensions (Shawl and Ward, 2005)—is inevitable. As humans, we define ourselves along multiple axes of identity, including gender, sexuality, race/ethnicity, class, nationality, health, disability, education, and passions/interests, to name a few; some axes may be especially salient for some individuals, and inconsequential for others. We posit that anxiety and uncertainty about how to authentically and sensitively create characters who are Other can hinder both the creative process and the narrative experience, as inauthentic characters can also impede character identification and subsequent narrative transportation and enjoyment. Understanding the ways in which human character creators approach and grapple with creating characters that are Other can provide insights into where the most crucial needs lie, and how we might design AI systems to assist with rather than model human stories. To this end, we conducted interviews to explore and better understand the space of character creation and its attendant pain points. Below, we present anecdotal highlights of interviews as they pertain to insights into needs for assistive AI; a full presentation of our methodology and qualitative analysis processes can be found elsewhere (more information available upon request).

We conducted qualitative, semi-structured interviews with 14 individuals with deep involvement in character creation, including novelists, short story writers, poets, journalists, television and game writers, actors, directors, and role-playing gamers, game masters, and designers (including both tabletop and live-action role-playing games). These character creators, recruited with the help of professors in relevant departments at our local university, ranged in age from 19 to 62 (average of 45), and held education levels from “some college” to PhD. All spoke English as a primary language, and primarily were born and raised in the U.S. Eight identified as male, five as female, and one as non-binary; 11/14 identified as white, one as black, one as Native American, and one as Asian. For several of our participants, aspects of the narrative creation process constituted their full-time occupation or activity—e.g. writer, videographer, professor of drama or literature, and game designer—whereas others pursued narrative and/or character creation as a passion, hobby or pastime while also holding another occupation, such as secretary, civil servant, human resources coordinator, or student. With IRB approval, we audio-recorded the interviews (each lasting roughly 40–70 minutes), transcribed them, and qualitatively analyzed them using open-coding techniques to identify patterns across our participants.

We asked our participants to describe their processes of creating or embodying their characters, what informs the development of their characters, on what axes they identify with or diverge from their characters, and what conflicts or hesitations they have in creating or embodying certain kinds of characters. These responses validated existing research on the benefits of storytelling as a source of joy and growth. As one participant put it, “you get shaped by these stories that touch you, and by the sources that touch you. And I think you develop greater empathy. I think you become a better human being through that” (p2). The results of our interviews offer key insights about how AI systems can support humans in character creation and storytelling efforts, which we can organize into three themes: 1) the distinct ways in which different participants struggled and dealt with inauthenticity concerns; 2) the pain points participants discussed about giving voice to characters from under-represented groups; and 3) the impact of collaboration on character creation.

4.1 Concerns about Inauthentic Characters

Our participants generally fell into one of two camps when it came to relationships between their characters and their self-identity. Either a) they specifically chose characters they viewed as similar to themselves, operating under the adage, “write what you know” or b) they grounded themselves in notions of universality, seeking to find elements of themselves in characters that were seemingly highly divergent from themselves. Yet both groups expressed feelings of discomfort with and anxiety about representing different viewpoints, suggesting opportunities for AI to assist with authentic character representation.

Participants that fell into group A explained that certain character decisions were outside their comfort zones, e.g. role-playing a character of an opposite gender (p9, p12). Another participant stated, “I’m very careful about running mental illnesses that I don’t and have never have. Because I’m sensitive enough to portrayals of my own, that I kind of don’t want to screw that up.” (p13). Other participants echoed this sentiment of certain stories or portrayals being “not my story to tell” (p4).

Participants in group B felt it was very important to include diverse characters in their stories and games in order to be more inclusive, but many of these participants expressed concern that despite their best efforts, they might be misrepresenting characters that identified in ways differently from themselves. They wanted to remain inside the lines of what they felt, as one participant worded it, “cultural appreciation” rather than “cultural appropriation” (p9). One role-player was initially hesitant to move outside her own identity. She had slowly branched into different ethnicities, genders, and sexualities, but had lingering apprehension regarding whether her portrayals were ethical and authentic, saying “Hopefully I’m not being horribly insulting to anyone of that ethnicity or sexuality while playing them. I hope I’m not. I think I’m not. I think I’m doing it relatively sensitively” (p11). Thus, AI could be helpful in flagging characters that might be cause for concern by perpetuating certain stereotypes or offenses.

4.2 Issues of Representation

We have given examples in this paper of narrative exclusion along lines of gender and race/ethnicity; concerns related to these topics arose often in our interviews, and suggest opportunities for AI to offer additional support. For example, one of our participants, a drama director, said he made a purposeful decision to cast racially diverse actors in his plays (p3).

Yet even among those for whom improving the representation of certain under-represented groups is a priority, there can be conflict over how we should represent such groups. For example, one participant (p1) spoke of the controversies in TV writing communities around what it means to write authentic characters of color. He spoke of a panel he participated in about TV representations of people of color in which many of the panelists were sharply divided on the questions: Is it okay for a character to be universal in identity, such that someone of a different race could conceivably play that character? Or ought characters be steeped in the specifics of their social identities and contexts? Participants also brought up issues of exclusion along axes that are often overlooked. For example, one participant discussed issues with neurotypical privilege, explaining that collaborative storytelling games are often exclusionary because they require players to pool from a relatively common pool of narrative tropes, meanings and interpretations that are not easily accessible to those who are neurodivergent (p8). These concerns about and disagreements on how to authentically and sensitively represent different groups indicate that AI could serve to assist humans in grappling with and reflecting on these issues.

4.3 Impact of Collaboration on Character Creation

Where writers of novels or short stories may be more likely to develop narratives relatively autonomously and in isolation, other media lend themselves to highly collaborative environments, such as role-playing games (where the game master and the role-playing actors interact to shape the narrative), writing for the stage or screen (where writers interface with actors that embody their characters), and video game writing (where it is common for large teams to collaborate). Participants in narrative media with more collaborative development processes spoke enthusiastically of how actors and other characters had reshaped their understandings of their own characters. For example, a participant who is a playwright discussed how interactions with actors often reshaped not just a character, but a whole play (p4). Similarly, a screenwriter-participant (p7) discussed completely revising a major scene after an actress revealed she couldn’t “in her wildest dreams” imagine taking the action assigned to her. The interviewee stated that he often gains invaluable insights from actors, and explained that the relationship between actors and their characters is symbiotic; if an actor can’t feel they can be true to a character, then everything will fall apart. Role players and game masters spoke of how interactions with other characters shaped their understandings of their own characters, and affected the decisions they made in the game. Where writers in less collaborative contexts do research and seek guidance from those they feel may have more expertise or insight, in these more collaborative contexts, characters are not just created; they are constantly negotiated and re-negotiated. Through these processes of negotiation, our participants explained that they felt their characters took on more authentic, lifelike forms. However, not all narrative media are structured to automatically support such forms of collaboration and feedback, and not all narrators and character creators have access to social circles that can enable such collaboration. Intelligent agents that can play similar roles to human collaborators (e.g., other role players and actors) could provide critical, transformative feedback to creators and narrativists that work in relative isolation.

In sum, our interviews indicate that AI could be helpful as a storytelling assistant or co-creator by offering practical assistance (e.g. flagging misrepresentations), providing support for reflection on representation, and by taking on character embodiment roles that are usually assumed by humans in collaborative creation contexts.

5 Discussion

As we move forward towards new forms of media, narrative, and interaction, we urge scholars to take a step back and question the whys of AI in storytelling. Based on existing research and current trends in AI and other domains invested in narratives, and informed by the qualitative interviews our team conducted with character creators, we recommend a reorientation towards how we conceptualize the role of AI in storytelling. Yes, we can keep moving towards a future in which AI becomes more and more adept at human forms of storytelling. But is that the preferred future? As we consider the joys and benefits humans experience by engaging in storytelling, the shortcomings of our current narrative forms and processes that can exclude groups and dilute the transformative power of narratives, and both the struggles and affordances of different processes of character creation, we see several potential branches that future AI narrative systems can grow. Here, we return to the idea of “listening”: we envision AI that better listens to human storytellers and assists us as co-creators, and AI-assisted narrative forms that enable “listening” rather than “agentic” engagement. We give examples of specific starting ideas we have for 1) designing AI to support human storytellers, and 2) investigating “listening” rather than “agentic” forms of narrative that we hope will inspire the growth of new branches of AI narrative research. We acknowledge that the current state of computational powers renders some of our suggestions only feasible through at least partial Wizard of Oz approaches; we consider these ideas as starting points to guide future research and scientific advancements. Although we think current paths of AI research merit continued work and investigation, we believe that these alternate paths of inquiry and design are at least equally promising and important.

5.1 AI and Crowd-Powered Feedback Mechanisms for Human Storytellers

As we saw from our literature review and our interviews, humans enjoy storytelling; they grow, learn, and heal from it. However, as we saw in our interviews, creating authentic characters can be a challenging and emotionally fraught task, and as we saw from discussions of (mis)representation and stereotyping in film and literature, the stories that humans currently create are not the ideal models for AI to emulate. Thus, instead of expending all our effort on teaching AI to tell stories, we can divert some of our energies to using AI to help humans tell the stories they may struggle to tell. AI has already been used to model emotional arcs in narrative, and to identify bias in a number of domains. AI could be leveraged to better identify potential problems that could dilute authenticity and stymie narrative transportation, such as exclusion and stereotyping. There are a number of approaches that already use crowd-powered “mini-corpora” to teach AI how to generate narrative (Guzdial et al., 2015; Li et al., 2012, 2013, 2014; Purdy and Riedl, 2016), but this existing work does not seek to improve experiences or engage crowd-workers in meaningful forms of storytelling. Taking cues from work in improving crowd workers’ experiences by inducing curiosity (Law et al., 2016), we might further consider how crowd-powered feedback mechanisms could allow both story creators and crowd workers to engage in and benefit from stories in different ways.

For example, taking a character-centric approach, AI systems could prove useful in identifying when characters begin to fall into traps of stereotypes or implausibility. We could consider training models using a combination of existing corpora and crowdsourcing; as a secondary benefit, we could design crowdsourcing studies such that they engage crowd workers in meaningful storytelling that contributes to larger, concretized goals so that crowd workers are also benefiting by consuming and ultimately contributing to revisions of narratives. Studies could invite participation from those who most identify with marginalized and underrepresented populations, and are thereby able to speak to concepts of authenticity around specific identities (including axes of identity that our interview participants highlighted, such as mental illness and neurodivergence). We could then apply these models to new narratives and use them to generate feedback for narrative creators (e.g., flagging certain depictions that are deemed to be inauthentic or insensitive); AI systems could also provide prompts and exercises to encourage reflective or creative practices to psychologically and creatively grapple with these challenges.
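To make the “flagging” idea above concrete, the toy sketch below scores a character description against a small, hand-written association lexicon and surfaces cues for the author to reconsider; the lexicon and attribute labels are invented for illustration, whereas the systems we envision would learn such associations from existing corpora and crowd-sourced judgments rather than a fixed word list.

```python
# Hypothetical, illustrative sketch of stereotype "flagging" for authors.
# The lexicon below is invented for demonstration; an actual assistant would
# learn associations from existing corpora and crowd-sourced annotations.

import re

STEREOTYPE_LEXICON = {
    # (character attribute, cue word) -> note surfaced to the author
    ("female", "hysterical"): "emotion-policing trope frequently applied to women",
    ("female", "shrill"): "voice-policing trope frequently applied to women",
    ("elderly", "feeble"): "frailty trope frequently applied to older characters",
}

def flag_description(description, character_attributes):
    """Return (cue, note) pairs the author may want to reflect on; not a verdict."""
    tokens = set(re.findall(r"[a-z']+", description.lower()))
    flags = []
    for (attribute, cue), note in STEREOTYPE_LEXICON.items():
        if attribute in character_attributes and cue in tokens:
            flags.append((cue, note))
    return flags

# Hypothetical usage on an invented character sketch.
print(flag_description("A shrill, ambitious lawyer in her forties", {"female"}))
# -> [('shrill', 'voice-policing trope frequently applied to women')]
```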

Considering the dynamics of collaborative character creation processes in domains like role playing games and acting, there could be opportunities to re-purpose advances we’ve made in intelligent agents. During the narrative and character creation processes, creators could engage with intelligent agents that take on certain roles in the story, and give feedback on elements that feel inauthentic or incohesive. For creators that work in relative isolation, AI could simulate the more collaborative creative atmospheres native to role-playing and acting environments, in which characters can quite literally talk back and generate thoughts of their own, thereby re-shifting and re-shaping aspects of the characters and of narratives as a whole. Under such a scenario, humans would still be the primary storytellers, just as playwrights are still the people writing the script even if they decide to make revisions and edits based on feedback they receive from actors. Although AI cannot yet simulate human intelligence to the degree that such ideas would require, we can think of ways we could use crowdsourcing and/or partial Wizard of Oz approaches to achieve similar ends and provide guidance for future goals of AI.



5.2 Innovating and Exploring “Listening” Narrative Forms

Technological progress has spawned innovation in the field of interactive narratives and narrative games, particularly in the realm of video games. However, we argue that the scholarly energy around interactive narratives might be occluding potential for technological innovation in “listening” forms of narrative in which consumers are watchers or spectators rather than active agents in the narrative. As discussed previously, research has shown that both interactive and “traditional” narrative media have positive impacts, and under certain circumstances, “traditional” narratives may be even more effective for producing particular outcomes, such as narrative transportation. However, there is room to explore what “non-traditional” listening narratives could look like and produce.

As a starting point to this path of inquiry, we could leverage existing NLP research in style transfer, which uses neural networks to learn stylistic elements of a corpus and apply the style schema onto new texts (Kabbara and Cheung, 2016; Shen et al., 2017; Carlson et al., 2017; Fu et al., 2017; Han et al., 2017). Narratives that seek to persuade, shift opinions, or otherwise transform readers are not always successful. For example, a study exposing youth to stories of LGBTQ people found that where LGBTQ youth felt more hopeful, hetero and cis youth felt more negative attitudes after the narrative exposure (Gillig and Murphy, 2016). Here, we could begin to think about how we could transform stories so that they better achieve their narrative end goals (e.g., changing attitudes) in ways that better speak to different groups of readers. Automated style transfer while maintaining diegetic plausibility and coherency could be one way to achieve this, and is worth further exploring. Again, we acknowledge that given the current state of computational sophistication, such an idea would require at least partial “Wizard of Oz” approaches.
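As a toy illustration of the learn-then-apply pattern only (and not a stand-in for the neural approaches cited above), the sketch below “learns” which lexical variants a tiny target-style corpus prefers and rewrites new text accordingly; the corpus, synonym table, and example sentence are all invented.

```python
# Toy illustration of applying a learned "style schema" to new text.
# Real style-transfer systems use neural models; here a frequency-based
# synonym preference stands in for that machinery. All data is invented.

from collections import Counter

TARGET_STYLE_CORPUS = "the tempest raged and the tempest howled through the gloaming"
VARIANTS = {"storm": ["tempest"], "dusk": ["gloaming"], "evening": ["gloaming"]}

def learn_style_schema(corpus):
    """Map plain words to the variant the target-style corpus actually uses."""
    counts = Counter(corpus.lower().split())
    schema = {}
    for plain, options in VARIANTS.items():
        best = max(options, key=lambda option: counts[option])
        if counts[best] > 0:
            schema[plain] = best
    return schema

def apply_style(text, schema):
    """Rewrite new text with the learned schema, leaving other words untouched."""
    return " ".join(schema.get(word, word) for word in text.lower().split())

schema = learn_style_schema(TARGET_STYLE_CORPUS)
print(apply_style("the storm arrived at dusk", schema))
# -> "the tempest arrived at gloaming"
```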

We could also think of how “listening” to AI-powered on-demand storytelling could soothe, heal, and transport in times when we cannot access human-generated stories, or in time-sensitive situations when the narrative specifications we are seeking are not readily met by existing, available stories. In the midst of a bad break-up, an episode of depression, an anxiety attack, a death of a loved one, a school or work-related failure, or any number of upsetting or traumatic experiences (potentially including the stresses of authentically representing “Other” characters in narratives), it may be difficult to reach out to others, and mustering the energy to actively engage in an agentic narrative might be too daunting. Instead, a listening form of narrative could be more helpful. An AI-powered listening narrative system could encourage certain emotional, psychological, or behavioral responses, such as allowing individuals to shift to more realistic and optimistic perspectives, or motivating individuals to reach out to friends, family or health support staff. It could be tailored to the specific situation or instance in which extra support is needed, could learn from one’s engagement with other narratives to cater to personal preferences of narrative style, content, and characters, and could take various media forms, such as text, audio (including more musical or sound-oriented narratives), video, augmented reality, or virtual reality. For example, multi-modal sensing could allow for branching even in the absence of explicit listener choice, such as using the listener’s nonverbal or physiological responses to make decisions or to alter the course or trajectory of the story. Given the current limitations of AI, early iterations could sample from existing corpora that have been studied to produce specific psychological or behavioral reactions.
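As a hypothetical sketch of that last idea, the snippet below picks the next story segment from a simulated physiological arousal reading rather than an explicit menu choice; the sensor stub, threshold, and branch text are invented, and a real system would rely on actual multi-modal sensing and far richer story models.

```python
# Hypothetical sketch: branching a "listening" narrative on a physiological
# signal instead of an explicit choice. The sensor reading is simulated and
# the threshold and branch texts are invented for illustration.

import random

BRANCHES = {
    "calming": "The storm passes, and the lighthouse keeper puts the kettle on...",
    "uplifting": "A letter arrives the next morning with unexpected good news...",
}

def read_arousal_signal():
    """Stand-in for a real sensor (e.g., heart rate or skin conductance), normalized to [0, 1]."""
    return random.random()

def next_segment(arousal, threshold=0.7):
    """Pick a de-escalating branch when the listener seems distressed, an uplifting one otherwise."""
    return BRANCHES["calming"] if arousal > threshold else BRANCHES["uplifting"]

arousal = read_arousal_signal()
print(f"arousal={arousal:.2f}")
print(next_segment(arousal))
```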

We see these research and design suggestions as mere starting points to inspire more interesting ideas and conversations. We look forward to further discussing the opinions and ideas we’ve laid out in this paper, and collaborating with others who share our passions for narrative and exploring the limits and potentials of technology. In other words:

To be continued. . .

6 Acknowledgements

A special thanks goes to the National Science Foundation’s Graduate Research Fellowship Program for their support of this work, and to Diyi Yang, Mark Riedl, Anjalie Field, Judeth Oden Choi, and all our reviewers, all of whom provided feedback, ideas, and references to literature that we earnestly but imperfectly tried to incorporate into this final version.



References

Mieke Bal. 2009. Narratology: Introduction to the theory of narrative. University of Toronto Press.

David Bamman, Brendan O’Connor, and Noah A. Smith. 2014a. Learning latent personas of film characters. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), page 352.

David Bamman, Ted Underwood, and Noah A. Smith. 2014b. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 370–379.

Heather Barber and Daniel Kudenko. 2007. A user model for the generation of dilemma-based interactive narratives. In Workshop on Optimizing Player Satisfaction at AIIDE, volume 7.

Eliza Berman. 2016. See the entire history of the Oscars diversity problem in one chart.

Gregory S. Berns, Kristina Blaine, Michael J. Prietula, and Brandon E. Pye. 2013. Short- and long-term effects of a novel on connectivity in the brain. Brain Connectivity, 3(6):590–600.

Marina Umaschi Bers and Justine Cassell. 1998. Interactive storytelling systems for children: Using technology to explore language and identity. Journal of Interactive Learning Research, 9(2):183.

Laurie Block and Johanna Leseho. 2005. Listen and I tell you something: Storytelling and social action in the healing of the oppressed. British Journal of Guidance & Counselling, 33(2):175–184.

Ian Bogost. 2005. Airport insecurity.

Ian Bogost. 2006. Disaffected!

Jay David Bolter and Michael Joyce. 1987. Hypertext and creative writing. In Proceedings of the ACM Conference on Hypertext, HYPERTEXT ’87, pages 41–50, New York, NY, USA. ACM.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357.

Caty Borum Chattoo. 2018. Oscars so white: Gender, racial, and ethnic diversity and social issues in US documentary films (2008–2017). Mass Communication and Society, pages 1–27.

A. Carlick and Francis C. Biley. 2004. Thoughts on the therapeutic use of narrative in the promotion of coping in cancer care. European Journal of Cancer Care, 13(4):308–317.

Keith Carlson, Allen Riddell, and Daniel N. Rockmore. 2017. Zero-shot style transfer in text using recurrent neural networks. CoRR, abs/1711.04731.

Marc Cavazza, Fred Charles, and Steven J. Mead. 2002. Character-based interactive storytelling. IEEE Intelligent Systems, 17(4):17–24.

Jane Harper Chelf, Amy M. B. Deshler, Shauna Hillman, and Ramon Durazo-Arvizu. 2000. Storytelling: A strategy for living and coping with cancer. Cancer Nursing, 23(1):1–5.

Eric Chu and Deb Roy. 2017. Audio-visual sentiment analysis for learning emotional arcs in movies. arXiv preprint arXiv:1712.02896.

Patsy Cooper. 1993. When Stories Come to School: Telling, Writing, and Performing Stories in the Early Childhood Classroom. ERIC.

Sonya Dal Cin, Mark P. Zanna, and Geoffrey T. Fong. 2004. Narrative persuasion and overcoming resistance. Resistance and Persuasion, pages 175–191.

Kerstin Dautenhahn and Chrystopher Nehaniv. 1998. Artificial life and natural stories. In International Symposium on Artificial Life and Robotics, pages 435–439. Citeseer.

Maja Djikic, Keith Oatley, Sara Zoeterman, and Jordan B. Peterson. 2009a. Defenseless against art? Impact of reading fiction on emotion in avoidantly attached individuals. Journal of Research in Personality, 43(1):14–17.

Maja Djikic, Keith Oatley, Sara Zoeterman, and Jordan B. Peterson. 2009b. On being moved by art: How reading fiction transforms the self. Creativity Research Journal, 21(1):24–29.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM.

Leah East, Debra Jackson, Louise O’Brien, and Kathleen Peters. 2010. Storytelling: An approach that can help to develop resilience. Nurse Researcher (through 2013), 17(3):17–25.

David K. Elson. 2012. Modeling narrative discourse. Columbia University.

Malte Elson, Johannes Breuer, James D. Ivory, and Thorsten Quandt. 2014. More than stories with buttons: Narrative, mechanics, and context as determinants of player experience in digital games. Journal of Communication, 64(3):521–542.

Chris Fairclough and Padraig Cunningham. 2004. AI structuralist storytelling in computer games. Technical report, Trinity College Dublin, Department of Computer Science.

10

Page 21: NAACL HLT 2018in our presentation of extant storytelling research, discourse, and innovations. Rather, we seek to spark dialogues, questions, and cross-discipline exchanges around

Ethan Fast, Tina Vachovsky, and Michael S Bernstein.2016. Shirtless and dangerous: Quantifying linguis-tic signals of gender bias in an online fiction writingcommunity. In ICWSM, pages 112–120.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, DongyanZhao, and Rui Yan. 2017. Style transfer intext: Exploration and evaluation. arXiv preprintarXiv:1711.06861.

Shira Gabriel and Ariana F Young. 2011. Becom-ing a vampire without being bitten: The narrativecollective-assimilation hypothesis. PsychologicalScience, 22(8):990–994.

Traci Gillig and Sheila Murphy. 2016. Fostering sup-port for lgbtq youth? the effects of a gay adolescentmedia portrayal on young viewers. InternationalJournal of Communication, 10:23.

Melanie C Green, Timothy C Brock, and Geoff F Kauf-man. 2004. Understanding media enjoyment: Therole of transportation into narrative worlds. Com-munication Theory, 14(4):311–327.

Melanie C Green, Sheryl Kass, Jana Carrey, BenjaminHerzig, Ryan Feeney, and John Sabini. 2008. Trans-portation across media: Repeated exposure to printand film. Media Psychology, 11(4):512–539.

Melanie C Green, Jeffrey J Strange, and Timothy CBrock. 2003. Narrative impact: Social and cogni-tive foundations. Taylor & Francis.

Kia Groom. 2015. A call to arms: Bite the hand thatfeeds you.

Matthew Guzdial, Brent Harrison, Boyang Li, andMark Riedl. 2015. Crowdsourcing open interactivenarrative. In FDG.

Mengqiao Han, Ou Wu, and Zhendong Niu. 2017. Un-supervised automatic text style transfer using lstm.In National CCF Conference on Natural LanguageProcessing and Chinese Computing, pages 281–292.Springer.

Stacey Hand and Duane Varan. 2008. Interactive nar-ratives: Exploring the links between empathy, inter-activity and structure. In European Conference onInteractive Television, pages 11–19. Springer.

D Fox Harrell and Jichen Zhu. 2009. Agency play:Dimensions of agency for interactive narrative de-sign. In AAAI spring symposium: Intelligent narra-tive technologies II, pages 44–52.

Paul L Harris. 2000. Understanding childrens worlds:The work of the imagination. Oxfo Blackwell.

Dorothee Hefner, Christoph Klimmt, and PeterVorderer. 2007. Identification with the player char-acter as determinant of video game enjoyment. InEntertainment computing–ICEC 2007, pages 39–48.Springer.

Fritz Heider and Marianne Simmel. 1944. An exper-imental study of apparent behavior. The Americanjournal of psychology, 57(2):243–259.

Hal E Hershfield, Susanne Scheibe, Tamara L Sims,and Laura L Carstensen. 2013. When feeling badcan be good: Mixed emotions benefit physical healthacross adulthood. Social psychological and person-ality science, 4(1):54–61.

Bell Hooks et al. 2006. Black looks: Race and repre-sentation. Academic Internet Pub Inc.

Ting-Hao Kenneth Huang, Francis Ferraro, NasrinMostafazadeh, Ishan Misra, Aishwarya Agrawal, Ja-cob Devlin, Ross Girshick, Xiaodong He, PushmeetKohli, Dhruv Batra, et al. 2016. Visual storytelling.In Proceedings of the 2016 Conference of the NorthAmerican Chapter of the Association for Computa-tional Linguistics: Human Language Technologies,pages 1233–1239.

Keenan M Jenkins. 2014. Choose your own adventure:Interactive narratives and attitude change. Ph.D.thesis, The University of North Carolina at ChapelHill.

Carl Gustav Jung. 1964. Man and his symbols. Laurel.

Jad Kabbara and Jackie Chi Kit Cheung. 2016. Stylis-tic transfer in natural language generation systemsusing recurrent neural networks. In Proceedingsof the Workshop on Uphill Battles in LanguageProcessing: Scaling Early Achievements to RobustMethods, pages 43–47.

Geoff F. Kaufman and Lisa K. Libby. 2012. Chang-ing beliefs and behavior through experience-taking.Journal of Personality and Social Psychology,103(2):1–19.

David Comer Kidd and Emanuele Castano. 2013.Reading literary fiction improves theory of mind.Science, 342(6156):377–380.

Edith Law, Ming Yin, Joslin Goh, Kevin Chen,Michael A. Terry, and Krzysztof Z. Gajos. 2016.Curiosity killed the cat, but makes crowdwork bet-ter. In Proceedings of the 2016 CHI Conferenceon Human Factors in Computing Systems, CHI ’16,pages 4098–4110, New York, NY, USA. ACM.

Michael Lebowitz. 1987. Planning stories. In Proceed-ings of the 9th annual conference of the cognitivescience society, pages 234–242.

Boyang Li, Stephen Lee-Urban, Darren Scott Appling,and Mark O Riedl. 2012. Crowdsourcing narrativeintelligence. Advances in Cognitive Systems, 2(1).

Boyang Li, Stephen Lee-Urban, and Mark Riedl. 2013.Crowdsourcing interactive fiction games. In FDG,pages 431–432.

Boyang Li, Mohini Thakkar, Yijie Wang, and Mark ORiedl. 2014. Data-driven alibi story telling for socialbelievability. Social Believability in Games.

11

Page 22: NAACL HLT 2018in our presentation of extant storytelling research, discourse, and innovations. Rather, we seek to spark dialogues, questions, and cross-discipline exchanges around

Brian Magerko. 2006. Player modeling in the interac-tive drama architecture. Department of ComputerScience and Engineering, University of Michigan.

Inderjeet Mani. 2012. Computational modeling ofnarrative. Synthesis Lectures on Human LanguageTechnologies, 5(3):1–142.

Marie-Louise Mares, Mary Beth Oliver, and JoanneCantor. 2008. Age differences in adults’ emotionalmotivations for exposure to films. Media Psychol-ogy, 11(4):488–511.

Michael Mateas and Andrew Stern. 2003. Facade: Anexperiment in building a fully-realized interactivedrama. In Game developers conference, volume 2,pages 4–8.

Wook-Hee Min, Eok-Soo Shim, Yeo-Jin Kim, andYun-Gyung Cheong. 2008. Planning-integratedstory graph for interactive narratives. In Pro-ceedings of the 2Nd ACM International Workshopon Story Representation, Mechanism and Context,SRMC ’08, pages 27–32, New York, NY, USA.ACM.

Janet Murray. 2004. From game-story to cyberdrama.First person: New media as story, performance, andgame, 1:2–11.

Jeeyun Oh, Mun-Young Chung, and Sangyong Han.2014. The more control, the better? Journal ofMedia Psychology.

Summer Okoye. 2016. The black commodity.

Mary Beth Oliver. 1993. Exploring the paradox of theenjoyment of sad films. Human Communication Re-search, 19(3):315–342.

M.B. Oliver. 2003. Mood management and selec-tive exposure. In Jennings Bryan, David Roskos-Ewoldsen, and Joanne Cantor, editors, Communica-tion and emotion: Essays in honor of Dolf Zillmann.Routledge.

Jessica Ouyang and Kathleen McKeown. 2015. Mod-eling reportable events as turning points in narrative.In Proceedings of the 2015 Conference on Empiri-cal Methods in Natural Language Processing, pages2149–2158.

Vivian Gussin Paley. 2009. A child’s work: The impor-tance of fantasy play. University of Chicago Press.

James W Pennebaker. 1997. Writing about emotionalexperiences as a therapeutic process. Psychologicalscience, 8(3):162–166.

Julie Porteous and Marc Cavazza. 2009. Controllingnarrative generation with planning trajectories: therole of constraints. In Joint International Confer-ence on Interactive Digital Storytelling, pages 234–245. Springer.

Gerald Prince. 1974. A grammar of stories: An intro-duction, volume 13. Walter de Gruyter.

Christopher Purdy and Mark O Riedl. 2016. Readingbetween the lines: Using plot graphs to draw in-ferences from stories. In International Conferenceon Interactive Digital Storytelling, pages 197–208.Springer.

Andrew J Reagan. 2017. Towards a science of humanstories: using sentiment analysis and emotional arcsto understand the building blocks of complex socialsystems. Ph.D. thesis, The University of Vermontand State Agricultural College.

Mark O Riedl and Brent Harrison. 2016. Using storiesto teach human values to artificial agents. In AAAIWorkshop: AI, Ethics, and Society.

Mark O Riedl and Robert Michael Young. 2010. Nar-rative planning: Balancing plot and character. Jour-nal of Artificial Intelligence Research, 39:217–268.

Ute Ritterfeld and Seung-A Jin. 2006. Addressing me-dia stigma for people experiencing mental illness us-ing an entertainment-education strategy. Journal ofhealth psychology, 11(2):247–267.

Frank Rose. 2012. The art of immersion: How the digi-tal generation is remaking Hollywood, Madison Av-enue, and the way we tell stories. WW Norton &Company.

Alyssa Rosenberg. 2012. From Lena Dunham to JunotDıaz: How to write people who aren’t you.

Paul Rozin, Lily Guillot, Katrina Fincher, AlexanderRozin, and Eli Tsukayama. 2013. Glad to be sad,and other examples of benign masochism. Judgmentand Decision Making, 8(4):439.

Marie-Laure Ryan and A-L Rebreyend. 2013. Fromnarrative games to playable stories. Nouvelle revuedesthetique, (1):37–50.

Marie-Laure Ryan et al. 2007. Toward a definition ofnarrative. The Cambridge companion to narrative,pages 22–35.

Kimiko Ryokai and Justine Cassell. 1999. Storymat: aplay space for collaborative storytelling. In CHI’99extended abstracts on Human factors in computingsystems, pages 272–273. ACM.

Maarten Sap, Marcella Cindy Prasettio, Ari Holtzman,Hannah Rashkin, and Yejin Choi. 2017. Connota-tion frames of power and agency in modern films.In Proceedings of the 2017 Conference on Empiri-cal Methods in Natural Language Processing, pages2329–2334.

Roger C Schank and Robert P Abelson. 1975. Scripts,plans, and knowledge. In IJCAI, pages 151–157.

Roger C Schank et al. 1975. Sam–a story understander.Research report no. 43.

Scott Selisker. 2015. The Bechdel test and the socialform of character networks. New Literary History,46(3):505–523.

12

Page 23: NAACL HLT 2018in our presentation of extant storytelling research, discourse, and innovations. Rather, we seek to spark dialogues, questions, and cross-discipline exchanges around

Nisi Shawl and Cynthia Ward. 2005. Writing theOther. Seattle, Washington: Aqueduct Press.

Tianxiao Shen, Tao Lei, Regina Barzilay, and TommiJaakkola. 2017. Style transfer from non-parallel textby cross-alignment. In Advances in Neural Informa-tion Processing Systems, pages 6833–6844.

Michael D Slater, Donna Rouner, and Marilee Long.2006. Television dramas and support for controver-sial public policies: Effects and mechanisms. Jour-nal of Communication, 56(2):235–252.

Zadie Smith. 2009. Speaking in tongues. The NewYork review of books, 26:1–16.

Jiyeon So and Robin Nabi. 2013. Reduction of per-ceived social distance as an explanation for media’sinfluence on personal risk perceptions: A test of therisk convergence model. Human CommunicationResearch, 39(3):317–338.

Brian Sutton-Smith. 1986. Children’s fiction making.

Brian Sutton-Smith. 2012. The folkstories of children.University of Pennsylvania Press.

Ivo Swartjes and Mariet Theune. 2006. A fabula modelfor emergent narrative. In International Conferenceon Technologies for Interactive Digital Storytellingand Entertainment, pages 49–60. Springer.

Jawad Syed. 2016. Oscars so white: an institutionalracism perspective. Counterpunch.

Abraham Tesser, Joanne V Wood, and Diederik AStapel. 2005. Building, defending, and regulatingthe self: A psychological perspective. PsychologyPress.

David Thue, Vadim Bulitko, Marcia Spetch, and EricWasylishen. 2007. Interactive storytelling: A playermodelling approach. In AIIDE, pages 43–48.

Tyler Tsay, Katherine Frain, and Serafima Fedorova.2015. Best & worst literary moments of 2015.

Frederik Van Broeckhoven, Joachim Vlieghe, and OlgaDe Troyer. 2015. Using a controlled natural lan-guage for specifying the narratives of serious games.In International conference on interactive digitalstorytelling, pages 142–153. Springer.

Rob Voigt, Nicholas P Camp, Vinodkumar Prab-hakaran, William L Hamilton, Rebecca C Hetey,Camilla M Griffiths, David Jurgens, Dan Jurafsky,and Jennifer L Eberhardt. 2017. Language frompolice body camera footage shows racial dispari-ties in officer respect. Proceedings of the NationalAcademy of Sciences, page 201702413.

Kurt Vonnegut. 1970. The shape of stories.

R Michael Young, Mark O Riedl, Mark Branly, ArnavJhala, RJ Martin, and CJ Saretto. 2004. An archi-tecture for integrating plan-based behavior genera-tion with interactive game environments. Journal ofGame Development, 1(1):51–70.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, VicenteOrdonez, and Kai-Wei Chang. 2017. Men alsolike shopping: Reducing gender bias amplifica-tion using corpus-level constraints. arXiv preprintarXiv:1707.09457.

13

Page 24: NAACL HLT 2018in our presentation of extant storytelling research, discourse, and innovations. Rather, we seek to spark dialogues, questions, and cross-discipline exchanges around

Proceedings of the First Workshop on Storytelling, pages 14–19, New Orleans, Louisiana, June 5, 2018. ©2018 Association for Computational Linguistics

Linguistic Features of Helpfulness in Automated Support for Creative Writing

Melissa Roemmele and Andrew S. Gordon
Institute for Creative Technologies, University of Southern California

[email protected], [email protected]

Abstract

We examine an emerging NLP application that supports creative writing by automatically suggesting continuing sentences in a story. The application tracks users’ modifications to generated sentences, which can be used to quantify their “helpfulness” in advancing the story. We explore the task of predicting helpfulness based on automatically detected linguistic features of the suggestions. We illustrate this analysis on a set of user interactions with the application using an initial selection of features relevant to story generation.

1 Introduction

At the intersection of natural language generation, computational creativity, and human-computer interaction research is the vision of tools that directly collaborate with people in authoring creative content. With recent work on automatically generating creative language (e.g., Ghazvininejad et al., 2017; Stock and Strapparava, 2005; Veale and Hao, 2007), this vision has started to come to fruition. One such application focuses on providing automated support to human authors for story writing. In particular, Roemmele and Gordon (2015), Khalifa et al. (2017), Manjavacas et al. (2017), and Clark et al. (2018) have developed systems that automatically generate suggestions for new sentences to continue an ongoing story.

As with other interactive language generation tasks, there is no obvious approach to evaluating these systems. The number of acceptable continuations that can be generated for a given story is open-ended, so measures that strictly rely on similarity to a constrained set of gold standard sentences, e.g. BLEU score (Papineni et al., 2002), are not ideal. Moreover, the focus of evaluation in interactive applications should be on users’ judgments of the quality of the interaction. While it is straightforward to ask users to rate generated content (McIntyre and Lapata, 2009; Perez y Perez and Sharples, 2001; Swanson and Gordon, 2012), self-reported ratings for global dimensions of quality (e.g. “on a scale of 1-5, how coherent is this sentence in this story?”) do not necessarily provide insight into the specific characteristics that influenced these judgments, which users might not even be explicitly aware of. It is more useful to examine users’ judgments on an implicit level: for example, by allowing them to adapt generated sequences. This is related to rewriting tasks in other domains like grammatical error correction (Sakaguchi et al., 2016), where annotators edit sentences to improve their perceived quality. This enables the features of the modified sequence to be compared to those of the original.

In this work, we analyze a set of user interactions with the application Creative Help (Roemmele and Gordon, 2015), where users make ‘help’ requests to automatically suggest new sentences in a story, which they can then freely modify. We take advantage of Creative Help’s functionality that tracks authors’ edits to generated sentences, resulting in an alignment between each original suggestion and its modified form. Previous work on this application compared different generation models according to the similarity between suggestions and corresponding modifications, based on the idea that more helpful suggestions will receive fewer edits. Here, we focus on quantifying suggestions according to a set of linguistic features shown by existing research to be relevant to story generation. We examine whether these features can be used to predict how much authors modify the suggestions. We propose that this type of analysis is useful for identifying the aspects of generated content authors implicitly find most helpful for writing. It can inform the evaluation of future creativity support systems in terms of how well they maximize features associated with helpfulness.

2 Application

The Creative Help interface consists simply of a text box where users can write a story. Authors are instructed that they can type \help\ at any point while writing in order to generate a suggestion for a new sentence in the story, and that they can freely modify this suggestion like any other text that already appears in the story. As soon as the suggested sentence appears to the author, the application starts tracking any edits the author makes to it. Once one minute has elapsed since the author last edited the sentence, the application logs the modified sentence alongside its original version. See Roemmele and Gordon (2015) for further details about this tracking and logging functionality. The result of authors’ interactions with the application is a dataset aligning generated suggestions to their corresponding modifications along with the story context that precedes the help request.

The current generation model integrated into Creative Help is a Recurrent Neural Network Language Model (RNN LM) with Gated Recurrent Units (GRUs) that generates sentences through iterative random sampling of its probability distribution, as described in Roemmele and Gordon (2018). The motivation for this baseline model is that by training it on a corpus of fiction stories, it produces sequences that are likely to appear in these stories, but the unpredictability associated with random sampling yields novel word combinations that may be appealing from the standpoint of creativity (Boden, 2004; Dartnall, 2013; Liapis et al., 2016). The RNN LM was trained on a subset of the BookCorpus (Kiros et al., 2015; yknzhu.wixsite.com/mbweb), which contains freely available fiction books uploaded by authors to smashwords.com. The subset included 8032 books from a variety of genres, which were split into 155,400 chapters (a little over half a billion words). To prepare the dataset for training, the stories were tokenized into lowercased words. All punctuation was treated in the same way as words. A vocabulary of all words occurring at least 25 times in the text was established, which resulted in 64,986 unique words being included in the model. All other words were mapped to a generic <UNKNOWN> token that was restricted from being generated. Proper names were handled uniquely by replacing them with a token indicating their entity type and a unique numerical identifier for that entity (e.g. <PERSON1>). During generation, a list of all entities mentioned prior to the help request was maintained. When the model generated one of these abstract entity tokens, it was replaced with an entity of the corresponding type and numerical index in the story. If no such entity type was found in the story, an entity was randomly sampled from a list of entities found in the training data.

The RNN (code at github.com/roemmele) was set up with a 300-dimension word embedding layer and two 500-dimension GRU layers. It was trained for one single iteration through all chapters, which were observed in batches of 125. The Adam algorithm (Kingma and Ba, 2015) was used for optimization. To generate a sentence when a help request was made, the model observed all text prior to the help request (the context) to compute a probability distribution for the next word. A word was sampled from this distribution, appended to the story, and this process was repeated to generate 35 words. All words after the first detected sentence boundary (based on spaCy’s sentence segmentation: spacy.io) were then filtered (in some cases, no sentence boundary was detected, so all 35 words were included in the returned sentence). Finally, the suggestion was ‘detokenized’ using some heuristics for punctuation formatting, capitalization, and merging contractions before being presented to the author.
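The sampling loop just described can be outlined as in the following sketch. This is an illustrative reconstruction, not the authors' released code: next_word_probs is a hypothetical stand-in for a forward pass of the trained RNN LM, and entity replacement and detokenization are omitted.

# Minimal sketch of iterative sampling up to a sentence boundary (illustration only).
import random

def next_word_probs(context_tokens):
    raise NotImplementedError  # hypothetical: one forward pass of the trained RNN LM

def generate_suggestion(context_tokens, max_words=35, boundary=(".", "!", "?")):
    # Sample up to max_words tokens, then keep only the first sentence.
    generated = []
    for _ in range(max_words):
        probs = next_word_probs(context_tokens + generated)
        word = random.choices(list(probs), weights=list(probs.values()))[0]
        generated.append(word)
        if word in boundary:  # first detected sentence boundary
            break
    return generated  # detokenization (capitalization, contractions) would follow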

3 Experiment and Analyses

We recruited people via social media, email, and Amazon Mechanical Turk to interact with Creative Help (https://fiction.ict.usc.edu/creativehelp/) for at least fifteen minutes. Participants were asked to write a story of their choice. They were told the objective of the task was to experiment with asking for \help\ but they were not required to make a certain number of help requests. They could choose to edit, add to, or delete a suggestion just like any other text in their story, without any requirement to change the suggestion at all. Ultimately, 139 users participated in the task, resulting in suggestion-modification pairs for 940 help requests, which includes pairs where the suggestion and modification are equivalent because no edits were made.

Initial Story: I knew it wasn’t a good idea to put the alligator in the bathtub. The problem was that there was nowhere else waterproof in the house, and Dale was going to be home in twenty minutes.
Suggested: I needed to know, too, and I was glad I was feeling it.
Modified: I needed to know how upset he would be if he found out about my adoption spree.

Initial Story: My brother was a quiet boy. He liked to spend time by himself in his room and away from others. It wasn’t such a bad thing, as it allowed him to focus on his more creative side. He would write books, draw comics, and write lyrics for songs that he would learn to play as he got older.
Suggested: He’d have to learn to get in touch with my father.
Modified: He had an ok relationship with my parents, but mostly because they supported his separation.

Table 1: Examples of generated suggestions and corresponding modifications with their preceding context

Given this dataset of pairs, we first quantified the degree to which authors edited the suggestions. In particular, we calculated the similarity between each suggestion and corresponding modification in terms of Levenshtein edit distance: similarity = 1 − dist(sug, mod) / max(|sug|, |mod|), where higher values indicate more similarity. The mean similarity score for this dataset was 0.695 (SD=0.346), indicating that authors most often chose to retain large parts of the suggestions instead of fully rewriting them. We investigated whether these similarity scores could be predicted by the linguistic features of the suggestions. Features that significantly correlate with Levenshtein similarity can be interpreted as being ‘helpful’ in influencing authors to make use of the original suggestion in their story. It is certainly possible to use other similarity metrics to quantify helpfulness, such as similarity in terms of word embeddings. These measures may model similarity below the surface text of the suggestion, in which the modification may use different words to alternatively express the same story event or idea.
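For concreteness, the score can be computed as in the following self-contained sketch, which assumes character-level edit distance (the text does not state whether the distance is taken over characters or words).

# Normalized Levenshtein similarity: 1 - dist(sug, mod) / max(|sug|, |mod|)
def levenshtein(a, b):
    # standard dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(sug, mod):
    if not sug and not mod:
        return 1.0
    return 1 - levenshtein(sug, mod) / max(len(sug), len(mod))

print(similarity("I needed to know, too.",
                 "I needed to know how upset he would be."))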

With this approach, given a metric for any feature, the helpfulness of that feature can be quantified. Here, we selected some features used in previous work on story generation and evaluating writing quality. In particular, we included some features used in systems applied to the Story Cloze Test (Mostafazadeh et al., 2016), which involves selecting the most likely ending for a given story from a provided set of candidates. Roemmele et al. (2017a) also explored some of these metrics to compare different models for sentence-based story continuation in an offline framework. Our metrics consist of those that analyze the individual features of a sentence by itself (story-independent, Metrics 1-7 below), and those that analyze the sentence with reference to the story context that precedes the suggestion (story-dependent, Metrics 8-14 below). For the story-dependent metrics, we only considered suggestions that did not appear as the first sentence in the story (910 suggestions).

Sentence Length: The length of a candidate ending in the Story Cloze Test was found to predict its correctness (Bugert et al., 2017; Schwartz et al., 2017). We measured the length of the suggestion in terms of its number of words (Metric 1).

Grammaticality: Grammaticality is an obvious feature of high-quality writing. We used Language Tool (Miłkowski, 2010; code at pypi.python.org/pypi/language-check), a rule-based system that detects various grammatical errors. This system computed an overall grammaticality score for each sentence, equal to the proportion of total words in the sentence deemed to be grammatically correct (Metric 2).
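A hedged sketch of this metric using the language-check package cited above follows. The exact scoring procedure (the proportion of words deemed correct) is approximated here by discounting one word per detected issue, which is an assumption rather than the authors' implementation.

import language_check

tool = language_check.LanguageTool('en-US')

def grammaticality(sentence):
    # Approximate the proportion of grammatically acceptable words by
    # subtracting one word per issue that LanguageTool flags.
    words = sentence.split()
    if not words:
        return 0.0
    issues = tool.check(sentence)
    return max(0.0, 1 - len(issues) / len(words))

print(grammaticality("He have went to the store yesterday."))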

Lexical Frequency: Writing quality has been found to correlate with the use of unique words (Burstein and Wolska, 2003; Crossley et al., 2011). We computed the average frequency of the words in each suggestion according to their Good-Turing smoothed counts in the Reddit Comment Corpus (spacy.io/docs/api/token) (Metric 3).

Syntactic Complexity: Writing quality is also associated with greater syntactic complexity (McNamara et al., 2010; Pitler and Nenkova, 2008). We examined this feature in terms of the number and length of syntactic phrases in the generated sentences. Phrase length was approximated by the number of children under each head verb/noun as given by the dependency parse. We counted the total number of noun phrases (Metric 4) and words per noun phrase (Metric 5), and likewise the number of verb phrases (Metric 6) and words per verb phrase (Metric 7). These metrics were all normalized by sentence length.
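The noun-phrase metrics can be approximated with spaCy's parser, as in the sketch below; note that it uses spaCy's noun_chunks rather than counting children under each head word, so it is a simplification of the procedure described above rather than the authors' exact implementation.

import spacy

nlp = spacy.load("en_core_web_sm")

def noun_phrase_metrics(sentence):
    doc = nlp(sentence)
    n_words = len([t for t in doc if not t.is_punct]) or 1
    chunks = list(doc.noun_chunks)                 # noun phrases from the parse
    n_nps = len(chunks)
    np_len = sum(len(c) for c in chunks) / n_nps if n_nps else 0.0
    # both normalized by sentence length, as in Metrics 4-5
    return {"num_nps": n_nps / n_words, "np_length": np_len / n_words}

print(noun_phrase_metrics("The old calendar on the wall showed a date from last year."))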

Lexical Cohesion: Correct endings in the Story Cloze Test tend to have higher lexical similarity to their contexts according to statistical measures of similarity (Mihaylov and Frank, 2017; Mostafazadeh et al., 2016; Flor and Somasundaran, 2017). We analyzed lexical cohesion between the context and suggestion in terms of their Jaccard similarity (proportion of overlapping words; Metric 8), GloVe word embeddings (nlp.stanford.edu/projects/glove) trained on the Common Crawl corpus (Metric 9), and sentence (skip-thought) vectors (Kiros et al., 2015; github.com/ryankiros/skip-thoughts) trained on the BookCorpus (Metric 10). For the latter two, the score was the cosine similarity between the means of the context and suggestion vectors, respectively.
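Two of these cohesion scores are sketched below; embed is a hypothetical lookup returning a pretrained word vector (e.g., GloVe), so this is an illustration rather than the exact pipeline used in the paper.

import numpy as np

def jaccard(context, suggestion):
    a, b = set(context.lower().split()), set(suggestion.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def embedding_cosine(context, suggestion, embed):
    # cosine similarity between the mean word vectors of context and suggestion
    c = np.mean([embed(w) for w in context.lower().split()], axis=0)
    s = np.mean([embed(w) for w in suggestion.lower().split()], axis=0)
    return float(np.dot(c, s) / (np.linalg.norm(c) * np.linalg.norm(s)))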

Style Consistency: Automated measures of writing style have been used to predict the success of fiction novels (Ganjigunte Ashok et al., 2013). Moreover, Schwartz et al. (2017) found that simple n-gram style features could distinguish between correct and incorrect endings in the Story Cloze Test. We examined the similarity in style between the context and suggestion in terms of their distributions of coarse-grained part-of-speech tags, using the same approach as Ireland and Pennebaker (2010). The similarity between the context c and suggestion s for each POS category was quantified as 1 − |pos_c − pos_s| / (pos_c + pos_s), where pos is the proportion of words with that tag. We averaged the scores across all POS categories (Metric 11). We also looked at style in terms of the Jaccard similarity between the POS trigrams in the context and suggestion (Metric 12).
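A small sketch of the POS-distribution similarity (Metric 11) follows; the tags are assumed to come from any coarse-grained tagger (e.g., spaCy's token.pos_), which is an assumption, not a detail stated in the paper.

from collections import Counter

def proportions(tags):
    counts = Counter(tags)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def pos_style_similarity(context_tags, suggestion_tags):
    pc, ps = proportions(context_tags), proportions(suggestion_tags)
    scores = []
    for tag in set(pc) | set(ps):
        c, s = pc.get(tag, 0.0), ps.get(tag, 0.0)
        scores.append(1 - abs(c - s) / (c + s))   # per-category similarity
    return sum(scores) / len(scores) if scores else 0.0

print(pos_style_similarity(["DET", "NOUN", "VERB", "NOUN"], ["PRON", "VERB", "NOUN"]))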

Sentiment Similarity: The relation between the sentiment of a story and a candidate ending in the Story Cloze Test can be used to predict its correctness (Flor and Somasundaran, 2017; Goel and Singh, 2017; Bugert et al., 2017). We applied sentiment analysis to the context and suggestion using the tool described in Staiano and Guerini (2014) (github.com/marcoguerini/DepecheMood/releases), which provides a valence score for 11 emotions. For each emotion, we computed the inverse distance 1 / (1 + |e_c − e_s|) between the context and suggestion scores e_c and e_s, respectively. We averaged these values across all emotions to get one overall sentiment similarity score (Metric 13).
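A sketch of Metric 13 is given below, assuming emotions(text) is a hypothetical wrapper around the cited tool that returns a dict of valence scores for the 11 emotions.

def sentiment_similarity(context, suggestion, emotions):
    ec, es = emotions(context), emotions(suggestion)
    # inverse distance per emotion, averaged over all emotions
    sims = [1.0 / (1.0 + abs(ec[e] - es[e])) for e in ec]
    return sum(sims) / len(sims)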

Entity Coreference: Events in stories are linked by common entities (e.g. characters, locations, and objects), so coreference between entity mentions is particularly important for establishing the coherence of a story (Elsner, 2012). We calculated the proportion of entities in each suggestion that coreferred to an entity in the corresponding context, using CoreNLP (stanfordnlp.github.io/CoreNLP) (Metric 14).

4 Results and Conclusion

Metric                      ρ
1. Sentence length      -0.082
2. Grammaticality        0.097 *
3. Word frequency        0.058
4. # NPs                 0.112 *
5. NP length             0.052
6. # VPs                 0.001
7. VP length            -0.022
8. Jaccard sim           0.017
9. GloVe sim             0.105 *
10. Skip-thought sim     0.258 *
11. Word POS sim        -0.037
12. Trigram POS sim     -0.023
13. Sentiment sim        0.107 *
14. Coreference          0.134 *

Table 2: Correlation ρ between metric scores for suggestions and similarity to modifications (* = statistically significant, p < 0.005)

Table 2 shows the Spearman correlation coefficient (ρ) between the suggestion scores for each metric and their Levenshtein similarity to the resulting modifications. This coefficient indicates the degree to which the corresponding feature predicted authors’ modifications, where higher coefficients mean that authors applied fewer edits. Statistically significant correlations (p < 0.005) are marked with an asterisk in the table, indicating that suggestions with higher scores on these metrics were particularly helpful to authors. Suggestion length did not have a significant impact, but grammaticality emerged as a helpful feature. The frequency scores of the words in the suggestions did not significantly influence their helpfulness. In terms of syntactic complexity, suggestions with more noun phrases were edited less often, but verb complexity was not influential. For lexical cohesion, the number of overlapping words between the suggestion and its context (Jaccard similarity) was not predictive, but vector-based similarity was an indicator of helpfulness. Similarity in terms of sentence (skip-thought) vectors was the most helpful feature overall, which suggests these representations are indeed useful for modeling coherence between neighboring sentences in a story. Analogously, Roemmele et al. (2017b) and Srinivasan et al. (2018) found that these representations were very effective for encoding story sentences in the Story Cloze Test in order to predict correct endings. Neither metric for style similarity predicted authors’ edits, but sentiment similarity between the suggestion and context was significantly helpful. Finally, suggestions that more frequently coreferred to entities introduced in the context were more helpful.
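The correlations reported in Table 2 are simple to reproduce for any new feature; the following sketch uses scipy with small made-up arrays standing in for per-suggestion metric scores and Levenshtein similarities.

from scipy.stats import spearmanr

metric_scores = [0.9, 0.4, 0.7, 0.2, 0.8]   # hypothetical per-suggestion feature values
similarities  = [0.95, 0.5, 0.6, 0.3, 0.7]  # hypothetical suggestion-modification similarities

rho, p = spearmanr(metric_scores, similarities)
print("Spearman rho = %.3f (p = %.3f)" % (rho, p))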

These results describe this particular sample of Creative Help interactions for a selected set of features relevant to story generation, but this analysis can be scaled to determine the influence of any feature in an automated writing support framework where authors can adapt generated content. The objective of this approach is to leverage data from user interactions with the system to establish an automated feedback loop for evaluation, by which features that emerge as helpful can be promoted in future systems.

Acknowledgments

The projects or efforts depicted were or are sponsored by the U.S. Army. The content or information presented does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

Margaret A Boden. 2004. The creative mind: Myths and mechanisms. Psychology Press.

Michael Bugert, Yevgeniy Puzikov, Andreas Rückle, Judith Eckle-Kohler, Teresa Martin, Eugenio Martínez-Cámara, Daniil Sorokin, Maxime Peyrard, and Iryna Gurevych. 2017. LSDSem 2017: Exploring data generation methods for the story cloze test. In LSDSem 2017.

Jill Burstein and Magdalena Wolska. 2003. TowardEvaluation of Writing Style: Finding Overly Repet-itive Word Use in Student Essays. In Proceedingsof the Tenth Conference on European Chapter of theAssociation for Computational Linguistics. Associ-ation for Computational Linguistics, Stroudsburg,PA, USA, pages 35–42.

Elizabeth Clark, Anne Ross, Chenhao Tan, YangfengJi, and Noah A. Smith. 2018. Creative Writing witha Machine in the Loop: Case Studies on Slogans andStories. In Proceedings of the 23rd ACM Confer-ence on Intelligent User Interfaces. IUI’2018.

Scott A. Crossley, Jennifer L. Weston, Susan T. McLainSullivan, and Danielle S. McNamara. 2011. The De-velopment of Writing Proficiency as a Function ofGrade Level: A Linguistic Analysis. Written Com-munication 28(3):282–311.

Terry Dartnall. 2013. Artificial intelligence and cre-ativity: An interdisciplinary approach, volume 17.Springer Science & Business Media.

Micha Elsner. 2012. Character-based Kernels for Nov-elistic Plot Structure. In Proceedings of the 13thConference of the European Chapter of the Associa-tion for Computational Linguistics. EACL ’12.

Michael Flor and Swapna Somasundaran. 2017. Sen-timent Analysis and Lexical Cohesion for the StoryCloze Task. In LSDSem 2017.

Vikas Ganjigunte Ashok, Song Feng, and Yejin Choi.2013. Success with Style: Using Writing Style toPredict the Success of Novels. In Proceedings ofthe 2013 Conference on Empirical Methods in Natu-ral Language Processing. Association for Computa-tional Linguistics, Seattle, Washington, USA, pages1753–1764.

Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, andKevin Knight. 2017. Hafez: an Interactive PoetryGeneration System. In Proceedings of ACL 2017,System Demonstrations. Association for Computa-tional Linguistics, pages 43–48.

Pranav Goel and Anil Kumar Singh. 2017. IIT (BHU):System Description for LSDSem17 Shared Task. InLSDSem 2017.

Molly E Ireland and James W Pennebaker. 2010. Lan-guage style matching in writing: synchrony in es-says, correspondence, and poetry. Journal of per-sonality and social psychology 99(3):549.

Ahmed Khalifa, Gabriella AB Barros, and Julian To-gelius. 2017. DeepTingle. In International Confer-ence on Computational Creativity 2017.

Diederik Kingma and Jimmy Ba. 2015. Adam: Amethod for stochastic optimization. In The Interna-tional Conference on Learning Representations. SanDiego.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov,Richard S. Zemel, Antonio Torralba, Raquel Ur-tasun, and Sanja Fidler. 2015. Skip-thought vec-tors. In Proceedings of the 28th International Con-ference on Neural Information Processing Systems.MIT Press, Cambridge, MA, USA, pages 3294–3302.

Antonios Liapis, Georgios N Yannakakis, ConstantineAlexopoulos, and Phil Lopes. 2016. Can comput-ers foster human users creativity? Theory and praxisof mixed-initiative co-creativity. Digit. Cult. Educ.(DCE) 8(2):136–152.

Enrique Manjavacas, Folgert Karsdorp, Ben Burten-shaw, and Mike Kestemont. 2017. Synthetic Litera-ture: Writing Science Fiction in a Co-Creative Pro-cess. In Proceedings of the Workshop on Compu-tational Creativity in Natural Language Generation(CC-NLG 2017). pages 29–37.


Neil McIntyre and Mirella Lapata. 2009. Learning toTell Tales: A Data-driven Approach to Story Gen-eration. In Proceedings of the Joint Conference ofthe 47th Annual Meeting of the ACL and the 4th In-ternational Joint Conference on Natural LanguageProcessing of the AFNLP. Association for Compu-tational Linguistics, Stroudsburg, PA, USA, pages217–225.

Danielle S. McNamara, Scott A. Crossley, andPhilip M. McCarthy. 2010. Linguistic features ofwriting quality. Written Communication 27(1):57–86.

Todor Mihaylov and Anette Frank. 2017. Story Clozeending selection baselines and data examination. InLSDSem 2017.

Marcin Miłkowski. 2010. Developing an open-source,rule-based proofreading tool. Softw. Pract. Exper.40(7):543–566.

Nasrin Mostafazadeh, Nathanael Chambers, XiaodongHe, Devi Parikh, Dhruv Batra, Lucy Vanderwende,Pushmeet Kohli, and James Allen. 2016. A Corpusand Cloze Evaluation for Deeper Understanding ofCommonsense Stories. In Proceedings of the 14thAnnual Conference of the North American Chap-ter of the Association for Computational Linguis-tics: Human Language Technologies (NAACL HLT2016). pages 839–849.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automaticevaluation of machine translation. In Proceedings ofthe 40th Annual Meeting on Association for Compu-tational Linguistics. Association for ComputationalLinguistics, pages 311–318.

Rafael Perez y Perez and Mike Sharples. 2001. MEX-ICA: A computer model of a cognitive account ofcreative writing. Journal of Experimental & Theo-retical Artificial Intelligence 13(2):119–139.

Emily Pitler and Ani Nenkova. 2008. Revisiting read-ability: A unified framework for predicting textquality. In Proceedings of the Conference on Em-pirical Methods in Natural Language Processing.Association for Computational Linguistics, Strouds-burg, PA, USA, pages 186–195.

Melissa Roemmele and Andrew S Gordon. 2015. Cre-ative Help: A Story Writing Assistant. In Inter-national Conference on Interactive Digital Story-telling. Springer International Publishing.

Melissa Roemmele and Andrew S. Gordon. 2018. Au-tomated Assistance for Creative Writing with anRNN Language Model. In Proceedings of the 23rdInternational Conference on Intelligent User In-terfaces Companion. ACM, New York, NY, USA,IUI’18, pages 21:1–21:2.

Melissa Roemmele, Andrew S Gordon, and ReidSwanson. 2017a. Evaluating Story Generation Sys-tems Using Automated Linguistic Analyses. In

SIGKDD 2017 Workshop on Machine Learning forCreativity.

Melissa Roemmele, Sosuke Kobayashi, Naoya Inoue,and Andrew Gordon. 2017b. An RNN-based Bi-nary Classifier for the Story Cloze Test. In LSDSem2017.

Keisuke Sakaguchi, Courtney Napoles, Matt Post, andJoel Tetreault. 2016. Reassessing the Goals ofGrammatical Error Correction: Fluency Instead ofGrammaticality. Transactions of the Association forComputational Linguistics 4:169–182.

Roy Schwartz, Maarten Sap, Ioannis Konstas, LeilaZilles, Yejin Choi, and Noah A Smith. 2017. The Ef-fect of Different Writing Tasks on Linguistic Style:A Case Study of the ROC Story Cloze Task. InThe SIGNLL Conference on Computational NaturalLanguage Learning (CoNLL 2017).

Siddarth Srinivasan, Richa Arora, and Mark Riedl.2018. A Simple and Effective Approach to the StoryCloze Test. In Proceedings of the 16th Annual Con-ference of the North American Chapter of the Asso-ciation for Computational Linguistics: Human Lan-guage Technologies (NAACL HLT 2018).

Jacopo Staiano and Marco Guerini. 2014. De-pecheMood: A lexicon for emotion analysis fromcrowd-annotated news. In Proceedings of the 52ndAnnual Meeting of the Association for Computa-tional Linguistics (Short Papers).

Oliviero Stock and Carlo Strapparava. 2005. The actof creating humorous acronyms. Applied ArtificialIntelligence 19(2):137–151.

Reid Swanson and Andrew S. Gordon. 2012. SayAnything: Using Textual Case-Based Reasoningto Enable Open-Domain Interactive Storytelling.ACM Transactions on Interactive Intelligent Systems2(3):1–35.

Tony Veale and Yanfen Hao. 2007. Comprehendingand generating apt metaphors: a web-driven, case-based approach to figurative language. In Proceed-ings of AAAI.


Proceedings of the First Workshop on Storytelling, pages 20–32, New Orleans, Louisiana, June 5, 2018. ©2018 Association for Computational Linguistics

A Pipeline for Creative Visual Storytelling

Stephanie M. Lukin, Reginald Hobbs, Clare R. Voss
U.S. Army Research Laboratory
Adelphi, MD
[email protected]

Abstract

Computational visual storytelling produces a textual description of events and interpretations depicted in a sequence of images. These texts are made possible by advances and cross-disciplinary approaches in natural language processing, generation, and computer vision. We define a computational creative visual storytelling as one with the ability to alter the telling of a story along three aspects: to speak about different environments, to produce variations based on narrative goals, and to adapt the narrative to the audience. These aspects of creative storytelling and their effect on the narrative have yet to be explored in visual storytelling. This paper presents a pipeline of task-modules, Object Identification, Single-Image Inferencing, and Multi-Image Narration, that serve as a preliminary design for building a creative visual storyteller. We have piloted this design for a sequence of images in an annotation task. We present and analyze the collected corpus and describe plans towards automation.

1 Introduction

Telling stories from multiple images is a creative challenge that involves visually analyzing the images, drawing connections between them, and producing language to convey the message of the narrative. To computationally model this creative phenomenon, a visual storyteller must take into consideration several aspects that will influence the narrative: the environment and presentation of imagery (Madden, 2006), the narrative goals which affect the desired response of the reader or listener (Bohanek et al., 2006; Thorne and McLean, 2003), and the audience, who may prefer to read or hear different narrative styles (Thorne, 1987).

The environment is the content of the imagery, but also its interpretability (e.g., image quality). Canonical images are available from a number of high-quality datasets (Everingham et al., 2010; Plummer et al., 2015; Lin et al., 2014; Ordonez et al., 2011); however, there is little coverage of low-resourced domains with low-quality images or atypical camera perspectives that might appear in a sequence of pictures taken from blind persons, a child learning to use a camera, or a robot surveying a site. For this work, we studied an environment with odd surroundings taken from a camera mounted on a ground robot.

Narrative goals guide the selection of what objects or inferences in the image are relevant or uncharacteristic. The result is a narrative tailored to different goals such as a general “describe the scene”, or a more focused “look for suspicious activity”. The most salient narrative may shift as new information, in the form of images, is presented, offering different possible interpretations of the scene. This work posed a forensic task with the narrative goal to describe what may have occurred within a scene, assuming some temporal consistency across images. This open-endedness evoked creativity in the resulting narratives.

The telling of the narrative will also differ based upon the target audience. A concise narrative is more appropriate if the audience is expecting to hear news or information, while a verbose and humorous narrative is suited for entertainment. Audiences may differ in how they would best experience the narrative: immersed in the first person or through an omniscient narrator. The audience in this work was unspecified, thus the audience was the same as the storyteller defining the narrative.

To build a computational creative visual storyteller that customizes a narrative along these three aspects, we propose a creative visual storytelling pipeline requiring separate task-modules for Object Identification, Single-Image Inferencing, and Multi-Image Narration. We have conducted an exploratory pilot experiment following this pipeline to collect data from each task-module to train the computational storyteller. The collected data provides instances of creative storytelling from which we have analyzed what people see and pay attention to, what they interpret, and how they weave together a story across a series of images.

Figure 1: Creative Visual Storytelling Pipeline: T1 (Object Identification), T2 (Single-Image Inferencing), T3 (Multi-Image Narration)

Creative visual storytelling requires an understanding of the creative processes. We argue that existing systems cannot achieve these creative aspects of visual storytelling. Current object identification algorithms may perform poorly on low-resourced environments with minimal training data. Computer vision algorithms may over-identify objects, that is, describe more objects than are ultimately needed for the goal of a coherent narrative. Algorithms that generate captions of an image often produce generic language, rather than language tailored to a specific audience. Our pilot experiment is an attempt to reveal the creative processes involved when humans perform this task, and then to computationally model the phenomena from the observed data.

Our pipeline is introduced in Section 2, where we also discuss computational considerations and the application of this pipeline to our pilot experiment. In Section 3 we describe the exploratory pilot experiment, in which we presented images of a low-quality and atypical environment and had annotators answer “what may have happened here?” This open-ended narrative goal has the potential to elicit diverse and creative narratives. We did not specify the audience, leaving the annotator free to write in a style that appeals to them. The data and analysis of the pilot are presented in Section 4, as well as observations for extending to crowdsourcing a larger corpus and how to use these creative insights to build computational models that follow this pipeline. In Section 5 we compare our approach to recent works in other storytelling methodologies, then conclude and describe future directions of this work in Section 6.

2 Creative Visual Storytelling Pipeline

The pipeline and interaction of task-modules we have designed to perform creative visual storytelling over multiple images are depicted in Figure 1. Each task-module answers a question critical to creative visual storytelling: “what is here?” (T1: Object Identification), “what happens here?” (T2: Single-Image Inferencing), and “what has happened so far?” (T3: Multi-Image Narration). We discuss the purpose, expected inputs and outputs of each module, and explore computational implementations of the pipeline.

2.1 Pipeline

This section describes the task-modules we designed that provide answers to our questions for creative visual storytelling.

Task-Module 1: Object Identification (T1). Objects in an image are the building blocks for storytelling that answer the question, literally, “what is here?” This question is asked of every image in a sequence for the purposes of object curation. From a single image, the expected outputs are objects and their descriptors. We anticipate that two categories of object descriptors will be informative for interfacing with the subsequent task-modules: spatial descriptors, consisting of object co-locations and orientation, and observational attribute descriptors, including color, shape, or texture of the object. Confidence level will provide information about the expectedness of the object and its descriptors, or if the object is difficult or uncertain to decipher given the environment.

Task-Module 2: Single-Image Inferencing (T2). Dependent upon T1, the Single-Image Inferencing task-module is a literal interpretation derived from the objects previously identified in the context of the current image. After the curation of objects in T1, a second round of content selection commences in the form of inference determination and selection. Using the selected objects, descriptors, and expectations about the objects, this task-module answers the question “what happens here?” For example, the function of “kitchen” might be extrapolated from the co-location of a cereal box, pan, and crockpot.

Separating T2 from T1 creates a modular system where each task-module can make the best decision given the information available. However, these task-modules are also interdependent: as the inferences in T2 depend upon T1 for object selection, so too does the object selection depend upon the inferences drawn so far.

Task-Module 3: Multi-Image Narration (T3). A narrative can indeed be constructed from a single image; however, we designed our pipeline to consider when additional context, in the form of additional images, is provided. The Multi-Image Narration task-module draws from T1 and T2 to construct the larger narrative. All images, objects, and inferences are taken into consideration when determining “what has happened so far?” and “what has happened from one image to the next?” This task-module performs narrative planning by referencing the inferences and objects from the previous images. It then produces a natural language output in the form of a narrative text. Plausible narrative interpretations are formed from global knowledge about how the addition of new images confirm or disprove prior hypotheses and expectations.
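The division of labor among the three task-modules could be organized as in the following sketch; the class and function names are hypothetical and only illustrate how T1 and T2 outputs would feed the T3 narrator, not the authors' implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectAnnotation:            # T1 output: "what is here?"
    label: str
    attributes: List[str] = field(default_factory=list)    # color, shape, texture
    co_locations: List[str] = field(default_factory=list)  # spatial descriptors
    confidence: float = 1.0

@dataclass
class Inference:                   # T2 output: "what happens here?"
    interpretation: str            # e.g., "this space functions as a kitchen"
    evidence: List[str] = field(default_factory=list)      # supporting object labels

def tell_story(images, identify, infer, narrate):
    # T3: accumulate per-image objects and inferences, then narrate the sequence.
    all_objects, all_inferences = [], []
    for image in images:
        objects = identify(image)                     # T1
        all_objects.append(objects)
        all_inferences.append(infer(objects, image))  # T2, conditioned on T1 output
    return narrate(all_objects, all_inferences)       # "what has happened so far?"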

2.2 From Pipeline Design to Pilot

Our first step towards building this automated pipeline is to pilot it. We will use the dataset collected and the results from the exploratory study to build an informed computational, creative visual storyteller. When piloting, we refer to this pipeline as a sequence of annotation tasks.

T1 is based on computer vision technology. Of particular interest are our collected annotations on the low-quality and atypical environments that traditionally do not have readily available object annotations. Commonsense reasoning and knowledge bases drive the technology behind deriving T2 inferences. T3 narratives consist of two sub-task-modules: narrative planning and natural language generation. Each technology can be matched to our pipeline, and be built up separately, leveraging existing works, but tuned to this task.

Our annotators are required to write in natural language (though we do not specify full sentences) the answers to the questions posed in each task-module. While this natural language intermediate representation of T1 and T2 is appropriate for a pilot study, a semantic representation of these task-modules might be more feasible for computation until the final rendering of the narrative text. For example, drawing inferences in T2 with the objects identified in T1 might be better achieved with an ontological representation of objects and attributes, such as WordNet (Fellbaum, 1998), and inferences mined from a knowledge base.
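As one example of such an ontological representation, WordNet hypernyms can relate T1 object labels to a shared function, as in the sketch below (an illustration; the paper does not prescribe this particular lookup).

from nltk.corpus import wordnet as wn

def shared_hypernyms(word_a, word_b):
    # Relate two object labels through their lowest common hypernym in WordNet.
    syns_a = wn.synsets(word_a, pos=wn.NOUN)
    syns_b = wn.synsets(word_b, pos=wn.NOUN)
    if not syns_a or not syns_b:
        return []
    common = syns_a[0].lowest_common_hypernyms(syns_b[0])
    return [s.name() for s in common]

# e.g., relating two kitchen objects like those identified in image3
print(shared_hypernyms("pan", "kettle"))   # typically a cooking-utensil synset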

In our annotation, the sub-task-modules of narrative planning and natural language generation are implicitly intertwined. The annotator does not note in the exercise intermediary narrative planning before writing the final text. In computation, T3 may generate the final narrative text word-by-word (combining narrative planning and natural language generation). Another approach might first perform narrative planning, followed by generation from a semantic or syntactic representation that is compatible with intermediate representations from T1 and T2.

3 Pilot Experiment

A paper-based pilot experiment implementing this pipeline was conducted. Ten annotators (A1–A10) participated in the annotation of the three images in Figure 2 (image1–image3). (A5, an author of this paper, designed the experiment and examples; all annotators had varying degrees of familiarity with the environment in the images.) These images were taken from a camera mounted on a ground robot while it navigated an unfamiliar environment. The environment was static; thus, presenting these images in temporal order was not as critical as it would have been if the images were still-frames taken from a video or if the images contained a progression of actions or events.

Figure 2: image1, image2, and image3 in Pilot Experiment Scene

Annotators first addressed the questions posed in the Object Identification (T1) and Single-Image Inference (T2) task-modules for image1. They repeated the process for image2 and image3, and authored a Multi-Image Narrative (T3). The annotator work flow mimicked the pipeline presented in Figure 1. For each subsequent image, the time allotted increased from five, to eight, to eleven minutes to allow more time for the narrative to be constructed after annotators processed the additional images. An example image sequence with answers was provided prior to the experiment. A5 gave a brief, oral, open-ended explanation of the experiment so as not to bias annotators to what they should focus on in the scene or what kind of language they should use. The goal of this data collection is to gather data that models the creative storytelling processes, not to track these processes in real-time. A future web-based interface will allow us to track the timing of annotation, what information is added when, and how each task-module influences the other task-modules for each image.

Object Identification did not require annotators to define a bounding box for labeled objects, nor were annotators required to provide objective descriptors. (As we design a web-based version of this experiment, we will enforce interfaces explicitly linked to object annotations, and the desire to view previously annotated images.) Annotators authored natural language labels, phrases, or sentences to describe objects, attributes, and spatial relations while indicating confidence levels if appropriate.

During Single-Image Inferencing, annotators were shown their response from T1 as they authored a natural language description of activity or functions of the image, as well as a natural language explanation of inferences for that determination, citing supporting evidence from T1 output. For a single image, annotators may answer the questions posed by T1 and T2 in any order to build the most informed narrative.

Annotators authored a Multi-Image Narrative to explain what has happened in the sequence of images presented so far. For each image seen in the sequence, annotators were shown their own natural language responses from T1 and T2 for those images. Annotators were encouraged to look back to their responses in previous images (as the bottom row of Figure 1 indicates), but not to make changes to their responses about the previous images. They were, however, encouraged to incorporate previous feedback into the context of the current image. From this task-module, annotators wrote a natural language narrative connecting activity or functions in the images, which will be used to learn how to weave together a story across the images.

The open-ended “what has happened here?” narrative goal has no single answer. These annotations may be treated as ground truth, but we run the risk of potentially missing out on creative alternatives. Bootstrapping all possible objects and inferences would achieve greater coverage, yet this quickly becomes infeasible. We lean toward the middle, where the answers collected will help determine what annotators deem as important.

4 Results and Analysis

In this section, we discuss and analyze the collected data and provide insights for incorporating each task-module into a computational system.


# Annotators   Objects
10             calendar, water bottle
9              computer, table/desk
8              chair
4              walls, window
2              blue triangles
1              floor, praying rug

Table 1: Objects identified by annotators in image1

# Annotators   Objects
10             suitcase, shirt
8              sign
6              green object
5              fire extinguisher
4              walls
3              bag
2              floor, window
1              coat hanger, shoes, rug

Table 2: Objects identified by annotators in image2

# Annotators   Objects
7              crockpot, cereal box, table
6              pan
5              container
2              walls, label
1              thread and needle, coffee pot, jam, door frame

Table 3: Objects identified by annotators in image3 (total of 7 annotators)

4.1 Object Identification (T1)

Thirty-three objects were identified across the images.3 A5 identified the most of these objects (20), and A1, the least (10). Tables 1 - 3 show the objects identified and how many annotators referenced each object. A set of objects emerged in each image that captured the annotators' attention. Object descriptor categories are tabulated in Table 4.4 Not surprisingly, the most common descriptors were attributes, e.g., color and shape, followed by co-locations. Orientation was not observed in this dataset; however, this category may be useful for other disrupted environments. We observed instances of uncertainty, e.g., "a suitcase, not entirely sure, because of zipper and size", and unexpected objects, "unfinished floor", whereas "floors" may have not been labeled otherwise.

Lack of coverage and overlap in this task with respect to objects and descriptors is not discouraging. In fact, we argue that exhaustive object

3 Due to time constraints, A2 - A4 did not complete image3.
4 Tabulation of descriptors in Tables 7 - 9 in Appendix.

                  Total   Average   Min   Max
Spatial
  Co-Location        51       6.3     0    14
Observational
  Attribute          99      12.2     3    22
Confidence
  Unexpected          7       0.7     0     4
  Uncertainty        28       3.3     0     8
Total               185      22.6     7    39

Table 4: Object descriptor summary with counts per annotator (A2 - A4 excluded from average, min, and max; see footnote 4)

identification is counter-intuitive and detrimental to creative visual storytelling. Annotators may have identified only the objects of interest to the narrative they were forming, and viewed other objects as distractors. The most frequent of the identified objects are likely to be the most influential in T2, where the calendar, computer, and chair provide more information than the "blue triangles".

Not only can selective object identification provide the most salient objects for deriving interpretations, but the Object Identification exercise with respect to storytelling can differentiate between objects and descriptors that are commonplace or otherwise irrelevant. For instance, if a fire extinguisher was not annotated as red, we are inclined to deduce it is because this fact is well known or unimportant, rather than the result of a distracted annotator.5

When automating this task-module, new object identification algorithms should account for the following: a sampling of relevant objects specific to the storytelling challenge, and attention to potential outlier descriptors which may be more indicative than a standard descriptor, depending on the environment.

4.2 Single-Image Inferencing (T2)

We highlight A1 and A8 for the remainder of the discussion.6 Table 5 shows A1's annotation of Single-Image Inferencing and Multi-Image Narration. In the Single-Image Inferencing (T2) for image1, A1 noted the "office" theme by referencing the desk and computer, and expressed uncertainty with respect to the window looking "weird" and unlike a typical office building. A1 kept clear

5 We expect this to be revealed in the web-based version of the task with a stricter annotation interface.

6 Other annotation results in Tables 10 - 17 in Appendix.


Image1
  Single-Image Inference: Looks like a dingy, sparse office. The computer desk, calendar indicate an office, but the space is unfinished (no dry wall, carpet) and area outside window looks weird, not like an office building.
Image2
  Single-Image Inference: Looks like someone was staying here temporarily, using this now to store clothes, or maybe as a bedroom. Again, it's atypical because its an unfinished space that looks uncomfortable.
  Multi-Image Narrative: I think this person was hiding out here to get ready for some event. The space isn't finished enough to be intended for habitation, but someone had to stay here, perhaps because they didn't want to be found, and you wouldn't expect someone to be living in a construction zone.
Image3
  Single-Image Inference: This area was used as a sort of kitchen or food storage prep area.
  Multi-Image Narrative: Someone was definitely living here even though it wasn't finished or intended to be a house. They were probably using a crock pot because you can make food in this without having larger appliances like a stove, oven. There's no milk, so this person may be lactose intolerant. The robot should vanquish them with milk.

Table 5: A1’s annotation (previously identified objects in Single-Image Inference text in italics)

Image1
  Single-Image Inference: This is likely a workplace of some sort. It is unclear if it is an unfinished part of a current/suspended construction project or it is just a utilitarian space inside of an industrial facility. The presence of a computer monitor suggest it is in use or a low crime area.
Image2
  Single-Image Inference: This is a jobsite of some sort. It has unfinished walls and what may be a paper shredder.
  Multi-Image Narrative: This is an unfinished building. There is some evidence of office-type work (i.e. work involving paper and computers). The existence of "windows" between rooms suggests that this is not a dwelling (or intended to become one), that is, a building designed to be a dwelling, but what it is remains unclear.
Image3
  Single-Image Inference: A room in a building is being used as a cooking and eating station, based upon presence of food, table, and cooking instruments.
  Multi-Image Narrative: This building is being used by a likely small number of individuals for unclear purposes including cooking, eating, and basic office work.

Table 6: A8’s annotation (previously identified objects in Single-Image Inference text in italics)

the distinction between images in their annotation of image2, as there were no references to the office observed only in image1. Instead, references in image2 were to the storage of clothes. In the single-image interpretation of image3, A1 suggested that this was a food preparation area from the presence of the crockpot, cereal, and the other food items that appeared together. A8, whose annotation is in Table 6, also noted the "workplace" theme from the desk and computer, though A8 leaned more towards a construction site, citing the utilitarian space. Due to uncertainty of the environment, A8 misidentified the suitcase in image2 as a shredder, and incorporated it prominently into their interpretation. Similar to A1, A8 also indicated in image3 that this was a food preparation area.

A8's misinterpretation of the suitcase raises an implementation question: are the inferences and algorithms we develop only as good as our environment data allows them to be? How might a misunderstanding of the environment affect the inferences? This environment showcased the uniqueness of the physical space and the low quality of the images, yet all annotators indicated, without prompting or instruction, varying degrees of confidence in their interpretations based upon the evidence. A8 indicated their uncertainty about the suitcase object by hedging that it was "what may be a paper shredder". This expression of uncertainty should be preserved in an automated system for instances such as this when an answer is unknown or has a low confidence level.

T2 is intended to inform a commonsense reasoner and knowledge base based on T1 to deduce the setting. This task-module describes functions of rooms or spaces, e.g., food preparation areas and office space. Additional interpretations about the space were made by annotators from the overall appearance of objects in the image, such as the


atmospheric observation "lighting of rooms is not very good" (A7, Table 15 in Appendix). These inferences might not be easily deducible from T1 alone, but the combination of these task-modules allows for these to occur.

Evaluating this annotation in a computational system will require some ground truth, though we have previously stated that it is impossible to claim such a gold standard in a creative storytelling task. Evaluation must therefore be subject to both qualitative and quantitative analyses, including, but not limited to, commonsense reasoning on validation sets and determining plausible alternatives to commonsense interpretations.

4.3 Multi-Image Narration (T3)

The narrative begins to form across the first two images in the Multi-Image Narration task-module (T3). A1 hypothesized that someone was "hiding out", going a step beyond their T2 inference of an "office space" in image1, to extrapolate "what has happened here" rather than "what happens here". In image2, A1 had hedged their narrative with "I think", but the language became stronger and more confident in image3, in which A1 "definitely" thought that the space was inhabited. A1 pointed out that a lack of milk was unexpected in a canonical kitchen, and supplemented their narrative with a joke, suggesting to "vanquish them with milk". In image2, A8 interpreted that the space was not intended for long-term dwelling. Their narrative shifted in image3 when another scene was revealed. A8 concluded that this space was inhabited by a group, despite the annotator's previous assumption in image2 that it was not suited for this purpose.

There is no guaranteed "correct" narrative that unfolds, especially if we are seeking creativity. Some narrative pieces may fall into place as additional images provided context, but in the case of these environments, annotators were challenged to make sense of the sequence and pull together a plausible, if not uncertain, narrative.

The narrative goal and audience aspects of creative visual storytelling will directly inform T3. A variety of creative narratives and interpretations emerged from this pilot, despite the particularly sparse and odd environment and the openness of the narrative goal. Based on the responses from each successive task-module, all annotators' interpretations and narratives are correct. Even with annotator misunderstandings, the narratives presented were their own interpretation of the environment. As the audience in this task was not specified, annotators could use any style to tell their story. The data collected expressed creativity through jokes (A1), lists and structured information (A5), concise deductions (A6, A8), uncertain deductions (A4), first person (A1, A3, A5), omniscient narrators (A2), and the use of "we" inclusive of the robot navigating the space (A7, A9, A10).

Future annotations may assign an audience or a style prompt in order to observe the varied language use. This will inform computational models by curating stylistic features and learning from appropriate data sources.

5 Related Work

Visual storytelling is still a relatively new subfield of research that has not yet begun to capture the highly creative stories generated by text-based storytelling systems to date. The latter supports the definition of specific goals or presents alternate narrative interpretations by generating stories according to character goals (e.g., Meehan (1977)) and author goals (e.g., Lebowitz (1985)). Other interactive, co-constructed, text-based narrative systems make use of information retrieval methods by implicitly linking the text generation to the interpretation. As a result, systems incorporating these methods cannot be adjusted for different narrative goals or audiences (Cychosz et al., 2017; Swanson and Gordon, 2008; Munishkina et al., 2013).

Other research in text-based storytelling focuses on answering the question "what happens next?" to infer the selection of the most appropriate next sentence. This method indirectly relies on the selection of sentences to evaluate the results of a forced choice between the "best" or "correct" next sentence of the choices when given a narrative context (as in the Story Cloze Test (Mostafazadeh et al., 2016) and the Children's Book Test (Hill et al., 2015)). Our pipeline, by contrast, builds on a series of open-ended questions, for which there is no single gold-standard or reference answer. Instead, we expect in time to follow prior work by Roemmele et al. (2011), where evaluation will entail generating and ranking plausible interpretations.

Recent work on caption generation combines computer vision with a simplified narration, or single-sentence text description of an image (Vinyals


et al., 2015). Image processing typically takes place in one phase, while text generation follows in a second phase. Superficially, this separation of phases resembles the division of labor in our approach, where T1 and T2 involve image-specific analysis, and T3 involves text generation. However, this form of caption generation depends solely on training data where individual images are paired with individual sentences. It assumes the T3 sub-task-modules can be learned from the same data source, and generates the same sentences on a per-image basis, regardless of the order of images. One can readily imagine the inadequacy of stringing together captions to construct a narrative, where the same captions describe both images of a waterfall flowing down, and those same images in reverse order where instead the water seems to be flowing up.

The work most similar in approach to our visual storyteller annotation pipeline is Huang et al. (2016), who separate their tasks into three tiers: the first over single images, generating literal descriptions of images in isolation (DII); the second over multiple images, generating literal descriptions of images in sequence (DIS); and the third over multiple images, generating stories for images in sequence (SIS). While these tiers may seem analogous to ours, there are different assumptions underlying the tasks in data collection. For each task, their images are annotated independently by different annotators, while in our approach, all images are annotated by annotators performing all of our tasks. The DII task is an exhaustive object identification task on single images, yet we leave T1 up to our annotators to determine how many objects and attributes to describe in an image, to avoid the potential for object over-identification. The SIS task involves a set of images over which annotators select and possibly reorder, then write one sentence per image to create a narrative, with the opportunity to skip images. In our pipeline, we have intentionally designed our task-modules to allow for the possibility of one task-module to build off of and influence one another. It is possible in our approach for an annotator's inference in T2 of one image to feed forward and affect their T1 annotations in the subsequent image, which might in turn affect the resulting T3 narrative. In short, Huang et al. (2016) capture the thread of storytelling in one tier only, their SIS condition, while our annotators build their narratives across task-modules as they progress from image to image.

6 Conclusion and Future Work

This paper introduces a creative visual storytelling pipeline for a sequence of images that delegates separate task-modules for Object Identification, Single-Image Inferencing, and Multi-Image Narration. These task-modules can be implemented to computationally describe diverse environments and customize the telling based on narrative goals and different audiences. The pilot annotation has collected data for this visual storyteller in a low-resourced environment, and analyzed how creative visual storytelling is performed in this pipeline for the purposes of training a computational, creative visual storyteller. The pipeline is grounded in narrative decision-making processes, and we expect it to perform well on both low- and high-quality datasets. Using only curated datasets, however, runs the risk of training algorithms that are not general use.

We are now positioned to conduct a crowdsourcing annotation effort, followed by an implementation of this storyteller following the outlined task-modules for automation. Our pipeline and implementation detail are algorithmically agnostic. We anticipate off-the-shelf and state-of-the-art computer vision and language generation methodologies will provide a number of baselines for creative visual storytelling: to test environments, compare an object identification algorithm trained on high-quality data against one trained on low-quality data; to test narrative goals, compare a computer vision algorithm that may over-identify objects against one focused on a specific set to form a story; to test audience, compare a caption generation algorithm that may generate generic language against one tailored to the audience desires.

The streamlined approach of our experimental annotation pipeline allows us to easily prompt for different narrative goals and audiences in future crowdsourcing to obtain and compare different narratives. Evaluation of the final narrative must take into consideration the narrative goal and audience. In addition, evaluation must balance the correctness of the interpretation with expressing creativity, as well as the grammaticality of the generated story, suggesting new quantitative and qualitative metrics must be developed.


References

Jennifer G. Bohanek, Kelly A. Marin, Robyn Fivush, and Marshall P. Duke. 2006. Family narrative interaction and children's sense of self. Family Process, 45(1):39–54.

Margaret Cychosz, Andrew S. Gordon, Obiageli Odimegwu, Olivia Connolly, Jenna Bellassai, and Melissa Roemmele. 2017. Effective scenario designs for free-text interactive fiction. In International Conference on Interactive Digital Storytelling, pages 12–23. Springer.

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301.

Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233–1239.

Michael Lebowitz. 1985. Story-telling as planning and learning. Poetics, 14(6):483–502.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.

Matt Madden. 2006. 99 Ways to Tell a Story: Exercises in Style. Random House.

James R. Meehan. 1977. TALE-SPIN, an interactive program that writes stories. In IJCAI, volume 77, pages 91–98.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of NAACL-HLT, pages 839–849.

Larissa Munishkina, Jennifer Parrish, and Marilyn A. Walker. 2013. Fully-automatic interactive story design from film scripts. In International Conference on Interactive Digital Storytelling, pages 229–232. Springer.

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS).

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 2641–2649. IEEE.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90–95.

Reid Swanson and Andrew S. Gordon. 2008. Say Anything: A massively collaborative open domain story writing companion. In Joint International Conference on Interactive Digital Storytelling, pages 32–40. Springer.

Avril Thorne. 1987. The press of personality: A study of conversations between introverts and extraverts. Journal of Personality and Social Psychology, 53(4):718.

Avril Thorne and Kate C. McLean. 2003. Telling traumatic events in adolescence: A study of master narrative positioning. Connecting Culture and Memory: The Development of an Autobiographical Self, pages 169–185.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE.


Appendix: Additional Annotations

Object (# annotators): descriptor text (# annotators, descriptor category)

Calendar (10):
  hanging off the table, taped to table top (4, co-location)
  marked up, ink, red circle on calendar, marked with pen (4, attribute)
  foreign language (1, attribute)
  paper (1, attribute)
  picture on top (1, attribute)
Water bottle (10):
  on the floor, on ground, on floor (3, co-location)
  to the right of table (1, co-location)
  mostly empty, unclear if it has been opened (2, attribute)
  plastic (2, attribute)
  closed with lid (1, attribute)
Computer (9):
  screen, black turned off; monitor, black (3, attribute)
Table / Desk (9):
  has computer on it, [computer] on table (2, co-location)
  gray, black (2, attribute)
  wood (2, attribute)
  metal (1, attribute)
  (presumed) rectangular (1, uncertainty)
Chair (8):
  folding (6, attribute)
  metal (5, attribute)
  grey (3, attribute)
Walls (4):
  wood (1, attribute)
  unfinished and showing beams, unfinished construction (3, unexpected)
Window (4):
  in wall behind chair (1, co-location)
  window to another room; perhaps chairs in other room (1, uncertainty)
  no glass (1, unexpected)
Blue triangles (2):
  blue objects in windowsill (1, unexpected)
Floor (1):
  unfinished (1, unexpected)
Praying rug (1)

Table 7: Object Identification for image1

Object (# annotators): descriptor text (# annotators, descriptor category)

Suitcase (10):
  black, orange stripes; black and red; black with red trim; blue and copper (5, attribute)
  (not entirely sure) because of zipper item and size (1, uncertainty)
  resembles a paper shredder (1, uncertainty)
  a suitcase or a heater (1, uncertainty)
Shirt (10):
  on hanger; on fire extinguisher; on wall; hanging (6, co-location)
  black (5, attribute)
  long sleeves (2, attribute)
  black thing hanging on wall (unclear what it is); black object (2, uncertainty)
Sign (8):
  on the wall (4, co-location)
  maybe indicating '3'?; roman numerals; 3 dashes; Arabic numbers; foreign language; room number 111 (6, attribute)
  poster (1, attribute)
  map or blueprints (1, uncertainty)
Green object (6):
  spherical (1, attribute)
  hanging in window; in windowsill (2, co-location)
  green thing outside room; green object; unidentifiable object; lime green object (4, uncertainty)
  light post? fan? (1, uncertainty)
Fire extinguisher (5):
  hanging off of black thing (also unclear as to what this is or does); on wall (3, co-location)
  obscured (1, co-location)
  cylindrical (1, attribute)
  white and red thing; red object, white and red piece of object (3, uncertainty)
Wall (4):
  wooden (1, attribute)
  unfinished; visible plywood studs (2, unexpected)
Bag (3):
  backpack or bag; something round; pile of clothes (2, uncertainty)
  on the ground (2, co-location)
  next to suitcase (1, co-location)
Floor (2):
  marking of industry grade particle board, unfinished (2, attribute)
Window (2)
Coat hanger (1):
  hanging on wall (1, co-location)
  wire (1, attribute)
  white (1, attribute)
Shoes (1):
  shoes or hat (1, uncertainty)
Rug (1)

Table 8: Object Identification for image2


Object (# annotators): descriptor text (# annotators, descriptor category)

Crockpot (7):
  on table (1, co-location)
  green (2, attribute)
  old fashioned (1, attribute)
  kitchen appliance (1, attribute)
  white or silver (1, uncertainty)
Cereal box (7):
  on table (1, co-location)
  to the right of crockpot (1, co-location)
  shredded wheat (4, attribute)
  cardboard (1, attribute)
  printed black letters (1, attribute)
Table (7):
  wood (3, attribute)
  coffee table style (2, attribute)
  pale (1, attribute)
Pan (6):
  on ground; on floor (3, co-location)
  blue handle (1, attribute)
  medium-size (1, attribute)
Container (5):
  clear (2, attribute)
  plastic (3, attribute)
  empty (2, attribute)
  rectangular (1, attribute)
  hinged top (1, attribute)
Walls (2):
  lined with paper (1, attribute)
Label (2):
  on pressure cooker (2, co-location)
  white (1, attribute)
Thread and needle (1):
  to the right of cereal box (1, co-location)
Coffee pot (1):
  what looks like a coffee pot (1, uncertainty)
  empty (1, attribute)
  behind cereal box (1, attribute)
Jam (1):
  plaid red and white lid (1, attribute)
Door frame (1)

Table 9: Object Identification for image3

Image1
  Single-Image Inferencing: Someone sits at table and puts water bottle on floor while perhaps taking notes for others in some room. Folding chair suggests temporary or new use of space while building under construction.
Image2
  Single-Image Inferencing: Hallway view, suggesting exit path where someone might leave luggage while being in building
  Multi-Image Narration: Same building as in the first scene because same type of wood for walls, floor, and opening/window construction. Arabic numbers on paper sign loosely attached (because wavy surface of paper e.g. not rigid, not laminated) to the wall suggests temporary designation of space for specific use, as an organized arrangement by some people for others.
Image3
  Single-Image Inferencing: N/A
  Multi-Image Narration: N/A

Table 10: A2’s annotation

Image1
  Single-Image Inferencing: I believe this is an office, because there is a computer monitor on a table, the table is serving as a desk, and there is a metal chair next to the monitor and the desk. A calendar is typically found in an office, however the calendar here is not in a location that is convenient for a person
Image2
  Single-Image Inferencing: I believe that this is a standard room that serves as a storage area. The absence of other objects does not hint at this room serving any other purpose.
  Multi-Image Narration: I believe that this storage room is located in a home since personal items such as a luggage bag and spare shirt are not typically found in a public building. From the marked calendar in the previous picture, it appears that the occupants are preparing to travel very soon.
Image3
  Single-Image Inferencing: N/A
  Multi-Image Narration: N/A

Table 11: A3’s annotation


Image1
  Single-Image Inferencing: An office or computer setting/workstation. Actual computer maybe under desk (not visible) or missing. Water bottle suggests someone used this space recently. Chair not facing desk suggests person left in a hurry (not pushed under desk). Red circled date suggests some significance.
Image2
  Single-Image Inferencing: Shirt and suitcase suggests someone stored their personal items in this space. Room being labeled suggests recent occupants used more than 1 part of this space. Space does not look comfortable, but personal effects are here anyway. Holiday?
  Multi-Image Narration: Someone camped out here and planned activities. They left in a hurry and didn't spend time putting things in their suitcases, or they had a visitor and the visitor left abruptly. The occupant may have left on the date marked in the calendar. The date may have had personal significance for an operation.
Image3
  Single-Image Inferencing: N/A
  Multi-Image Narration: N/A

Table 12: A4’s annotation

Image1
  Single-Image Inferencing: Office (chair, desk, computer, calendar). Unfinished building (walls, floor, window)
Image2
  Single-Image Inferencing: In an unfinished building closet, common space. Things thrown to the side. Doesn't care much about office safety because fire extinguisher is covered, therefore not easily accessible.
  Multi-Image Narration: Not sure about either workzone because randomly placed clothes and unsafe work environment. Could be a factory with unsafe conditions. Someone living or storing clothes in a "break room"?
Image3
  Single-Image Inferencing: "Camp" site but not outdoors. Items on floor indicate some disarray or disregard for cleanliness. Why is the crock pot on the coffee table with cereal? Breakfast? But why are the walls strange?
  Multi-Image Narration: Food like this shouldn't appear in a safe work environment, so I no longer think that. Someone seems to be living here in an unsafe and probably unregulated (re: fire extinguisher) way. Someone is hiding out in an uninhabited warehouse or work site (walls, floors, windows)

Table 13: A5’s annotation

Image1
  Single-Image Inferencing: This is an office space because there is a desk, chair, computer and calendar. These items are typical items that would be in an office space.
Image2
  Single-Image Inferencing: This looks like a storage space, a closet, or the entrance/exit to a building. People typically pile things such as a suitcase, hanging clothes, backpack, etc. at one of those locations. A storage space or closet would allow for the items to be stored for a long time but would also be due to people being ready to leave on travel.
  Multi-Image Narration: Due to the lack of decorations I would say these pictures were taken in a location where people were staying or working temporarily (like a headquarters safe house, etc.)
Image3
  Single-Image Inferencing: These are items that would typically be found in a kitchen or break area. You would see a table or counter in a kitchen or break room. The pan and crock pot are not items that would be seen in other rooms, like a living room, office, bathroom, bedroom.
  Multi-Image Narration: I would say this is a house or temporary space because the items are not organized and the surrounding area is not decorative. The scenes look messy and it doesn't look like it gets cleaned or has been cleaned recently. Plus the space contains a suitcase which gives the impressions that the person has not unpacked.

Table 14: A6’s annotation


Image1
  Single-Image Inferencing: Looks like a place to work, with chair, table, monitor. Calendar is out of place because people don't have calendars from the edge of a table, so it can only be seen from the floor. Walls are unfurnished, only wood and plywood. A window in the wall, like an interior window. Not sure what it is a window for - why is the window in that location?
Image2
  Single-Image Inferencing: We are looking through a doorway or hallway. Shirt and suitcase belong together. Not sure what other objects are (green, red, black on ground).
  Multi-Image Narration: Might be same location as Image 1, because the wooden/plywood walls and floor are similar. Not sure what the images have to do with each other, but might be 2 different rooms in same location. We're viewing this image from another room, because this room has a poster in it. Lighting of rooms is not very good, almost looks like spot lights, so not like an ordinary, prototypical house.
Image3
  Single-Image Inferencing: A bunch of objects on a table, with a few objects underneath. The objects on/under the table all have to do with food or preparing food. Walls are light colored. In the foreground appears to be a wooden door jam. Although there are some kitchen items, this does not look like a typical kitchen
  Multi-Image Narration: It is difficult to tell if this is in the same location as the previous 2 images. The wood door jam might be the same, but hard to know if wall is plywood and we don't see any other wooden framing. Rooms from all 3 images don't appear connected physically. No understandable context or connections.

Table 15: A7’s annotation

Image Single-Image Inferencing Multi-Image NarrationImage1 This looks like a make-shift room or space. Has a

military of intel feel to it. Could be a briefing or aninterrogation room. Given the prayer rug, definitelyinteraction between parties of different backgrounds,etc.

Image2 This view or room reflects living quarters. Given thenature of the condition of the wall, it is a make-shift.The existence of a number identifying this room in-dicates that it is one of many.

Combining the 2 pictures, this is beginning to looklike part of a structure used for military/intel pur-poses. The location is most likely somewhere in theMiddle East given how the numbers are written inHindi indicating Arabic language. This also meanswe have multi-party/individual interactions.

Image3 This picture has all the ingredients to presentinga kitchen: food and cookware leads to a kitchen.Given the “rough” look of the setting, this has thehallmarks of a make-shift kitchen.

This confirms, more than anything else, the scenariodescribed in picture 2. As a whole, looks like somesort of post or output or a make-shift temporary type.Only necessities are present and the place couldn’tquickly be abandoned.

Table 16: A9’s annotation

Image Single-Image Inferencing Multi-Image NarrationImage1 This was probably used as a workspace, given the

chair and table with the monitor and the calendar.Someone was recently there because the bottle is up-right.

Image2 This was a space that someone lived in given theclothes, fan(?), heater/suitcase(?). Given the mess,they left abruptly. The fire extinguisher indicates apresence because it is a safety aid.

This suggests we’re in a space occupied by some-one because of the office type and “living room” typeroom setup. It was purposefully made and left veryabruptly (messy clothes, chair not pushed in).

Image3 This seems to be a kitchen area because all objectsare food related. It is messy. The rice cooker has ablue light and may be on. There is a window lettingin light, visible on the back wall.

This supports the assumption that the environmentwas recently occupied. Food is opened, rice cookeris on, mess suggests it was abruptly abandoned,much like image 2’s mess. The robot appears to be Ithe doorway at an angle.

Table 17: A10’s annotation


Proceedings of the First Workshop on Storytelling, pages 33–42, New Orleans, Louisiana, June 5, 2018. ©2018 Association for Computational Linguistics

Telling Stories with Soundtracks: An Empirical Analysis of Music in Film

Jon Gillick
School of Information
University of California, Berkeley
[email protected]

David Bamman
School of Information
University of California, Berkeley
[email protected]

Abstract

Soundtracks play an important role in carrying the story of a film. In this work, we collect a corpus of movies and television shows matched with subtitles and soundtracks and analyze the relationship between story, song, and audience reception. We look at the content of a film through the lens of its latent topics and at the content of a song through descriptors of its musical attributes. In two experiments, we find first that individual topics are strongly associated with musical attributes, and second, that musical attributes of soundtracks are predictive of film ratings, even after controlling for topic and genre.

1 Introduction

The medium of film is often taken to be a canonical example of narrative multimodality: it combines the written narrative of dialogue with the visual narrative of its imagery. While words bear the burden alone for creating compelling characters, scenes, and plot in textual narratives like literary novels, in film, this responsibility is shared by each of the contributors, including the screenwriter, director, music supervisor, special effects engineers, and many others. Working together to support the overall story tends to make for more successful component parts; the Academy Award winner for Best Picture, for example, often also collects many other awards and nominations, including acting, cinematography, and sound design.

While film has recently been studied as a tar-get in natural language processing and computervision for such tasks as characterizing gender rep-resentation in dialogue (Ramakrishna et al., 2015;Agarwal et al., 2015), inferring character typesfrom plot summaries (Bamman et al., 2013), mea-suring the memorability of phrasing (Danescu-Niculescu-Mizil et al., 2012), question answering

(Guha et al., 2015; Kocisky et al., 2017), naturallanguage understanding (Frermann et al., 2017),summarization (Gorinski and Lapata, 2015) andimage captioning (Zhu et al., 2015; Rohrbachet al., 2015, 2017; Tapaswi et al., 2015), themodalities examined are almost exclusively lim-ited to text and image. In this work, we presenta new perspective on multimodal storytelling byfocusing on a so-far neglected aspect of narrative:the role of music.

We focus specifically on the ways in which soundtracks contribute to films,1 presenting a first look from a computational modeling perspective into soundtracks as storytelling devices. By developing models that connect films with musical parameters of soundtracks, we can gain insight into musical choices both past and future. While a great film score is in part determined by how well it fits with the context of the story (Byrne, 2012), we are also interested in uncovering musical aspects that, in general, work better in support of a film.

To move toward understanding both what makes a film fit with a particular kind of song and what musical aspects can broadly be effective in the service of telling a story, we make the following contributions:

1. We present a dataset of 41,143 films paired with their soundtracks. Metadata for the films is drawn from IMDB, linked to subtitle information from OpenSubtitles2016 data (Lison and Tiedemann, 2016), and soundtrack data is linked to structured audio information from Spotify.

2. We present empirical results demonstrating the relationship between audio qualities of the soundtrack and viewers' responses to

1 We use the word film to refer to both movies and television shows interchangeably.


the films they appear in; soundtracks with more instrumental or acoustic songs generate higher ratings; "danceable" songs lower the average ratings for films they appear in.

3. We present empirical results demonstrating the relationship between the topics that make up a film script and the audio qualities of the soundtrack. Films with settings in high school or college, for example, tend to have electric instrumentation and singing; soundtracks with faster tempos appear both in films about zombies and vampires and in films in which the word dude appears frequently.

2 The Narrative Role of Music in Film

The first films appeared around 1890, before the development of technology that enabled synchronization of picture with sound (Buhler et al., 2010). While silent films featured no talking or music in the film itself, they were often accompanied by music during live performances in theatres. Rather than playing set scores, these live accompaniments were largely improvised; practical catalogues for such performances describe the musical elements appropriate for emotions and narrative situations in the film (Becce, 1919; Erdmann et al., 1927). For example, Lang and West (1920) note that a string accompaniment with tremolo (trembling) effect is appropriate for "suspense and impending disaster"; an organ tone with heavy pedal is appropriate for "church scenes" and for generally connoting "impressive dignity"; flutes are fitting for conveying "happiness," "springtime" or "sunshine."

With the rise of talkies in the late 1920's (Slowik, 2012), music could be incorporated directly into the production of the film, and was often composed specifically for it; Gorbman (1987) describes that in the classical model of film production, scored music is "not meant to be heard consciously," primarily acts as a signifier of emotion, and provides referential and narrative cues, such as establishing the setting or character. The use of Wagnerian leitmotif, the repeated association of a musical phrase with a film element such as a character, is common in original scores, especially in those for epic films (Prendergast, 1992).

Works from the "Golden Age" of film music (the period between 1935–1950, shortly after the rise of synchronized sound) set the standard for cinematic scoring practices and have been extensively analyzed in the film music literature (Slowik, 2012). Following this period, with the rise of rock and roll, popular music began to make its way into film soundtracks in addition to the scores written specifically for the movie. As Rodman (2006) points out, this turn coincided with directors seeing the potential for songs to contribute to the narrative meaning of the film:

In The Blackboard Jungle, Bill Haley's rock and roll anthem, 'Rock Around the Clock,' was used in the opening credits, not only to capture the attention of the teenage audience, but also to signify the rebellious energy of teenagers in the 1950s. . . . The Graduate relied upon the music and poetry of Simon and Garfunkel to portray the alienation of American youth of the 1960s. Easy Rider took a more aggressive countercultural stance by using the rock music of Hoyt Axton, Steppenwolf, The Byrds, and Jimi Hendrix to portray the youth rebellion in American society, complete with communes, long hair and drugs (Rodman, 2006, 123)

In recent years, the boundaries between popular music and film music in the traditional sense have become increasingly blurred, pushed forward especially by more affordable music production technology, including synthesizers and pre-recorded samples that allow a broad range of composers to use sounds previously reserved for those with access to a full orchestra (Pinch et al., 2009). Though electronic music pioneers like Wendy Carlos have been composing for film since the late 1960's (Pinch et al., 2009), pop and electronic musicians have only gradually been recognized as film composers in their own right, with Daft Punk's original score for Tron: Legacy in 2010 marking a breakthrough into the mainstream (Anderson, 2012).

3 Data

In order to begin exploring the relationship between films and their soundtracks, we gather data from several different sources. First, we draw on the OpenSubtitles2016 data of parallel texts for film subtitles (Lison and Tiedemann, 2016); this dataset includes scripts for a wide variety of


movies and episodes of television shows (106,609 total in English) and contains publicly available subtitle data. Each film in the OpenSubtitles2016 data is paired with its unique IMDB identifier; using this information, we extract IMDB metadata for the film, including title, year of release, average user rating (a real number from 0-10), and genre (a set of 28 categories, ranging from drama and comedy to war and film-noir).

Most importantly, we also identify soundtrack information on IMDB using this identifier; soundtracks are listed on IMDB in the same form as they appear in the movie/television credits (generally also in the order of appearance of the song). A typical example is the following:

Golden Slumbers
Written by John Lennon and Paul McCartney
Performed by Jennifer Hudson
Jennifer Hudson appears courtesy of Epic Records
Produced by Harvey Mason Jr.

This structured format is very consistent across films (owing to the codification of the appearance of this information in a film's closing credits, which is thereby preserved in the user transcription on IMDB2). For each song in a soundtrack for a film, we extract the title, performers and writers through regular expressions (which are precise given the structured format).
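The paper does not publish its extraction code; the following is a minimal sketch of how such a credit block could be parsed, assuming the field labels shown in the example above (the function and pattern names are illustrative, not the authors' implementation).

import re

# Field labels assumed from the example credit block above.
CREDIT_PATTERNS = {
    "writers": re.compile(r"^Written by (.+)$", re.MULTILINE),
    "performers": re.compile(r"^Performed by (.+)$", re.MULTILINE),
    "producers": re.compile(r"^Produced by (.+)$", re.MULTILINE),
}

def parse_credit(block: str) -> dict:
    """Split one IMDB soundtrack credit into title, writers, performers, producers."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    song = {"title": lines[0] if lines else ""}
    for field, pattern in CREDIT_PATTERNS.items():
        match = pattern.search(block)
        song[field] = match.group(1).strip() if match else None
    return song

example = """Golden Slumbers
Written by John Lennon and Paul McCartney
Performed by Jennifer Hudson
Produced by Harvey Mason Jr."""
print(parse_credit(example))
# {'title': 'Golden Slumbers', 'writers': 'John Lennon and Paul McCartney', ...}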

We then identify target candidate matches for a source soundtrack song by querying the public Spotify API for all target songs in the Spotify catalogue with the same title as the source song in the IMDB soundtrack. The names of performers are not standardized across datasets (e.g., IMDB may list an artist as The Velvet Underground, while Spotify may list the same performance as The Velvet Underground and Nico). To account for this, we identify exact matches between songs as those that share the same title and where the longest common substring between the source and target performers spans at least 75% of the length of either entity; if no exact match is found, we identify the best secondary match as the target song with the highest Spotify popularity among target candidates with the same title as the source. In the example above, if this particular performance

2 https://help.imdb.com/article/contribution/titles/soundtracks/GKD97LHE9TQ49CZ7

of Golden Slumbers by Jennifer Hudson (from the movie Sing) were not in Spotify's catalogue, it would match the performance by The Beatles on Abbey Road.
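As a rough illustration of this matching rule, the sketch below implements the longest-common-substring criterion (reading "either entity" as at least one of the two performer strings) and the popularity fallback. The candidate record fields ("title", "artist", "popularity") are assumed stand-ins for what a Spotify title search returns; this is not the authors' code.

from difflib import SequenceMatcher

def longest_common_substring(a: str, b: str) -> int:
    a, b = a.lower(), b.lower()
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size

def match_song(source_performer: str, candidates: list):
    """Pick the Spotify candidate for one soundtrack song; candidates share the title."""
    exact = [
        c for c in candidates
        if longest_common_substring(source_performer, c["artist"])
        >= 0.75 * min(len(source_performer), len(c["artist"]))
    ]
    if exact:
        return exact[0]
    # No exact performer match: fall back to the most popular same-title candidate.
    return max(candidates, key=lambda c: c["popularity"], default=None)

candidates = [
    {"title": "Golden Slumbers", "artist": "The Beatles", "popularity": 70},
    {"title": "Golden Slumbers", "artist": "Jennifer Hudson", "popularity": 55},
]
print(match_song("Jennifer Hudson", candidates)["artist"])  # Jennifer Hudson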

Spotify provides a number of extracted audio features for each song; from a set of 13 we chose 5 that we hypothesized would be predictive of viewer preferences and whose descriptors are also interpretable enough to enable discussion. Those that we include in our analysis are the following, with descriptions drawn from Spotify's Track API:3

• Mode. "Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0."

• Tempo. "The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration." In the raw data, these range from 36 to 240; we divide by the maximum value of 240 to give a range between 0.15 and 1.

• Danceability. "Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable."

• Instrumentalness. "Instrumentalness predicts whether a track contains no vocals. 'Ooh' and 'aah' sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly 'vocal.' The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0."

• Acousticness. "A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic."

The dataset totals 189,340 songs (96,526 unique) from 41,143 movies/television shows,

3 https://developer.spotify.com/web-api/get-audio-features/


[Figure 1: Change in audio features over time, 1950–2015. Panels: (a) average danceability, (b) average tempo, (c) average instrumentalness, (d) average acousticness, (e) average mode (major = 1). Films and TV shows in the late 1980s peaked for danceable soundtracks, with electric instrumentation and singing. Each plot displays the average value for that feature, with 95% bootstrap confidence intervals.]

along with a paired script for each movie/show. Figures 1 and 2 provide summary statistics of this dataset (using only the metadata and audio features) and begin to demonstrate the potential of this data. Figure 1 illustrates the change in the average value of each feature between 1950–2015. Soundtracks featuring acoustic songs naturally decline over this time period with the rise of electric instruments; as time progresses, soundtracks feature quicker tempos and include more songs in minor keys. The 1980s in particular are peaks for danceable soundtracks, with electric instrumentation and voice, while the 1990s appear to react against this dominance by featuring songs with comparatively lower danceability, higher acoustic instrumentation, and less singing.
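Per-year means with 95% bootstrap confidence intervals, as plotted in Figure 1, can be computed along the following lines; the DataFrame column names ("year", "danceability") are illustrative assumptions rather than details taken from the paper.

import numpy as np
import pandas as pd

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a 1-D array of feature values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def yearly_feature_summary(df, feature="danceability"):
    """One row per year: mean of the audio feature plus bootstrap CI bounds."""
    rows = []
    for year, group in df.groupby("year"):
        lo, hi = bootstrap_ci(group[feature])
        rows.append({"year": year, "mean": group[feature].mean(),
                     "ci_low": lo, "ci_high": hi})
    return pd.DataFrame(rows)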

Figure 2, in contrast, displays variations in selected audio features across genres. Movies and television shows tagged by IMDB users with genre labels of "war" and "western" tend to have songs that are slow and in major keys, whereas game shows and adult films more often have faster songs in minor keys. We can also see from Figure 2 that different audio characteristics can have different amounts of variation across genres; mode varies more with genre than tempo does.

4 Analysis

We present two analyses here shedding some light on the role that music plays in the narrative form of films: demonstrating the relationship between fine-grained topics in a film's script and the specific audio features described in §3 above; and measuring the impact of audio features in the soundtrack on the reception to the storytelling in the form of user reviews, attempting to control for the topical and generic influence of the script using topically-coarsened exact matching in causal inference (Roberts et al., 2016).

4.1 Topic analysis of audio features

While we can expect to see trends in the music employed in films over time and across genres, these surface descriptors do not tell us about the actual contents of the film. What kind of stories have soundtracks in minor keys? Is there a relationship between the content of a movie or television show and the tempo of its soundtrack? We investigate this by regressing the content of the script against each audio feature; rather than representing the script by the individual words it contains, we seek instead to uncover the relationship between broad thematic topics implicit in the script and those features.


[Figure 2: Top and bottom 5 film genres in terms of average tempo and mode. Heights of the bars represent percentage difference from the mean across the entire dataset. Actual values are at the tops of the bars. (a) Top and bottom 5 film genres with the fastest and slowest soundtracks; blue indicates faster tempos and red indicates slower tempos. (b) Top and bottom 5 film genres whose soundtracks are most major and most minor; blue indicates major and red indicates minor.]

To do so, we identify topics in all 41,143 scripts using LDA (Blei et al., 2003), modeling each script as a distribution over 50 inferred topics, removing stopwords and names.

We model the impact of individual topics on audio features by representing each document as its 50-dimensional distribution of topic proportions; in order to place both very frequent and infrequent topics on the same scale, we normalize each topic to a standard normal across the dataset. For each of the four real-valued audio features y = {danceability, acousticness, instrumentalness, and tempo}, we regress the relationship between the document-topic representations in turn using OLS; for the binary-valued mode variable (major vs. minor key), we model the relationship between the topic representations using binary logistic regression.
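A minimal sketch of these regressions, assuming a precomputed document-topic matrix and per-film audio feature values (the variable names are illustrative, not from the authors' code), might look like the following.

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

def topic_coefficients(doc_topics, audio):
    """doc_topics: (n_films x 50) topic proportions; audio: feature name -> per-film values."""
    X = StandardScaler().fit_transform(doc_topics)  # standardize each topic column
    coefs = {}
    for name, y in audio.items():
        if name == "mode":  # binary major/minor key
            model = LogisticRegression(max_iter=1000).fit(X, y)
            coefs[name] = model.coef_[0]
        else:  # danceability, acousticness, instrumentalness, tempo
            model = LinearRegression().fit(X, y)
            coefs[name] = model.coef_
    return coefs  # one coefficient per topic, per audio feature

# The most and least predictive topics for a feature can then be read off by
# sorting the coefficient vector, e.g. np.argsort(coefs["tempo"]).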

Table 1 illustrates the five strongest positive and negative topic predictors for each audio feature.

While the latent topics do not have defined categories, we can extract salient aspects of the stories based on the words in the most prevalent topics. The words describing the topics most and least associated with an audio feature can give us some insight into how songs are used in soundtracks.

• Mode. Major versus minor is typically one of the most stark musical contrasts, with the major key being characteristically associated with joy and excitement, and the minor key being associated with melancholy (Hevner, 1935). We see songs in major being used in


(a) mode (major/minor)
   0.141  captain ship sir
   0.074  town horse sheriff
   0.055  sir colonel planet
   0.050  boy huh big
   0.050  mr. sir mrs.
  -0.045  sir brother heart
  -0.052  leave understand father
  -0.054  gibbs mcgee boss
  -0.066  agent security phone
  -0.083  baby yo y'all

(b) acousticness
   0.044  mr. sir mrs.
   0.028  boy huh big
   0.020  christmas la aa
   0.019  sir dear majesty
   0.019  leave understand father
  -0.017  um work fine
  -0.018  kill dead blood
  -0.020  fuck shit fucking
  -0.021  school class college
  -0.024  dude cool whoa

(c) danceability
   0.017  baby yo y'all
   0.010  woman married sex
   0.008  sir brother heart
   0.007  dude cool whoa
   0.006  spanish el la
  -0.006  father lord church
  -0.007  remember feel dead
  -0.008  sir dear majesty
  -0.011  captain ship sir
  -0.015  sir colonel planet

(d) tempo
   0.003  dude cool whoa
   0.003  sir colonel planet
   0.002  music show sing
   0.002  um work fine
   0.002  kill dead blood
  -0.002  baby yo y'all
  -0.002  remember feel dead
  -0.002  sir dear majesty
  -0.004  mr. sir mrs.
  -0.005  captain ship sir

(e) instrumentalness
   0.025  sir colonel planet
   0.020  mr. sir mrs.
   0.017  years world work
   0.017  captain ship sir
   0.016  agent security phone
  -0.011  school class college
  -0.012  baby yo y'all
  -0.013  woman married sex
  -0.014  sir brother heart
  -0.015  music show sing

Table 1: Script topics predictive of audio features. For each feature, the 5 topics most predictive of high feature values (1-5) and the 5 topics most predictive of low feature values (46-50) are shown. Topics are displayed as the top three most frequent words within them. Coefficients for mode (as a categorical variable) are for binary logistic regression; those for all other features are for linear regression.

films with polite greetings, those with sheriffs and horses, and with ship captains. Minor songs appear more often in stories involving separation from parents, FBI agents, or viruses. The strongest topic associated with major key is "captain ship"; productions most strongly associated with this topic include episodes from the TV show Star Trek: The Next Generation.

• Acousticness. The acousticness of a soundtrack captures the degree of electric instrumentation; as Figure 1(d) shows, acousticness shows the greatest decline over time (corresponding to the rise of electric instruments); we see this also reflected topically here, with the topic most strongly associated with acoustic soundtracks being "mr. sir mrs."; this topic tends to appear in period pieces and older films, such as Arsenic and Old Lace (1944).

• Danceability. The danceability of a song is the degree to which it is suitable for dancing. The topics most strongly associated with consistently danceable soundtracks include "baby yo" (dominant in movies like Malibu's Most Wanted (2003), Menace II Society (1993), and Hustle & Flow (2005)) and "woman married" (dominant in episodes of Friends and Sex and the City).

• Tempo. Musical tempo is a measure of pace; the strongest topic associated with fast pace is the "dude cool" topic, including episodes from the TV show Workaholics and The Simpsons. Perhaps unsurprisingly, the mannered "mr. sir mrs." topic is associated with a slow tempo.

• Instrumentalness. Instrumentalness measures the degree to which a song is entirely instrumental (i.e., devoid of vocals like singing), such as classical music. The "mr. sir mrs." topic again rates highly along this dimension (presumably corresponding with the use of classical music in these films); also highly ranking is the "sir colonel" topic, which is primarily a subgenre of science fiction, including episodes from the TV show Stargate and the movie Star Wars: Episode III: Revenge of the Sith (2005).


4.2 Impact on ratings

While the analysis above examines the internal relationship between a film's soundtrack and its narrative, we can also explore the relationship between the soundtrack and a film's reception: how do audiences respond to movies with fast soundtracks, to acoustic soundtracks, or to soundtracks that are predominantly classical (i.e., with high instrumentalness)? We measure response in this work by the average user rating for the film on IMDB (a real value from 0 to 10).

One confound in this kind of analysis is the content of the script itself; as §4.1 demonstrates, some topics are clearly associated with audio features like acousticness, so if we identify a relationship between acousticness and a film's rating, that might simply be a relationship between the underlying topic (e.g., "dude," "cool," "whoa") and the rating, rather than acousticness in itself.

In order to account for this, we employ methods from causal inference in observational studies, drawing on the methods of coarsened exact matching using topic distributions developed by Roberts et al. (2016). Conventional methods for exact matching aim to identify the latent causal experiment lurking within observational data by eliminating all sources of variation in covariates except for the variable of interest, identifying a subset of the original data in which covariates are balanced between treatment conditions; in our case, if 100 films have high tempo, 100 have low tempo, and the 200 films are identical in every other dimension, then if tempo has a significant correlation with a film's rating, we can interpret that significance causally (since there is no other source of variation to explain the relationship).

True causal inference is dependent on accurate model specification (e.g., its assumptions fail if an important explanatory covariate is omitted). In our case, we are seeking to model the relationship between audio features of the soundtrack and IMDB reviewers' average rating for a film, and include features representing the content of a film through a.) its topic distribution and b.) explicit genre categories from IMDB (a binary value for each of the 28 genres). We know that this model is mis-specified—surely other factors of a film's content impacting its rating may also be correlated with audio features—but in using the machinery of causal inference, we seek not to make causal claims but rather to provide a stricter criterion against which to assess the significance of our results.

Here, let us define the "treatment variable" to be the variable (such as acousticness) whose relationship with rating we are seeking to establish. The original value for this variable is real; we binarize it into two treatment conditions (0 and 1) by thresholding at 0.5 (all values above this limit are set to 1; otherwise 0). To test the relationship between audio features and user ratings in this procedure, we place each data point in a stratum defined by the values of its other covariates; we coarsen the values of each covariate into a binary value: for all numeric audio features, we binarize at a threshold of 0.5; for topic distributions, we coarsen by selecting the argmax topic as the single binary topic value. For each stratum with at least 5 data points in each treatment condition, we sample data points to reflect the overall distribution of the treatment variable in the data; any data points in strata for which there are fewer than 5 points from each condition are excluded from analysis. This, across all strata, defines our matched data for analysis. We carry out this process once for each treatment variable {mode, danceability, acousticness, instrumentalness, tempo}.
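A minimal sketch of the matching step just described, assuming each film is represented as a dict holding its Spotify features, argmax script topic, binary genre vector, and rating; the field names, the within-stratum sampling rule, and the fixed 0.5 thresholds are illustrative simplifications, not the authors' exact implementation.

```python
import random
from collections import defaultdict

AUDIO = ["mode", "danceability", "acousticness", "instrumentalness", "tempo"]

def matched_films(films, treatment, min_per_condition=5, seed=0):
    """Coarsened exact matching (simplified): stratify films on coarsened
    covariates, keep strata with enough films in both treatment conditions,
    and sample within each stratum to mirror the overall treatment split."""
    rng = random.Random(seed)
    is_treated = lambda f: f[treatment] > 0.5                 # binarized treatment
    p_treated = sum(map(is_treated, films)) / len(films)

    strata = defaultdict(list)
    for f in films:
        key = (tuple(f[a] > 0.5 for a in AUDIO if a != treatment),  # coarsened audio
               f["argmax_topic"],                                   # coarsened topics
               tuple(f["genres"]))                                  # 28 binary genres
        strata[key].append(f)

    matched = []
    for group in strata.values():
        treated = [f for f in group if is_treated(f)]
        control = [f for f in group if not is_treated(f)]
        if len(treated) < min_per_condition or len(control) < min_per_condition:
            continue                                          # stratum excluded
        n = min(len(treated), len(control))
        k_treated = min(len(treated), max(1, round(n * p_treated)))
        k_control = min(len(control), max(1, round(n * (1 - p_treated))))
        matched += rng.sample(treated, k_treated) + rng.sample(control, k_control)
    return matched
```

The significance of the treatment coefficient can then be assessed with an ordinary regression of rating on the binarized treatment over the matched data.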

Coefficient   Audio feature
 0.121*       Acousticness
 0.117*       Instrumentalness
 0.031        Tempo
 0.024*       Mode
−0.103*       Danceability

Table 2: Impact of audio features on IMDB average user rating. Features marked * are significant at p ≤ 0.001.

Table 2 presents the results of this analysis: all audio features of the soundtrack except tempo have a significant (if small) impact on the average user rating for the film they appear with. A highly danceable soundtrack would lower the score of a film from 9.0 to 8.897; adding an acoustic soundtrack would raise it to 9.121; and adding an instrumental soundtrack with no vocals would raise it to 9.117. Our experiments suggest that certain musical aspects might be generally more effective than others in the context of a film score, and that these attributes significantly shape a viewer's reactions to the overall film.


5 Previous Work

Because the ability to forecast box office success is of practical interest for studios who want to decide which scripts to "greenlight" (Eliashberg et al., 2007; Joshi et al., 2010) or where to invest marketing dollars (Mestyan et al., 2013), a number of previous studies look at predicting movie success by measuring box office revenue or viewer ratings. Eliashberg et al. (2007) used linear regression on metadata and textual features drawn from "spoilers", detailed summaries written by moviegoers, to predict return on investment in movie-making. Joshi et al. (2010) used review text from critics to predict revenues, finding n-gram and dependency relation features to be an informative complement to metadata-based features. Jain (2013) used Twitter sentiment to predict box office revenue, further classifying movies as either a hit, a flop, or average. Oghina et al. (2012) used features computed from Twitter and YouTube comments to predict IMDB ratings.

Though a number of previous works have attempted to predict film performance from text and metadata, little attention has been paid to the role of the soundtrack in a movie's success. Xu and Goonawardene (2014) did consider soundtracks, finding the volume of internet searches for a movie's soundtrack preceding a release to be predictive of box office revenue. That work, however, only considers the popularity of the soundtrack as a surface feature; it does not directly measure whether the musical characteristics of the songs in the soundtracks are themselves predictive.

6 Conclusion

In this work, we introduce a new dataset of films, subtitles, and soundtracks along with two empirical analyses, the first demonstrating the connections between the contents of a story (as measured by the topics in its script) and the musical features that make up its soundtrack, and the second identifying musical aspects that are associated with better user ratings on IMDB. Soundtracks using acoustic instruments, as measured by Spotify's "acousticness" descriptor, and those with instruments but no vocals, as measured by the "instrumentalness" descriptor, are each linked with more than a 0.11 increase in ratings on a 10-star scale, even when controlling for other musical dimensions, topic, and genre through Coarsened Exact Matching. Soundtracks that are more "danceable" point in the opposite direction, indicating a decrease of 0.1 stars.

We hope that one of the primary beneficiaries of the line of work introduced here will be music supervisors, whose job involves choosing existing music to license or hiring composers to create original scores. Understanding the connections between the different modalities that contribute to a story can be useful for understanding the history of film scoring and music licensing as well as for making decisions during the production process. Though traditionally the music supervisor plays a well-defined role on a film, in contemporary practice many people contribute to music supervision throughout the production process for all kinds of media, from movies and television to advertising, social media, and games.

There are several directions of future research that are worth further pursuit. First, while we have shown that strong relationships exist between films and their soundtracks as a whole and that a soundtrack is predictive of user ratings, this relationship only obtains over the entirety of the script and the entirety of the soundtrack; a more fine-grained model would anchor occurrences of individual songs at specific moments in the temporal narrative of the script. While our data does not directly indicate when a song occurs, latent variable modeling over the scripts and soundtracks in our collection may provide a reasonable path forward. Second, while our work here has focused on descriptive analysis of this new data, a potentially powerful application is soundtrack generation: creating a new soundtrack for a film given the input of a script. This application has the potential to be useful for music supervisors, by suggesting candidate songs that fit the narrative of a given script in production.

Music is a vital storytelling component of multimodal narratives such as film, television, and theatre, and we hope to drive further work in this area. Data and code to support this work can be found at https://github.com/jrgillick/music_supervisor.

7 Acknowledgments

Many thanks to the anonymous reviewers for their helpful feedback. The research reported in this article was supported by a UC Berkeley Fellowship for Graduate Study to J.G.


References

Apoorv Agarwal, Jiehan Zheng, Shruti Kamath, Sriramkumar Balasubramanian, and Shirin Ann Dey. 2015. Key female characters in film have more to talk about besides men: Automating the Bechdel test. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, Colorado, pages 830–840. http://www.aclweb.org/anthology/N15-1084.

Stacey Anderson. 2012. Electronic dance music goes Hollywood. The New York Times.

David Bamman, Brendan O'Connor, and Noah A. Smith. 2013. Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Sofia, Bulgaria, pages 352–361.

Giuseppe Becce, editor. 1919. Kinothek. Neue Filmmusik. Schlesinger'sche Buchhandlung.

David M. Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. JMLR 3:993–1022.

James Buhler, David Neumeyer, and Rob Deemer. 2010. Hearing the movies: music and sound in film history. Oxford University Press, New York.

David Byrne. 2012. How music works. Three Rivers Press.

Cristian Danescu-Niculescu-Mizil, Justin Cheng, Jon Kleinberg, and Lillian Lee. 2012. You had me at hello: How phrasing affects memorability. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL '12, pages 892–901. http://dl.acm.org/citation.cfm?id=2390524.2390647.

Jehoshua Eliashberg, Sam K Hui, and Z John Zhang. 2007. From story line to box office: A new approach for green-lighting movie scripts. Management Science 53(6):881–893.

Hans Erdmann, Giuseppe Becce, and Ludwig Brav. 1927. Allgemeines Handbuch der Film-Musik. Schlesinger.

Lea Frermann, Shay B Cohen, and Mirella Lapata. 2017. Whodunnit? Crime drama as a case for natural language understanding. arXiv preprint arXiv:1710.11601.

Claudia Gorbman. 1987. Unheard melodies: Narrative film music. Indiana University Press.

Philip John Gorinski and Mirella Lapata. 2015. Movie script summarization as graph-based scene extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 1066–1076.

Tanaya Guha, Che-Wei Huang, Naveen Kumar, Yan Zhu, and Shrikanth S Narayanan. 2015. Gender representation in cinematic content: A multimodal approach. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, pages 31–34.

Kate Hevner. 1935. The affective character of the major and minor modes in music. The American Journal of Psychology 47(1):103–118. http://www.jstor.org/stable/1416710.

Vasu Jain. 2013. Prediction of movie success using sentiment analysis of tweets. The International Journal of Soft Computing and Software Engineering 3(3):308–313.

Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A Smith. 2010. Movie reviews and revenues: An experiment in text regression. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 293–296.

Tomas Kocisky, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gabor Melis, and Edward Grefenstette. 2017. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040.

Edith Lang and George West. 1920. Musical accompaniment of moving pictures; a practical manual for pianists and organists and an exposition of the principles underlying the musical interpretation of moving pictures. The Boston Music Co., Boston.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), Paris, France.

Marton Mestyan, Taha Yasseri, and Janos Kertesz. 2013. Early prediction of movie box office success based on Wikipedia activity big data. PloS ONE 8(8):e71226.

Andrei Oghina, Mathias Breuss, Manos Tsagkias, and Maarten de Rijke. 2012. Predicting IMDB movie ratings using social media. In European Conference on Information Retrieval. Springer, pages 503–507.

Trevor J Pinch, Frank Trocco, and TJ Pinch. 2009. Analog days: The invention and impact of the Moog synthesizer. Harvard University Press.


Roy M Prendergast. 1992. Film music: a neglected art: a critical study of music in films. WW Norton & Company.

Anil Ramakrishna, Nikolaos Malandrakis, Elizabeth Staruk, and Shrikanth Narayanan. 2015. A quantitative analysis of gender differences in movies using psycholinguistic normatives. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 1996–2001. https://aclweb.org/anthology/D/D15/D15-1234.

Margaret E. Roberts, Brandon Stewart, and R. Nielsen. 2016. Matching methods for high-dimensional data with applications to text. Working paper.

Ronald Rodman. 2006. The popular song as leitmotif in 1990s film. Changing Tunes: The Use of Pre-existing Music in Film, pages 119–136.

Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. CVPR.

Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. International Journal of Computer Vision 123(1):94–120. https://doi.org/10.1007/s11263-016-0987-1.

Michael James Slowik. 2012. Hollywood film music in the early sound era, 1926–1934.

Makarand Tapaswi, Martin Bauml, and Rainer Stiefelhagen. 2015. Book2Movie: Aligning video scenes with book chapters. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Haifeng Xu and Nadee Goonawardene. 2014. Does movie soundtrack matter? The role of soundtrack in predicting movie revenue. In PACIS. page 350.

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. ICCV.


Proceedings of the First Workshop on Storytelling, pages 43–49, New Orleans, Louisiana, June 5, 2018. ©2018 Association for Computational Linguistics

Towards Controllable Story Generation

Nanyun Peng, Marjan Ghazvininejad, Jonathan May, Kevin Knight
Information Sciences Institute & Computer Science Department
University of Southern California
{npeng,ghazvini,jonmay,knight}@isi.edu

Abstract

We present a general framework for analyzing existing story corpora to generate controllable and creative new stories. The proposed framework needs little manual annotation to achieve controllable story generation. It creates a new interface for humans to interact with computers to generate personalized stories. We apply the framework to build recurrent neural network (RNN)-based generation models to control story ending valence1 (Egidi and Gerrig, 2009) and storyline. Experiments show that our methods successfully achieve the control and enhance the coherence of stories through introducing storylines. With the additional control factors, the generation model gets lower perplexity and yields more coherent stories that are faithful to the control factors according to human evaluation.

1 Introduction

Storytelling is an important task in natural language generation, which plays a crucial role in the generation of various types of texts, such as novels, movies, and news articles. Automatic story generation efforts started as early as the 1970s with the TALE-SPIN system (Meehan, 1977). Early attempts in this field relied on symbolic planning (Meehan, 1977; Lebowitz, 1987; Turner, 1993; Bringsjord and Ferrucci, 1999; Perez and Sharples, 2001; Riedl and Young, 2010), case-based reasoning (Gervas et al., 2005), or generalizing knowledge from existing stories to assemble new ones (Swanson and Gordon, 2012; Li et al., 2013). In recent years, deep learning models have been used to capture higher-level structure in stories. Roemmele et al. (2017) use skip-thought vectors (Kiros et al., 2015) to encode sentences, and a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to generate stories.

1 Happy or sad endings.

Figure 1: An overview (upper) and an example (lower) of the proposed analyze-to-generate story framework. (Overview: Stories are processed by an Analyzer to produce Structured Control Factors, which, together with Human Inputs, feed a Generator. Example: from the story body "Sam was a star athlete. He ran track at college. There was a big race coming up. Everyone was sure he would win.", the sad ending is "Sam got very nervous and lost the game." and the happy ending is "He got the first prize!")

Martin et al. (2017) train a recurrent encoder-decoder neural network (Sutskever et al., 2014) to predict the next event in the story.

Despite significant progress in automatic story generation, there has been less emphasis on controllability: having a system take human inputs and compose stories accordingly. With the recent successes on controllable generation of images (Chen et al., 2016; Siddharth et al., 2017; Lample et al., 2017), dialog responses (Wang et al., 2017), poems (Ghazvininejad et al., 2017), and different styles of text (Hu et al., 2017; Ficler and Goldberg, 2017; Shen et al., 2017; Fu et al., 2017), people would want to control a story generation system to produce interesting and personalized stories.

This paper emphasizes the controllability aspect. We propose a completely data-driven approach towards controllable story generation by analyzing existing story corpora. First, an analyzer extracts control factors from existing stories, and then a generator learns to generate stories according to the control factors. This creates an excellent interface for humans to interact: the generator can take human-supplied control factors to generate stories that reflect a user's intent.


Figure 1 gives the overview (upper) and an example (lower) of the framework. The instantiations of the analyzer and the generator are flexible and can be easily applied to different scenarios. We explore two control factors: (1) ending valence (happy or sad ending) and (2) storyline keywords. We use supervised classifiers and rule-based keyword extractors for analysis, and conditional RNNs for generation.

The contributions of the paper are two-fold:

1. We propose a general framework enabling interactive story generation by analyzing existing story corpora.

2. We apply the framework to control story ending valence and storyline, and show that with these additional control factors, our models generate stories that are both more coherent and more faithful to human inputs.

2 Controllable Story Generation

As a pilot study, we explore the control of 1) ending valence, which is an abstract, style-level element of stories, and 2) storyline, which is a more concrete, content-level concept for stories.

2.1 Ending Valence Control

Prior work has explored manipulating emotion in interactive storytelling (Cavazza et al., 2009). For simplicity, we refine our scope to manipulating the ending valence for controllable story generation. We categorize ending valence into happyEnding, sadEnding, or cannotTell.

Analyzer. The analyzer for the ending valence control is a classifier that labels each story as happyEnding, sadEnding, or cannotTell. Formally, given a story corpus X = {x_1, x_2, ..., x_N} with N stories, the ending valence analyzer is a function f_v that maps each story x_i to a label l_i:

l_i = f_v(x_i),

where i indexes instances. Since there is no prior work on analyzing story ending valence, we build our own analyzer by collecting annotations for story ending valence from Amazon Mechanical Turk (AMT) and building a supervised classifier. We employ an LSTM-based logistic regression classifier, as it learns feature representations that capture long-term dependencies between the words and has been shown to be effective in text classification tasks (Tang et al., 2015).

Specifically, we use a bidirectional LSTM (BiLSTM) to encode an input story into a sequence of vector representations h_i = {h_{i,1}, h_{i,2}, ..., h_{i,T}}, where h_i = BiLSTM(x_i) is the element-wise concatenation of forward and backward encodings, T denotes the story length, and the forward and backward sequences are computed by a forward and a backward LSTM. An LSTM cell is applied at each step to complete the following computations:

$$\begin{pmatrix} i_{i,t} \\ f_{i,t} \\ o_{i,t} \\ \tilde{c}_{i,t} \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} U \begin{pmatrix} E_w x_{i,t} \\ h_{i,t-1} \end{pmatrix} \quad \text{(1a)}$$

$$c_{i,t} = f_{i,t} \odot c_{i,t-1} + i_{i,t} \odot \tilde{c}_{i,t} \quad \text{(1b)}$$

$$h_{i,t} = \mathrm{LSTM\text{-}cell}(x_{i,t}, h_{i,t-1}) \quad \text{(1c)}$$
$$\phantom{h_{i,t}} = o_{i,t} \odot \tanh(c_{i,t}) \quad \text{(1d)}$$

i_{i,t}, f_{i,t}, o_{i,t}, and the contemporary central memory \tilde{c}_{i,t} are the input, forget, and output gates and a candidate memory that control the information flow from the previous contexts and the current input. \sigma and \tanh denote the element-wise sigmoid and tanh functions. E_w is an embedding matrix that maps each input word x_{i,t} to its word embedding vector. Each h_{i,t} is a d-dimensional vector that can be viewed as the contextual representation of word x_{i,t}.

To obtain the sentence representation, we take a max pooling over the sentence, where for each dimension j of the vector h_i we have:

$$h_i^j = \max_{t \in [1,\dots,T]} h_{i,t}^j, \quad j = 1, \dots, d. \quad \text{(2)}$$

The final classifier is defined as:

$$f_v(x_i) = g(W h_i + b), \quad \text{(3)}$$

where g(·) is the softmax function, and W and b are model parameters that are jointly learned with the BiLSTM parameters.
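For concreteness, here is a minimal PyTorch sketch of the classifier in Equations (1)–(3): a BiLSTM encoder, max pooling over time, and a softmax output layer. Layer sizes and training details are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class ValenceClassifier(nn.Module):
    """BiLSTM encoder + max pooling over time + softmax (Eqs. 1-3)."""
    def __init__(self, vocab_size, n_labels=3, emb_dim=100, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)            # E_w
        self.bilstm = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hid_dim, n_labels)             # W, b

    def forward(self, word_ids):
        # word_ids: (batch, T) story token ids
        h, _ = self.bilstm(self.emb(word_ids))                  # (batch, T, 2*hid)
        h_max, _ = h.max(dim=1)                                 # max pool over time (Eq. 2)
        return torch.log_softmax(self.out(h_max), dim=-1)       # f_v(x) (Eq. 3)

# training sketch: nn.NLLLoss() on (log-probs, gold happy/sad/cannotTell labels)
```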

Generator. The generator for the ending-valence-controlled story generation is a conditional language model, where the probability of generating each word is denoted as p(w_t | w_1^{t-1}, l; θ); l represents the ending valence label and θ represents model parameters. We learn valence embeddings for the ending valence labels to facilitate the computation. Formally, we learn an embedding matrix E_l to map each label l_k into a vector:

$$e_l^k = E_l[l_k],$$

44

Page 55: NAACL HLT 2018in our presentation of extant storytelling research, discourse, and innovations. Rather, we seek to spark dialogues, questions, and cross-discipline exchanges around

where E_l is an m × p matrix that maps each of the p labels into an m-dimensional vector. The ending valence embedding dimension is made the same as the word embedding dimension for simplicity.

We add the ending valence as follows:

$$p(w_t \mid w_1^{t-1}, l; \theta) = \begin{cases} g(V\, F(e_l, F(w_{t-1}, h_{t-1}))), & t = s \\ g(V\, F(w_{t-1}, h_{t-1})), & \text{otherwise} \end{cases} \quad \text{(4)}$$

where s denotes the position right before the ending sentence, g(·) is the softmax function, F denotes the computations of an LSTM cell, and V denotes parameters that perform a linear transformation. We treat the ending valence as an additional input to the story. The valence embeddings are jointly learned with other model parameters.
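One way to realize Equation (4) is to feed the valence embedding to the LSTM as an extra input step at position s. A hedged PyTorch sketch, with hyperparameters and the injection mechanics chosen for illustration rather than taken from the authors' code:

```python
import torch
import torch.nn as nn

class ValenceConditionedLM(nn.Module):
    """LSTM language model that receives the valence embedding e_l as one
    extra input step right before the ending sentence (one reading of Eq. 4)."""
    def __init__(self, vocab_size, n_valences=2, emb_dim=100, hid_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.valence_emb = nn.Embedding(n_valences, emb_dim)     # E_l
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)                # V

    def forward(self, word_ids, valence_id, s):
        # word_ids: (batch, T); valence_id: (batch,); s: index of the step
        # right before the ending sentence, where e_l is injected.
        emb = self.word_emb(word_ids)                            # (batch, T, emb)
        e_l = self.valence_emb(valence_id).unsqueeze(1)          # (batch, 1, emb)
        emb = torch.cat([emb[:, :s], e_l, emb[:, s:]], dim=1)    # inject e_l at s
        h, _ = self.lstm(emb)
        return self.out(h)                                       # next-word logits
```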

2.2 Storyline Control

Li et al. (2013) introduced plot graphs, which contain events and their relations, to represent storylines. Although the representation is rich, these plot graphs are hard to define and curate without highly specialized knowledge. In this pilot study, we follow what Yan (2016) did for poetry and use a sequence of words as the storyline. We further confine the words to appear in the original story.

Analyzer. The analyzer for storyline control is an extractor that extracts a sequence of words k_i = {k_{i,1}, k_{i,2}, ..., k_{i,r}} from each story x_i. The k_i are ordered according to their order in the story. We adapt the RAKE algorithm (Rose et al., 2010) for keyword extraction, which builds document graphs and weights the importance of each word by combining several word-level and graph-level criteria. We extract the most important word from each sentence as the storyline.
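A rough sketch of this per-sentence extraction follows. It scores candidate (non-stopword) words with the degree-to-frequency ratio used by RAKE and keeps the highest-scoring word of each sentence; the tiny stopword list and the treatment of each sentence as a single candidate phrase are simplifications of the adapted RAKE extractor, not the authors' exact code.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "an", "and", "to", "of", "was", "is", "he", "she", "it", "in"}

def storyline(story_sentences):
    """Pick the highest-scoring content word from each sentence, keeping story order."""
    tokens = [re.findall(r"[a-z']+", s.lower()) for s in story_sentences]
    # RAKE-style scores over the whole story: degree(w) / freq(w),
    # where degree counts co-occurrences of w within candidate phrases.
    freq, degree = Counter(), defaultdict(int)
    for sent in tokens:
        phrase = [w for w in sent if w not in STOPWORDS]   # simplification: one phrase per sentence
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)
    score = {w: degree[w] / freq[w] for w in freq}
    return [max((w for w in sent if w in score), key=score.get, default="")
            for sent in tokens]

# storyline(["Sam was a star athlete.", "He ran track at college."])
```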

Generator. The generator for storyline-controlled generation is also a conditional language model. Specifically, we employ the seq2seq model with attention (Bahdanau et al., 2014) implemented in OpenNMT (Klein et al., 2017). The storyline words are encoded into vectors h^k = BiLSTM(k) (the concatenation of forward and backward encodings), and the decoder generates each word according to the probability:

$$p(w_t \mid w_1^{t-1}, h^k; \theta) = g(V^l s_t) \quad \text{(5a)}$$
$$s_t = F^{att}(w_{t-1}, s_{t-1}, c_t) \quad \text{(5b)}$$
$$c_t = \sum_{j=1}^{r} \alpha_{tj} h^k_j \quad \text{(5c)}$$
$$\alpha_{tj} = \frac{\exp(a(s_{t-1}, h^k_j))}{\sum_{p=1}^{r} \exp(a(s_{t-1}, h^k_p))} \quad \text{(5d)}$$

Agreement experiment          Cases   Agreement
Researcher vs. Researcher       150     83%
Turkers vs. Researcher          150     78%
Classifier vs. Turkers         3980     69%
Always happyEnding             3980     58%

Table 1: Annotation agreement for labeling story ending valence. Labels are happyEnding, sadEnding, or cannotTell. The automatic classifier trained on 3980 turker-annotated stories achieved much better results than the majority baseline on 5-fold cross-validation.

Method                 PPL
Uncontrolled model     24.63
Storyline controlled   18.36

Table 2: Perplexities on the ROCstories development data. When storylines are given, the controlled models achieve lower perplexity than the uncontrolled one.

g(·) again denotes the softmax function, and V^l denotes parameters that perform a linear transformation. F^{att}(·) in Equation 5b denotes the computations of an LSTM cell with an attention mechanism, where the context vector c_t is computed by a weighted summation of the storyline word vectors as in Equation 5c, and the weights are computed from an alignment function a(·) as in Equation 5d.
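A small numpy sketch of the attention step in Equations (5c)–(5d), using an additive alignment function as one common choice for a(·); the paper itself relies on the OpenNMT implementation, so the weight matrices and shapes here are illustrative assumptions.

```python
import numpy as np

def attention_context(s_prev, h_keywords, W_s, W_h, v):
    """Compute alpha_{tj} and c_t (Eqs. 5c-5d) with an additive alignment a()."""
    # s_prev: (d_s,) previous decoder state; h_keywords: (r, d_h) storyline encodings
    # W_s: (d_a, d_s), W_h: (d_a, d_h), v: (d_a,) alignment parameters
    scores = np.tanh(h_keywords @ W_h.T + s_prev @ W_s.T) @ v   # a(s_{t-1}, h_j), shape (r,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                        # softmax over keywords (Eq. 5d)
    c_t = alpha @ h_keywords                                    # weighted sum (Eq. 5c)
    return alpha, c_t
```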

3 Experimental Setup

We conduct experiments on the ROCstories dataset (Mostafazadeh et al., 2016), which consists of 98,162 five-line stories for training, and 1,871 stories each for the development and test sets. We treat the first four sentences of each story as the body and the last sentence as the ending. We build analyzers to annotate the ending valence and the storyline for every story, and train the two controlled generators with the 98,162 annotated stories.

3.1 Ending Valence Annotation

We conduct a three-stage data collection procedure to gather ending valence annotations and train a classifier to analyze the whole corpus. We classify all the stories into happyEnding, sadEnding, or cannotTell. Table 1 summarizes the results.

In the first stage, two researchers annotate 150 stories to gauge the feasibility of the task. It is nontrivial, as the agreement between the two researchers is only 83%, mainly because of the cannotTell case.2 The second stage collects larger-scale annotations from AMT.

2 The inter-annotator agreement is 95% if we exclude the instances in which at least one person chose cannotTell.


                        Ending Valence                          Storyline
Methods                 Faithfulness        Coherence           Faithfulness        Coherence
                        avg score  % win    avg score  % win    avg score  % win    avg score  % win
Retrieve Human          3.20       19.6     2.89       16.4     3.47       42.4     3.44       17.6
Uncontrolled            3.26       27.8     3.54       41.8     2.82       20.8     3.54       29.2
Controlled-Generated    3.44       52.6     3.37       41.8     3.08       36.8     3.74       53.2

Table 3: Human evaluation for the ending valence (left) and storyline (right) controlled generation. Scores range in [1,5]. Three stories (one from each method) are grouped together so that people can give comparative scores. The faithfulness survey asks people to rate whether the generated stories reflect the given control factors. Coherence asks people to rate the coherence of the stories without considering the control factors. % win measures how often the result generated by one method is rated higher than the others, excluding the instances that tie on the highest score.

We gather 3980 annotated stories with the turker-researcher agreement at 78%. A classifier as described in Section 2.1 is then trained to analyze the whole ROCstories corpus. Using 5-fold cross-validation, we estimate the accuracy of the classifier to be 69%,3 which, while not terribly impressive, is an 11% improvement over the majority baseline (happyEnding). Considering the low inter-annotator agreement on this problem, we consider this a decent analyzer.

4 Experimental Results

We compare the controlled generation under our proposed framework with uncontrolled generation. We design the experiments to answer the following research questions:

1. How does the controlled generation framework affect the generation quantitatively?

2. Does the proposed framework enable control of the stories while maintaining their coherence?

To answer the former question, we design automatic evaluations that measure the perplexity of the models given appropriate and inappropriate controls. For the latter question, we design human evaluations to compare the generated stories from controlled and uncontrolled versions in terms of document-level coherence and faithfulness to the control factors.

4.1 Automatic Evaluation

The advantage of automatic evaluation is that it can be conducted at scale and gives panoramic views of the systems. We compute the perplexities of different models on the ROCstories development dataset.

3 We included the cannotTell cases and conducted a 3-class classification.

Table 2 shows the results for the storyline experiments. With the additional storyline information, it is easier for the generation model to guess what will happen next in a story, which yields lower perplexity. We conduct the same experiments for ending valence controlled generation and observe the same trend. However, since ending valence is only one bit of information, the perplexity difference is only 0.8.

4.2 Human Evaluation

We conduct a human evaluation with 1000 story groups for each setting. Each group consists of stories from: (1) the uncontrolled LSTM generation model, (2) controlled generation with our framework, and (3) a contrastive method which retrieves and re-ranks existing sentences in the training data. Users are asked to rate the three stories on a 1-5 scale with respect to faithfulness (whether stories reflect the control factor) and coherence. All the evaluations are conducted on Amazon Mechanical Turk. We compute the average score and percentage win of each method. Table 3 summarizes the results.

Ending Valence For the ending valence control, we supply each system with the first 4 sentences from the ROCStories test set and an ending valence randomly assigned by a human. The systems generate endings.4 We only let the systems generate happyEnding or sadEnding stories, with the ratio around 1:1. Faithfulness is defined as whether the generated stories reflect the given ending valence.

The contrastive method retrieves existing happy or sad endings from the training data instead of generating new sentences. Specifically, we gather all the stories that are annotated with happyEndings from the 3980 annotated stories in one set, and all the sadEndings in another set.

4 The uncontrolled LSTM generation model has no way to take the ending valence input.


Story Body 1: sarah had been dreaming of visiting europe for years. she had finally saved enough for the trip. she landed in spain and traveled east across the continent. she didn't like how different everything was.
  Human: sarah decided she preferred home over europe.
  No-Valence-Control: sarah was so happy that she had a great time.
  Happy-Generated: sarah was happy to go home.
  Sad-Generated: sarah was very disappointed.
  Happy-Retrieved: they both had a lot of fun.
  Sad-Retrieved: I had to go to the hospital.

Story Body 2: samuel loved reading old science fiction stories. he collected books by hg wells and jules verne. his favorite was hg wells. his father gave him a copy of the island of dr moreau.
  Human: he loved it!
  No-Valence-Control: he was so happy to see the book.
  Happy-Generated: he was very happy to see the new book.
  Sad-Generated: he was so sad that he couldn't find his book.
  Happy-Retrieved: by the end of the day, he was glad he had gone!
  Sad-Retrieved: bob got very sick and had to go to the hospital.

Table 4: Samples of endings generated given the first four sentences of a story and an ending valence. The uncontrolled version cannot take the designated ending valence into account, while both controlled versions reflect the correct valence.

Users' Storyline: john angry meeting park happy
Generated Story: john was a very angry person. he was angry at his job. john decided to go to the park. john went to the park. john was very happy with his new job.

Users' Storyline: john angry meeting park wolf
Generated Story: john was very angry. he had a meeting at the park. he went to the park. he saw a wolf. john was very angry.

Users' Storyline: jennifer angry meeting park happy
Generated Story: jennifer was very angry. she had a meeting at the park. jennifer was not happy. she went to the park. jennifer was happy to be meeting her friend.

Users' Storyline: jennifer angry meeting park wolf scared
Generated Story: jennifer was very angry. she had a meeting at the park. she went to the park. she saw a wolf. jennifer was very angry.

Table 5: Samples of stories generated from user-given storyline words. The generated stories usually contain the words in the storyline. Changing one word in the storyline can result in another creative story.

When the given ending valence is happyEnding, the system scores all the sentences from the happyEnding set by combining each with the given 4 sentences and using a trained uncontrolled generation model to compute the likelihood. This chooses the most coherent happyEnding for the given story. Similarly for the sadEnding stories. Table 3 shows that the proposed analyze-to-generate framework ("Controlled-Generated") achieves the highest faithfulness score while retaining similar coherence as the uncontrolled one.

Storyline For storyline control, we supply each system with 5 words as a storyline. The systems generate stories accordingly. Storyline words are extracted from the ROCstories test set. The uncontrolled generation model cannot take this input; it generates random stories. Faithfulness is defined as whether the generated stories follow the given storyline.

The contrastive method retrieves human-written sentences from the training data to compose stories. Specifically, it follows the given storyline word order to retrieve sentences from the training data. The trained uncontrolled generation model scores each sentence based on the existing previous sentences, and the highest-scoring sentence is chosen for each word in the storyline. If a word in the storyline has never appeared in the training data, we simply skip it.

As shown in Table 3, the contrastive method achieves the highest faithfulness, probably because it guarantees that the words in the storyline appear in the stories while the other systems cannot. However, the coherence of the contrastive method is lowest, because it is constrained by the existing sentences in the training data. Although an uncontrolled generation model is employed to encourage document-level coherence, the available choices are restricted.


Our method achieves the best coherence and a higher faithfulness score than the uncontrolled version.

4.3 Generation Samples

Table 4 shows two examples of the ending valence controlled generation. The uncontrolled model "No-Valence-Control" can generate coherent endings; however, it cannot alter the ending valence. On the other hand, the two controlled models can generate different endings based on different ending valences. The contrastive retrieval method, restricted by the existing happyEndings and sadEndings in the training data, obtains endings that are not coherent with the whole story.

Table 5 demonstrates some examples from the storyline controlled generation. The storyline words are user supplied. We can see that this provides fun interactions: changing one word in the storyline can result in a creative new story.

5 Conclusion

We proposed an analyze-to-generate framework that enables controllable story generation. The framework is generally applicable for many control factors. In this paper, two instantiations of the framework are explored to control the ending valence and the storyline of stories. Experiments show that our framework enables human controls while achieving better coherence than an uncontrolled generation model. In the future, we will explore other control factors and better controllable generation models for adding the control factors into the generated stories. The current analyze-to-generate framework operates in a pipeline fashion. We also plan to explore joint training of the analyzer and the generator to improve the quality of both.

Acknowledgements

We thank the anonymous reviewers for the useful comments and the suggestion for the right terminology of "ending valence". This work is supported by Contract W911NF-15-1-0543 with the US Defense Advanced Research Projects Agency (DARPA).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Selmer Bringsjord and David Ferrucci. 1999. Artificial intelligence and literary creativity: Inside the mind of Brutus, a storytelling machine. Psychology Press.

Marc Cavazza, David Pizzi, Fred Charles, Thurid Vogt, and Elisabeth Andre. 2009. Emotional input for character-based interactive storytelling. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1. International Foundation for Autonomous Agents and Multiagent Systems, pages 313–320.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems.

Giovanna Egidi and Richard J Gerrig. 2009. How valence affects language processing: Negativity bias and mood congruence in narrative comprehension. Memory & Cognition 37(5):547–555.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation. pages 94–104.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2017. Style transfer in text: Exploration and evaluation. In Proceedings of The 32nd AAAI Conference on Artificial Intelligence.

Pablo Gervas, Belen Diaz-Agudo, Federico Peinado, and Raquel Hervas. 2005. Story plot generation based on CBR. Knowledge-Based Systems 18(4):235–242.

Marjan Ghazvininejad, Xing Shi, Jay Priyadarshi, and Kevin Knight. 2017. Hafez: an interactive poetry generation system. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, System Demonstrations, pages 43–48.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In Proceedings of the International Conference on Machine Learning. pages 1587–1596.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Proceedings of Advances in Neural Information Processing Systems.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In ACL.


Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Fader networks: Manipulating images by sliding attributes. In Proceedings of the 31st Conference on Neural Information Processing Systems.

Michael Lebowitz. 1987. Planning stories. In Proceedings of the Cognitive Science Society, Hillsdale.

Boyang Li, Stephen Lee-Urban, George Johnston, and Mark Riedl. 2013. Story generation with crowdsourced plot graphs. In Proceedings of The 28th AAAI Conference on Artificial Intelligence.

Lara Martin, Prithviraj Ammanabrolu, William Hancock, Shruti Singh, Brent Harrison, and Mark Riedl. 2017. Event representations for automated story generation with deep neural nets. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

James Meehan. 1977. TALE-SPIN, an interactive program that writes stories. In Proceedings of The International Joint Conferences on Artificial Intelligence Organization.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Rafael Perez and Mike Sharples. 2001. MEXICA: A computer model of a cognitive account of creative writing. Journal of Experimental & Theoretical Artificial Intelligence 13(2):119–139.

Mark Riedl and Robert Young. 2010. Narrative planning: Balancing plot and character. Journal of Artificial Intelligence Research 39:217–268.

Melissa Roemmele, Sosuke Kobayashi, Naoya Inoue, and Andrew M Gordon. 2017. An RNN-based binary classifier for the story cloze test. LSDSem 2017, page 74.

Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, pages 1–20.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Proceedings of Advances in Neural Information Processing Systems.

N Siddharth, Brooks Paige, Van de Meent, Alban Desmaison, Frank Wood, Noah D Goodman, Pushmeet Kohli, Philip HS Torr, et al. 2017. Learning disentangled representations with semi-supervised deep generative models. In Proceedings of Advances in Neural Information Processing Systems.

Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of Advances in Neural Information Processing Systems.

Reid Swanson and Andrew Gordon. 2012. Say anything: Using textual case-based reasoning to enable open-domain interactive storytelling. ACM Transactions on Interactive Intelligent Systems 2(3):16.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. pages 1422–1432.

Scott R. Turner. 1993. Minstrel: A Computer Model of Creativity and Storytelling. Ph.D. thesis, Los Angeles, CA, USA. UMI Order no. GAX93-19933.

Di Wang, Nebojsa Jojic, Chris Brockett, and Eric Nyberg. 2017. Steering output style and topic in neural response generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. pages 2130–2140.

Rui Yan. 2016. i, Poet: Automatic poetry composition through recurrent neural networks with iterative polishing schema. In Proceedings of The International Joint Conferences on Artificial Intelligence Organization. pages 2238–2244.


Proceedings of the First Workshop on Storytelling, pages 50–59, New Orleans, Louisiana, June 5, 2018. ©2018 Association for Computational Linguistics

An Encoder-decoder Approach to Predicting Causal Relations in Stories

Melissa Roemmele and Andrew S. Gordon
Institute for Creative Technologies, University of Southern California

[email protected], [email protected]

Abstract

We address the task of predicting causally related events in stories according to a standard evaluation framework, the Choice of Plausible Alternatives (COPA). We present a neural encoder-decoder model that learns to predict relations between adjacent sequences in stories as a means of modeling causality. We explore this approach using different methods for extracting and representing sequence pairs as well as different model architectures. We also compare the impact of different training datasets on our model. In particular, we demonstrate the usefulness of a corpus not previously applied to COPA, the ROCStories corpus. While not state-of-the-art, our results establish a new reference point for systems evaluated on COPA, and one that is particularly informative for future neural-based approaches.

1 Introduction

Automated story understanding is a long-pursued task in AI research (Dehn, 1981; Lebowitz, 1985; Meehan, 1977). It has been examined as a commonsense reasoning task, by which systems make inferences about events that prototypically occur in common experiences (e.g. going to a restaurant) (Schank and Abelson, 1977). Early work often failed to scale beyond narrow domains of stories due to the difficulty of automatically inducing story knowledge. The shift to data-driven AI established new opportunities to acquire this knowledge automatically from story corpora. The field of NLP now recognizes that the type of commonsense reasoning used to predict what happens next in a story, for example, is as important for natural language understanding systems as linguistic knowledge itself.

A barrier to this research has been the lack of standard evaluation schemes for benchmarking progress. The Story Cloze Test (Mostafazadeh et al., 2016) was recently developed to address this, with a focus on predicting events that are temporally and causally related within common real-world scenarios. The Story Cloze Test involves selecting which of two given sentences best completes a particular story. Related to this is the Choice of Plausible Alternatives (COPA) task (Roemmele et al., 2011), which uses the same binary-choice format to elicit a prediction for either the cause or effect of a given story event. While the Story Cloze Test involves predicting the ending of a story, COPA items focus specifically on commonsense knowledge related to identifying causal relations between sequences.

The competitive approaches to narrative prediction evaluated by the Story Cloze Test largely involve neural networks trained to distinguish between correct and incorrect endings of stories (Cai et al., 2017, e.g.). A neural network approach has yet to be applied to the related COPA task. In the current paper, we initiate this investigation into these models for COPA. In particular, we evaluate an encoder-decoder model that predicts the probability that a particular sequence follows another in a story. Our experiments explore a few different variables for configuring this approach. First, we examine how to extract temporally related sequence pairs provided as input to the model. Second, we vary the use of feed-forward versus recurrent layers within the model. Third, we assess different vector-based representations of the sequence pairs. Finally, we compare our model using different narrative corpora for training, including the ROCStories corpus, which was developed in conjunction with the Story Cloze Test. Our results are presented in comparison to existing systems applied to COPA, which involve lexical co-occurrence statistics gathered from web corpora. Our best-performing model achieves an accuracy of 66.2% on the COPA test set, which falls short of the current state-of-the-art of 71.2% (Sasaki et al., 2017).


Interestingly, this best result utilizes the ROCStories for training, which is only a small fraction of the size of the datasets used in existing approaches. Applying our model to these larger datasets actually yields significantly worse performance, suggesting that the model is sensitive to the density of commonsense knowledge contained in its training set. We conclude that this density is far more influential to COPA performance than just data quantity, and further success on the task will depend on methods for isolating implicit commonsense knowledge in text.

2 Choice of Plausible Alternatives

The Choice of Plausible Alternatives (COPA) is composed of 1,000 items, where each item contains three sentences, a premise and two alternatives, as well as a prompt specifying the relation between them. The items are divided equally into development and test sets of 500 items each. The goal is to select which alternative conveys the more plausible cause (or effect, based on the prompt) of the premise sentence. Half of the prompts elicit the more plausible effect of the premise event, while the other half ask for the more plausible cause of the premise.

1. Premise: The homeowners disliked their nosy neighbors. What happened as a result?
   Alternative 1:* They built a fence around their property.
   Alternative 2: They hosted a barbecue in their backyard.

2. Premise: The man fell unconscious. What was the cause of this?
   Alternative 1:* The assailant struck the man in the head.
   Alternative 2: The assailant took the man's wallet.

Above are examples of COPA items, where the designated correct alternative for each is starred. In a given item, both alternatives refer to events that could be found within the same story, but the correct one conveys a more coherent causal relation. All sentences consist of a single clause with a past tense verb. COPA items were written by a single author and then validated by other annotators to ensure human accuracy approximated 100%. See Roemmele et al. (2011) for further details about the authoring and validation process.

3 Existing Approaches

Roemmele et al. (2011) presented a baseline approach to COPA that focused on lexical co-occurrence statistics gathered from story corpora. The general idea is that a causal relation between two story events can be modeled by the proximity of the words that express the events. This approach uses the Pointwise Mutual Information (PMI) statistic (Church and Hanks, 1990) to compute the number of times two words co-occur within the same context (i.e. within a certain number N of words of each other in a story) relative to their overall frequency. This co-occurrence measure is order-sensitive such that the first word in the pair is designated as the cause word and the second as the effect word, based on the assumption that cause events are more often described before their effects in stories, relative to the reverse. To calculate an overall causality score for two sequences S1 and S2, each cause word c in S1 is paired with each effect word e in S2, and the PMI scores of all word pairs are averaged:

$$\frac{\sum_{c \in S_1} \sum_{e \in S_2} \mathrm{PMI}(c, e)}{|S_1| \cdot |S_2|}.$$

For a given COPA item, the predicted alternative is the one that has the higher causality score with regard to the premise. Since the scores are asymmetric in assuming S1 is the cause of S2, COPA items that elicit the more plausible effect (i.e. items where "What happened as a result?" is the prompt) assign the premise and alternative to S1 and S2 respectively, whereas this assignment is reversed for items where "What was the cause of this?" is the prompt. Gordon et al. (2011) applied this approach to PMI scores taken from a corpus of one million stories extracted from personal weblogs, which were largely non-fiction stories about daily life events written from the first-person perspective. A co-occurrence between two words was counted when they appeared within 25 words of one another in the same story. This resulted in 65.2% accuracy on the COPA test set.
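A minimal sketch of this scoring procedure, assuming co-occurrence counts have already been collected from a story corpus with the cause word preceding the effect word within a 25-word window; count collection, smoothing, and tokenization details are simplified relative to Gordon et al. (2011).

```python
import math

def pmi(c, e, pair_count, word_count, n_pairs, n_words):
    """PMI of an ordered (cause, effect) word pair; 0 if never co-occurring."""
    joint = pair_count.get((c, e), 0) / n_pairs
    if joint == 0:
        return 0.0
    return math.log(joint / ((word_count[c] / n_words) * (word_count[e] / n_words)))

def causality_score(s1_words, s2_words, pair_count, word_count, n_pairs, n_words):
    """Average PMI over all word pairs, treating S1 as cause and S2 as effect."""
    total = sum(pmi(c, e, pair_count, word_count, n_pairs, n_words)
                for c in s1_words for e in s2_words)
    return total / (len(s1_words) * len(s2_words))

def answer_copa(premise, alt1, alt2, asks_effect, **counts):
    """Assign premise/alternative to the cause/effect slots based on the prompt."""
    def score(alt):
        s1, s2 = (premise, alt) if asks_effect else (alt, premise)
        return causality_score(s1.lower().split(), s2.lower().split(), **counts)
    return 1 if score(alt1) >= score(alt2) else 2
```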

The PMI approach assumes a causal relation between events can be captured to some degree by their temporal co-occurrence in a story. Luo et al. (2016) introduced a variation that alternatively focuses on explicit mentions of causality in text, referred to as the CausalNet approach. They extracted sequences matching lexical templates that signify causality, e.g. S1 LEADS TO S2, S2 RESULTS FROM S1, S2 DUE TO S1, where again S1 is the cause event and S2 is the effect. As before, a co-occurrence is counted between each pair of words (c, e) in S1 and S2, respectively.


They propose a measure of causal strength that adapts the PMI statistic to model both necessary causality and sufficient causality for a given pair (c, e). In the measure for necessary causality, a discounting factor is applied to the overall frequency of c in the PMI statistic, which models the degree to which c must appear in order for e to appear. Alternatively, the measure for sufficient causality applies the discounting factor to the overall frequency of e, modeling the degree to which c alone will result in the occurrence of e. The necessary and sufficient PMI scores for a given word pair are combined into a single causal strength score. Akin to the previous approach, the overall causality score for two sequences is given by averaging the scores for their word pairs. See Luo et al. for further technical details about the causal strength measure.

Luo et al. applied this approach to extract causal pairs from a corpus of approximately 1.6 billion web pages. They achieved 70.2% accuracy on the COPA test set, significantly outperforming the result from Gordon et al. (2011). Sasaki et al. (2017) evaluated the same CausalNet approach on a smaller corpus of web documents, ClueWeb1, which contains 700 million pages. They discovered that treating multi-word phrases as discrete words in the pairs boosted accuracy to 71.2%. Both results indicate that causal knowledge can be extracted from large web data as an alternative to story corpora. Rather than assuming that causality is implicitly conveyed by temporally related sequences, they relied on explicit mentions of causality to filter data relevant to COPA. Still, a lot of causal knowledge in stories is not highlighted by specific lexical items. Consider the sequence "John starts a pot of coffee because he is sleepy", for example. This sequence would be extracted by the CausalNet approach since it contains one of the designated lexical markers of causality ("because"). However, the sequence "John is sleepy. He starts a pot of coffee" expresses the same causal relation but would not be captured, and we know by people's ability to answer COPA questions that they can infer this relation. Using a large web corpus can possibly compensate for missing these instances, since the same causal relations may be conveyed by sequences that contain explicit mentions of causality. However, it still means that a lot of causal information is potentially being overlooked.

1 lemurproject.org/clueweb12/


4 Neural Network Approach

As mentioned in the introduction, our work initiates the exploration of neural approaches to COPA. We focus here on an encoder-decoder architecture. Originally applied to machine translation (Cho et al., 2014), encoder-decoder models have been extended to other sequence modeling tasks like dialogue generation (Serban et al., 2016; Shang et al., 2015) and poetry generation (Ghazvininejad et al., 2016; Wang et al., 2016). We propose that this technique could be similarly useful for our task in establishing a mapping between cause-effect sequence pairs. This direct modeling of co-occurrence between sequences is distinct from the previous work, which relies on co-occurrence between pairs of individual words.

4.1 Sequence Segmentation

The inputs and outputs for the encoder-decoder model are each word sequences. Given a corpus of stories as the training set for a model, we first segmented each story by clausal boundaries. This was done heuristically by analyzing the dependency parse of each sentence. Words whose dependency label was an adverbial clause modifier (ADVCL; e.g. "After I got home, I got a text from her."), conjunct (CONJ; "I dropped the glass and the glass broke."), or prepositional complement (PCOMP; "He took me to the hospital to seek treatment.") were detected as the heads of clauses distinct from the main clause. All contiguous words dependent on the same head word were segmented as a separate clause. These particular labels do not capture all clausal boundaries (for example, relative clauses are not detected), but they are intended to distinguish sequences that may refer to separate narrative events (e.g. "I dropped the glass" is segmented from "and the glass broke"). This is somewhat analogous to the segmentation performed by Luo et al. (2016) that splits cause and effect clauses according to lexical templates. The difference is that the parsing labels we use for segmentation do not explicitly indicate boundaries between causally related events. We did not perform an intrinsic evaluation of this procedure in terms of how often it correctly segmented narrative events. Instead, we evaluated its impact on COPA prediction by comparing it to traditional segmentation based on sentence boundaries for the same model, as conveyed in Section 5.3.
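A rough sketch of this segmentation heuristic using spaCy, whose label inventory (advcl, conj, pcomp) matches the labels named above; the exact parser and grouping logic the authors used are not specified, so treat this as an approximation.

import spacy

nlp = spacy.load("en_core_web_sm")
CLAUSE_HEADS = {"advcl", "conj", "pcomp"}

def segment_clauses(sentence):
    # Split a sentence into clause-like segments anchored on ADVCL/CONJ/PCOMP heads.
    doc = nlp(sentence)
    heads = {tok for tok in doc if tok.dep_ in CLAUSE_HEADS}

    def clause_root(tok):
        # Walk up the dependency tree to the nearest clause head (or the sentence root).
        node = tok
        while node.head is not node:
            if node in heads:
                return node
            node = node.head
        return node

    segments = {}
    for tok in doc:
        segments.setdefault(clause_root(tok), []).append(tok)
    ordered = sorted(segments.items(), key=lambda kv: kv[0].i)
    return [" ".join(t.text for t in sorted(toks, key=lambda t: t.i))
            for _, toks in ordered]

print(segment_clauses("I dropped the glass and the glass broke."))
# e.g. ['I dropped the glass and', 'the glass broke .']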


Figure 1: FFN (left) and RNN (right) encoder-decoder models


4.2 Sequence Pairs

After segmenting the stories, we joined neighboring segments (i.e. clauses or sentences) into input-output segment pairs (S1, S2). In all of our experiments, we filtered pairs where one or both of the segments contained more than 20 words. We manipulated the temporal window within which these pairs were joined, by pairing all segments within N segments of each other. For a given segment at position t in a story, pairs were established between all segments in segment_t, . . . , segment_{t+N}. For example, when N=1, a pair was formed with the next segment only (segment_t, segment_{t+1}); when N=2, pairs were formed between (segment_t, segment_{t+1}) and (segment_t, segment_{t+2}). By doing this, we intended to examine the proximity of causal information in a story according to its impact on COPA prediction; we expected that more adjacent clauses would contain stronger causal relations than more distant clauses. Gordon et al. (2011) analogously evaluated this by varying the number of words within which PMI pairs were formed, but without regard to sentence or clause boundaries.
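A small sketch of this pairing step, assuming each story has already been segmented into an ordered list of clause (or sentence) strings; the 20-word filter mirrors the one described above.

def make_pairs(segments, window=4, max_len=20):
    # Pair each segment with every segment up to `window` positions ahead of it.
    pairs = []
    for t, s1 in enumerate(segments):
        for offset in range(1, window + 1):
            if t + offset >= len(segments):
                break
            s2 = segments[t + offset]
            if len(s1.split()) <= max_len and len(s2.split()) <= max_len:
                pairs.append((s1, s2))  # s1 treated as cause, s2 as effect
    return pairs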

4.3 Encoder-decoder Models

We examined two types of encoder-decoder models: one with feed-forward (FFN) layers and one with recurrent (RNN) layers (i.e. a sequence-to-sequence model), both shown in Figure 1. In both cases, the model implicitly assumes that the input segment S1 represents the cause of the output segment S2, so the model learns to predict that S2 will appear as the effect of S1. The theoretical motivation for comparing the FFN and RNN is to determine the importance of word order for this task.

The existing COPA approaches only accounted for word order to the extent of capturing word pairs within the same context of N words (though Sasaki et al. (2017) also accounted for multi-word expressions). The FFN encoder-decoder ignores word order. The model is very simple: both the input and output segments are collapsed into flat n-dimensional vectors of word counts (i.e. bag-of-words), so the hidden (encoder) layer observes all words in each segment in parallel. On the output (decoder) layer (which has sigmoid activation like the encoder), the FFN computes a score for each word indicating its probability of appearing anywhere in the output segment.
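A minimal PyTorch sketch of this bag-of-words FFN encoder-decoder; the 500-dimensional hidden layer follows Section 5.2, while the remaining details (initialization, optimizer wiring) are assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

class FFNEncoderDecoder(nn.Module):
    # Bag-of-words in, per-word appearance probabilities out; word order is ignored.
    def __init__(self, vocab_size, hidden_size=500):
        super().__init__()
        self.encoder = nn.Linear(vocab_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, bow_input):
        # Sigmoid activations on both the encoder and decoder layers, as described above.
        hidden = torch.sigmoid(self.encoder(bow_input))
        return torch.sigmoid(self.decoder(hidden))  # P(word appears in the output segment)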

In contrast, the RNN captures word order in the segments. In particular, it uses a recurrent (encoder) layer with Gated Recurrent Units (GRUs) (Cho et al., 2014) to iteratively encode the input sequence, and another recurrent (decoder) layer to represent the output segment. The final hidden state of the encoder layer after observing the whole input is provided as the initial hidden state to the decoder. The decoder then iteratively computes a representation of the output sequence that is conditioned upon the input segment. For each timepoint in this decoder layer, a softmax layer is applied to predict a probability distribution over each word being observed in the segment at that particular timepoint. Both the FFN and RNN encoder-decoders are trained using the cross-entropy loss function to maximize the output word probabilities observed during training.
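A corresponding sketch of the recurrent variant as a GRU encoder-decoder; the 300-dimensional embeddings and 500-dimensional hidden state follow Section 5.2, and the rest is an illustrative simplification (teacher-forced decoding is assumed).

import torch
import torch.nn as nn

class GRUEncoderDecoder(nn.Module):
    # Sequence-to-sequence model: the final encoder state initializes the decoder.
    def __init__(self, vocab_size, embed_size=300, hidden_size=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.encoder = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.embed(src_ids))           # final encoder hidden state
        dec_out, _ = self.decoder(self.embed(tgt_ids), h)  # conditioned on the input segment
        return self.out(dec_out)                           # per-timepoint logits

def seq_loss(logits, tgt_ids):
    # Cross-entropy over every target position (softmax is applied inside the loss).
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt_ids.reshape(-1))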

Once trained, a model predicts the likelihood that a cause sequence S1 given as input results in an effect sequence S2, based on the mean probability of the words in S2 computed by the model. When applied to COPA, consistent with the methodology described in Section 3, in items where the prompt elicits the alternative that conveys the effect of the premise, the premise is designated as S1 and the alternative as S2. In contrast, when the item elicits the alternative describing the most likely cause of the premise, an alternative is assigned to S1 and the premise to S2. In considering the two alternatives in a COPA item, the one contained in the (S1, S2) pair that obtains the highest score is predicted as more plausible.
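In code, the decision rule reads as follows; score(model, cause, effect) is a placeholder for the mean output-word probability computed by either model above.

def predict_copa(model, score, premise, alt1, alt2, prompt):
    # Return 1 or 2 for the alternative judged more plausible.
    if prompt == "What happened as a result?":
        s1, s2 = score(model, premise, alt1), score(model, premise, alt2)  # premise is the cause
    else:  # "What was the cause of this?"
        s1, s2 = score(model, alt1, premise), score(model, alt2, premise)  # alternative is the cause
    return 1 if s1 >= s2 else 2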



5 Initial Experiments

5.1 ROCStories Corpus

The PMI and CausalNet approaches to COPA made use of large web corpora. Gordon et al. (2011) proposed that stories are a rich source for the commonsense knowledge needed to answer COPA questions. Mostafazadeh et al. (2016) followed this proposal by releasing the ROCStories corpus2, intended to be applied to commonsense reasoning tasks. The ROCStories corpus has yet to be utilized for COPA prediction. This dataset consists of 97,027 five-sentence narratives authored via crowdsourcing. In contrast to weblog stories, these stories were written with the specific objective to minimize discourse complexity and explicate prototypical causal and temporal relations between events in salient everyday scenarios. COPA items also target these latent commonsense relations, so the ROCStories appear to be particularly suitable for this domain. Table 1 shows some examples of stories in this corpus and corresponding COPA items that address the same causal knowledge. The ROCStories corpus is dramatically smaller than the datasets used in the work described in Section 3.

5.2 Procedure

We applied the methodology outlined in Section 4 to pairs of sequences from the ROCStories corpus. Our first set of experiments varied segmentation (clause versus sentence boundaries), distance between segments (N=1 to N=4), and the type of encoder-decoder (FFN or RNN). Note that N=4 is the maximum setting when using sentence boundaries since there are five sentences in each story, so here pairs will be formed between all sentences. For all experiments, we filtered grammatical words (i.e. all words except for adjectives, adverbs, nouns, and verbs) and lemmatized all segments, consistent with Luo et al. (2016). COPA items intentionally do not contain proper nouns, so we excluded them as well. We assembled a model lexicon that included each word occurring at least five times in the data, which totaled 9,299 words in the ROCStories. All other words were mapped to a generic <UNKNOWN> token.

2 cs.rochester.edu/nlp/rocstories/


The hidden layers of the FFN and RNN models each consisted of 500 dimensions. The RNN had an additional word embedding layer of 300 nodes in order to transform discrete word indices in the input segments into distributed vectors. They were both trained for 50 epochs using the Adam optimizer (Kingma and Ba, 2015) with a batch size of 100 pairs. After each epoch, we evaluated the model on the COPA development set and saved the weights that obtained the highest accuracy.

5.3 Results

Table 2 shows the results of these different configurations in terms of COPA accuracy. We include the results on the development set as a reference because they tended to vary from the test results. Most notably, the FFN outperformed the RNN universally, suggesting that the order of words in the segments did not provide a strong signal for prediction beyond the presence of the words themselves. Among the FFN results, the model trained on clauses with N=4 obtained the highest accuracy on the development set (66.0%), and was tied for the highest test accuracy with the model trained on clauses with N=3 (66.2%). The model with N=4 was trained on three times as many pairs as the model with N=1. We can conclude that some of these additional pairs pertained to causality, despite not appearing adjacently in the story. The impact of clause versus sentence segmentation is less clear from these results, given that the best result of 66.2% accuracy using clauses is only trivially better than the corresponding result for sentences (66.0% for N=4).

5.4 Other Findings

5.4.1 Alternative Input Representations

In the FFN model evaluated above, the input segments were simply represented as bag-of-words vectors indicating the count of each word in the segment. Alternatively, we explored the use of pretrained word embeddings to represent the segments. We proposed that because they provide the model with some initial signal about lexical relations, the embeddings could facilitate more specifically learning causal relations.


ROCStories instance: Susie went away to Nantucket. She wanted to relax. When she got there it was amazing. The waves were so relaxing. Susie never wanted to leave.
COPA item: Premise: The man went away for the weekend. What was the cause of this? Alt 1*: He wanted to relax. Alt 2: He felt content.

ROCStories instance: Albert wanted to enter the spelling bee, but he was a bad speller. He practiced every day for the upcoming contest. When Albert felt that he was ready, he entered the spelling bee. In the very last round, Albert failed when he misspelled a word. Albert was very proud of himself for winning the second place trophy.
COPA item: Premise: The girl received a trophy. What was the cause of this? Alt 1*: She won a spelling bee. Alt 2: She made a new friend.

ROCStories instance: Anna was lonely. One day, Anna went to the grocery store. Outside the store, she met a woman who was giving away kittens. Anna decided to adopt one of those kittens. Anna no longer felt lonely with her new pet.
COPA item: Premise: The woman felt lonely. What happened as a result? Alt 1: She renovated her kitchen. Alt 2*: She adopted a cat.

ROCStories instance: April is fascinated by health and medicine. She decided to become a doctor. She studied very hard in college and medical school. April graduated at the top of her medical school class. April now works in a hospital as a doctor.
COPA item: Premise: The woman wanted to be a doctor. What happened as a result? Alt 1: She visited the hospital. Alt 2*: She went to medical school.

Table 1: Examples of stories in the ROCStories corpus and similar COPA items

Segment    N    # Pairs       FFN Dev   FFN Test   RNN Dev   RNN Test
Sentence   1    389,680       64.8      64.4       63.4      54.4
Sentence   2    682,334       65.2      65.4       61.2      57.6
Sentence   3    877,963       63.8      63.8       60.2      55.4
Sentence   4    976,568       63.8      66.0       59.4      55.6
Clause     1    539,342       64.2      63.6       59.4      56.8
Clause     2    981,677       65.2      65.0       59.2      54.6
Clause     3    1,327,010     65.4      66.2       63.4      58.0
Clause     4    1,575,340     66.0      66.2       61.2      56.6

Table 2: Accuracy by segmentation unit and pair distance (N) for the FFN and RNN encoder-decoders trained on ROCStories


Model              Dev    Test
FFN (above)        66.0   66.2
FFN GloVe          65.0   61.6
FFN ConceptNet     61.6   62.4
FFN Skip-thought   66.8   63.8

Table 3: Accuracy of FFN trained on ROCStories with different input representations

We experimented with three sets of embedding representations. First, we encoded the words in each input segment as the sum of their GloVe embeddings3 (Pennington et al., 2014), which represent words according to a global log-bilinear regression model trained on word co-occurrence counts in the Common Crawl corpus. We also did this using ConceptNet embeddings4 (Li et al., 2013), which apply the word2vec skip-gram model (Mikolov et al., 2013) to tuples that specifically define commonsense knowledge relations (e.g. soak in hotspring CAUSES get pruny skin). Lastly, we used skip-thought vectors5 (Kiros et al., 2015), which compute one embedding representation for an entire sentence, and thus represent the sentence beyond just the sum of its individual words. Analogous to how word embedding models are trained to predict words near a given target word in a text, the skip-thought vectors represent sentences according to their relation to adjacent sentences, such that sentences with similar meanings are expected to have similar vectors. The provided skip-thought vectors are trained on the BookCorpus dataset, which is described in Section 6.
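A sketch of the GloVe variant of this input representation, assuming the pretrained vectors have been loaded from the standard whitespace-separated GloVe text files; the ConceptNet and skip-thought variants substitute their own vectors behind the same interface.

import numpy as np

def load_glove(path, dim=300):
    # Read pretrained vectors from a whitespace-separated GloVe text file.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def segment_vector(words, vectors, dim=300):
    # Encode an input segment as the sum of its word embeddings.
    vec = np.zeros(dim, dtype=np.float32)
    for w in words:
        if w in vectors:
            vec += vectors[w]
    return vec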

We trained the FFN model on the ROCStories with each of these three sets of embeddings. Because they obtained the best performance in the previous experiments, we configured the models to use clause segmentation and distance N=4 in constructing the pairs. Table 3 shows the results of these models, compared alongside the best result from above with the standard bag-of-words representation. Neither the GloVe nor ConceptNet embeddings performed better than the bag-of-words vectors (61.6% and 62.4% test accuracy, respectively). The skip-thought vectors performed better than the bag-of-words representation on the development set (66.8%), but this improvement did not scale to the test set (63.8%).

3 nlp.stanford.edu/projects/glove/
4 ttic.uchicago.edu/~kgimpel/commonsense.html
5 github.com/ryankiros/skip-thoughts


5.4.2 Phrases

Model         Dev    Test
FFN (above)   66.0   66.2
FFN Phrases   62.6   64.8

Table 4: Accuracy of FFN trained on ROCStories with explicit phrase representations

As mentioned above, Sasaki et al. (2017) found that modeling multi-word phrases as individual words was helpful for the CausalNet approach. The RNN encoder-decoder has the opportunity to recognize phrases by modeling sequential dependencies between words, but Table 2 indicated this model was not successful relative to the FFN model. To assess whether the FFN model would benefit from phrase information, we merged all phrases in the training corpus into individual word tokens in the same manner as Sasaki et al., using their same list of phrases. We again filtered all tokens that occurred fewer than five times in the data, which resulted in the vocabulary increasing from 9,299 words to 10,694 when the phrases were included. We trained the same FFN model in Table 2 that achieved the best result (clause segmentation, N=4, and bag-of-words input representation). The test accuracy, relayed for clarity in Table 4 alongside the above best result, was 64.8%, indicating there was no benefit to modeling phrases in this particular configuration.

5.4.3 Comparison with Existing Approaches

Model         Dev    Test
FFN (above)   66.0   66.2
PMI           60.0   62.4
CausalNet     50.2   51.8

Table 5: Accuracy of PMI and CausalNet trained on ROCStories

To establish a comparison between our encoder-decoder approach and the existing models applied to the same dataset, we trained the PMI model on the ROCStories. Rather than using a fixed word window, we computed the PMI scores for all words in each story, which generally corresponds to using distance N=4 among sentence segments in the encoder-decoder. Table 5 shows that this approach had 62.4% test accuracy, so our new approach outperformed it on this particular dataset.


For completeness, we also applied the CausalNet approach to this dataset. Its poor performance (51.8%) is unsurprising, because the lexical templates used to extract causal pairs only matched 4,964 sequences in the ROCStories. This demonstrates that most of the causal information contained in these stories is conveyed implicitly.

6 Experiments on Other Datasets

Gordon et al. (2011) found that the PMI approach trained on blog stories performed better on COPA than the same model trained on books in Project Gutenberg6, despite the much larger size of the latter. Beyond this, there has been limited exploration of the impact of different training datasets on COPA prediction, so we were motivated to examine this. Thus, we applied the FFN encoder-decoder approach to the following datasets:

Visual Storytelling (VIST): 50,200 five-sentence stories7 authored through crowdsourcing in support of research on vision-to-language tasks (Huang et al., 2016). Participants were prompted to write a story from a sequence of photographs depicting salient "storyable" events.

CNN/DailyMail corpus: 312,085 bullet-item summaries8 of news articles, which have been used for work on reading comprehension and summarization (Chen et al., 2016; See et al., 2017).

CMU Book/Movie Plot Summaries (CMU Plots): 58,862 plot summaries9 from Wikipedia, which have been used for story modeling tasks like inferring relations between story characters (Bamman et al., 2014; Srivastava et al., 2016).

BookCorpus: 8,032 self-published fiction novels, a subset of the full corpus10 of 11,000 books.

Blog Stories: 1 million weblog stories used in the COPA experiments by Gordon et al. (2011) identified above.

ClueWeb Pairs: Approximately 150 million sequence pairs extracted from the ClueWeb corpus by Sasaki et al. (2017) using the CausalNet lexical templates method.

6.1 Procedure and Results

We trained the FFN model with the best-performing configuration from the ROCStories experiments (clause segments, N=4, bag-of-words input). After determining that the lexicon used in the previous experiments included most of the words (93.5%) in the COPA development set, we re-used this same lexicon to avoid the inefficiency of assembling a new one for each separate corpus. We also trained a model on the initial 45,502 stories in the ROCStories (ROCStories-Half) to further analyze the impact of this dataset.

6 gutenberg.org/
7 visionandlanguage.net/VIST/
8 github.com/danqi/rc-cnn-dailymail
9 cs.cmu.edu/~ark/personas/; cs.cmu.edu/~dbamman/booksummaries.html
10 yknzhu.wixsite.com/mbweb

Dataset           # Pairs        Dev    Test
ROCStories-Half   762,130        64.0   62.6
VIST              854,810        58.2   49.2
ROCStories-Full   1,575,340      66.0   66.2
CNN/DailyMail     3,255,010      59.4   51.8
CMU Plots         6,094,619      57.8   51.0
ClueWeb Pairs     157,426,812    60.8   61.2
Blog Stories      222,564,571    58.4   57.2
BookCorpus        310,001,015    58.2   55.0

Table 6: Accuracy of the FFN encoder-decoder on different datasets


Table 6 shows the results for these datasets compared alongside the ROCStories result from above (ROCStories-Full), listed in ascending order of the number of training pairs they contain. As shown, none of the other datasets reach the level of accuracy of ROCStories-Full (66.2%). Even the model trained on only the initial half of this corpus outperforms the others (62.6%). The next closest result is for the ClueWeb Pairs, which had 61.2% test accuracy despite containing 100 times more pairs than the ROCStories. The larger Blog Stories and BookCorpus datasets did not have much impact, even though the Blog Stories obtained 65.2% accuracy in the PMI approach. One speculative explanation for this is that our approach is highly dependent on the density of COPA-relevant knowledge contained in a dataset. As mentioned above, authors of the ROCStories were instructed to emphasize the most obvious possibilities for 'what happens next' in prototypical scenarios. These expectations align with the correct COPA alternatives. However, naturally occurring stories often focus on events that violate commonsense expectations, since these events make for more salient stories (Schank and Abelson, 1995). Thus, they may show greater diversity in 'what happens next' relative to the ROCStories. This diversity was seemingly more distracting for our encoder-decoder architecture than for the existing approaches. Accordingly, despite all being related to narrative, the VIST, CNN/DailyMail, and CMU Plots datasets were also ineffective on the test set with regard to this model.



7 Conclusion

In summary, we pursued a neural encoder-decoder approach for predicting causally related events in the COPA framework. To our knowledge this is the first work to evaluate a neural-based model for this task. Our best result obtained 66.2% accuracy. This is lower than the current state of the art of 71.2%, but our experiments motivate some opportunities for future work. We demonstrated the usefulness of the ROCStories for this task, as our model appeared to benefit from its density of commonsense knowledge. The gap between 66.2% and 71.2% is not dramatic in light of the massive size advantage of the data used to obtain the latter result. However, the ROCStories corpus is a crowdsourced dataset and thus will not grow naturally over time like web data, so it may not be practical to rely exclusively on this type of specially authored resource either. The CausalNet approach proposed a useful way to isolate commonsense knowledge in generic text by relying on causal cues, but because many causal relations are not marked by specific lexical items, it still overlooks a lot that is relevant to COPA. On the other hand, not all temporally related events in a story are causally related. Because we did not make this distinction, some of the pairs we modeled were likely not indicative of causality and thus may not have contributed accurately to COPA prediction. Research on automatically detecting more latent linguistic features specifically associated with the expression of causal knowledge in text would likely have a large impact on this endeavor.

Acknowledgments

We would like to thank the authors of Sasaki et al. (2017) for sharing the data and resources associated with their work.

The projects or efforts depicted were or are sponsored by the U.S. Army. The content or information presented does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

David Bamman, Brendan O'Connor, and Noah A. Smith. 2014. Learning latent personas of film characters. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), page 352.

Zheng Cai, Lifu Tu, and Kevin Gimpel. 2017. Pay attention to the ending: Strong neural baselines for the ROC Story Cloze Task. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 616-622.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL).

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Syntax, Semantics and Structure in Statistical Translation, page 103.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics 16(1):22-29.

Natalie Dehn. 1981. Story generation after TALE-SPIN. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI '81), pages 16-18.

Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight. 2016. Generating topical poetry. In EMNLP, pages 1183-1191.

Andrew Gordon, Cosmin Adrian Bejan, and Kenji Sagae. 2011. Commonsense causal reasoning using millions of personal stories. In Twenty-Fifth Conference on Artificial Intelligence (AAAI-11), pages 1180-1185.

Ting-Hao K. Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Aishwarya Agrawal, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016).

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In The International Conference on Learning Representations (ICLR), San Diego.

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28, pages 3294-3302. Curran Associates, Inc.

Michael Lebowitz. 1985. Story-telling as planning and learning. Poetics 14(6):483-502.

Boyang Li, Stephen Lee-Urban, George Johnston, and Mark Riedl. 2013. Story generation with crowdsourced plot graphs. In 27th AAAI Conference on Artificial Intelligence.

Zhiyi Luo, Yuchen Sha, Kenny Q. Zhu, Seung-won Hwang, and Zhongyuan Wang. 2016. Commonsense causal reasoning between short texts. In 15th International Conference on Principles of Knowledge Representation and Reasoning (KR-2016).

James R. Meehan. 1977. TALE-SPIN, an interactive program that writes stories. In 5th International Joint Conference on Artificial Intelligence, pages 91-98.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of NAACL-HLT, pages 839-849.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of Plausible Alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, pages 90-95.

Shota Sasaki, Sho Takase, Naoya Inoue, Naoaki Okazaki, and Kentaro Inui. 2017. Handling multiword expressions in causality estimation. In IWCS 2017, 12th International Conference on Computational Semantics, Short Papers.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals, and Understanding: An Inquiry into Human Knowledge Structures. Psychology Press.

Roger C. Schank and Robert P. Abelson. 1995. Knowledge and memory: The real story. In Robert S. Wyer (Ed.), Knowledge and Memory: The Real Story, pages 1-85.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Association for Computational Linguistics (ACL).

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, pages 3776-3784.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP'15), pages 1577-1586.

Shashank Srivastava, Snigdha Chaturvedi, and Tom M. Mitchell. 2016. Inferring interpersonal relations in narrative summaries. In AAAI, pages 2807-2813.

Qixin Wang, Tianyi Luo, Dong Wang, and Chao Xing. 2016. Chinese song iambics generation with neural attention-based model. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), pages 2943-2949.


Neural Events Extraction from Movie Descriptions

Alex Tozzo, Dejan Jovanovic and Mohamed R. Amer
SRI International
201 Washington Rd, Princeton, NJ 08540, USA

{dejan.jovanovic, mohamed.amer}@sri.com

Abstract

We present a novel approach for event extraction and abstraction from movie descriptions. Our event frame consists of "who", "did what", "to whom", "where", and "when". We formulate our problem using a recurrent neural network, enhanced with structural features extracted from a syntactic parser, and trained using curriculum learning by progressively increasing the difficulty of the sentences. Our model serves as an intermediate step towards question answering systems, visual storytelling, and story completion tasks. We evaluate our approach on the MovieQA dataset.

1 Introduction

Understanding events is important to understanding a narrative. Event complexity varies from one story to another, and the ability to extract and abstract events is essential for multiple applications. For question answering systems, a question narrows the scope of events to examine for an answer. For storytelling, events build an image in the reader's mind and construct a storyline.

For event extraction, we apply Natural Language Processing (NLP) techniques to construct an event frame consisting of: "who", "did what", "to whom", "where", and "when". The more complex questions of "how" and "why" require significantly more reasoning and are beyond this paper's scope. Most syntactic NLP parsers, such as CoreNLP (Manning et al., 2014) and NLTK (Bird et al., 2009), focused on examining characteristics of the words, grammatical structure, word order, and meaning (Chomsky, 1957). On the other hand, neural NLP approaches, such as SLING (Ringgaard et al., 2017), rely on large corpora to train such models in addition to multiple knowledge databases. These approaches perform event extraction without the context (often visual) of a movie or a movie script. When performing event extraction in relation to events in a story, the context can be gleaned from descriptions of the set or characters or prior events in a sequence. Additionally, we intend to develop an event extraction framework for a mixed-initiative, human-computer system and intend to generate a human-readable event structure for user interaction.

In departure from syntactic approaches, we propose a hybrid, neural and symbolic, approach to address this problem. We benefit from both neural and symbolic formulations to extract events from movie text. Neural networks have been successfully applied to NLP problems, specifically sequence-to-sequence (or sequence-to-vector) models (Sutskever et al., 2014) applied to machine translation, and word-to-vector models (Mikolov et al., 2013a). Here, we combine those approaches with supplemental structural information, specifically sentence length. Our approach models local information and global sentence structure.

For our training paradigm, we explored curriculum learning (Bengio et al., 2009). To the best of our knowledge, we are the first to apply it to event extraction. Curriculum learning proposes that a model can learn to perform better on a difficult task if it is first presented with easier training examples. Generally, in prior curriculum learning work, the final model attains a higher performance than if it were trained on the most difficult task from the start. In this work, we base the curriculum on sentence length, reasoning that shorter sentences have a simpler structure. Other difficulty metrics such as average word length, longest word length, and Flesch-Kincaid readability score were not considered in this experiment, but may be considered for future work.

Instead of treating the sentence-to-event problem as a complete black box that puts the burden on the deep learning model, we simplify the problem by adding structure to the output sentence following the event frame structure, where some of the components could be present or absent. Furthermore, some sentences could contain multiple events. Weak labels were extracted from each sentence using the predicate as an anchor. We use structure rather than a bag-of-words because it encodes information about the relationships between words.



Our contributions are three-fold:
• A new formulation for event extraction in movie descriptions.
• A curriculum learning framework for difficulty-based learning.
• Benchmarking symbolic and neural approaches on the MovieQA dataset.

The paper is organized as follows: Section 2 reviews prior work; Section 3 formulates our approach; Section 4 specifies the learning framework; Section 5 presents our experiments; Section 6 describes our future work and conclusion.

2 Prior Work

Event extraction is a well-established research problem in NLP. Parsers have been developed to extract events and event structures using a variety of methods, both supervised and unsupervised.

(McClosky et al., 2011) uses dependency parsing to extract events from sentences (converted to a dependency tree by a separate classifier) by identifying event anchors in a sentence and graphing relationships to its arguments.

(Chambers and Jurafsky, 2008) and (Chambers and Jurafsky, 2009) develop an unsupervised method to learn narrative relations between events that share a co-reference argument and, later, a sequence of events over multiple sentences.

(Martin et al., 2017) and (Martin et al., 2018) present a neural technique for generating a mid-level event abstraction that retains the semantic information of the original sentence while minimizing event sparsity. They formulate the problem as, first, the generation of successive events (event2event in their parlance), then generating natural language from events (event2sentence). The authors use a 4-tuple event representation with subject, verb, object, and a modifier of the sentence including prepositional phrase, indirect object, or causal complement. One key concept is that these events are in generalized WordNet forms and are not easily human-readable. Their event2event network is an encoder-decoder model. The event2sentence model is similar to the event2event model with the exception of using beam search.

(Harrison et al., 2017) introduces a Monte Carlo approach for story generation, a related application of event extraction. Citing RNNs' difficulty in maintaining coherence across multiple sentences, they develop a Markov Chain Monte Carlo model that can generate arbitrarily long sentences. In this work, they use the same event representation as (Martin et al., 2018).

Prior work in curriculum learning (Bengio et al., 2009) explored shape recognition and language modeling. Specifically for language modeling, they experiment with a model to predict the best word following a context of prior words in a correct English sentence. Their language modeling experiment expanded the vocabulary, increasing the task difficulty as more words were added to the corpus. More recent work (Graves et al., 2017) applied curriculum learning to question-answering problems on the bAbI dataset (Weston et al., 2015), designed to probe the reasoning capabilities of machine learning systems.

The problem with symbolic approaches is the rigidity of the parsers and the fact that parses are based only on the encoded knowledge. The neural approaches are unbounded and produce a huge variety of generated sentences. However, they are not conditioned on specific text and the results vary, often producing unrealistic sentences. We propose a hybrid of the two approaches to provide structured events conditioned on realistic content.

3 Approach

Our goal is to extract event frames from movie descriptions in the format of "who", "did what", "to whom or what", "when", and "where". By extracting particular components of an event, it becomes easier to instantiate an event as an animation using existing software or present the event object to a human user for them to instantiate on their own terms. Once events are extracted in this format, a sequence of events can be used to animate the script and generate a short movie. In contrast to the purely symbolic approach taken by others, we take a neural approach, applying Recurrent Neural Networks (RNNs). The idea is that an RNN will learn to output a structure mirroring the symbolic approaches.


Figure 1: The model takes input word2vec vectors of each word and outputs their labels.

The input nodes are encoded as a fixed sequence of identical length and the outputs are labels of the provided structure. Figure 1 illustrates our model. We first encode each word in the input sentence to an M-dimensional vector using word2vec (Mikolov et al., 2013b). The embedding output vectors feed into an M-dimensional RNN with LSTM units. We standardized the length of the sentences by padding short sentences and capping the length of the longest sentence at 25 words. The hidden state of each unit is defined in (1) as h(t). The internal cell state is defined by C(t), and o(t) is the output gate. Intermediate variables f(t), i(t), and c(t) facilitate readability and correspond to the forget, input, and candidate gates, respectively. The cell state and the hidden state are propagated forward in time to the next LSTM unit.

f^{(t)} = \sigma(W_{fh} h^{(t-1)} + W_{fx} x^{(t)} + b_f)
c^{(t)} = \tanh(W_{ch} h^{(t-1)} + W_{cx} x^{(t)} + b_c)
i^{(t)} = \sigma(W_{ih} h^{(t-1)} + W_{ix} x^{(t)} + b_i)
C^{(t)} = C^{(t-1)} \odot f^{(t)} + i^{(t)} \odot c^{(t)}
o^{(t)} = \sigma(W_{oh} h^{(t-1)} + W_{ox} x^{(t)} + b_o)
h^{(t)} = \tanh(C^{(t)}) \odot o^{(t)}        (1)
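A direct transcription of equations (1) into code, keeping the gate names used above; this is a minimal sketch for readability rather than the authors' implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    # One LSTM step following eq. (1); W and b hold the per-gate parameters,
    # e.g. W["fh"], W["fx"], and b["f"] for the forget gate.
    f_t = sigmoid(W["fh"] @ h_prev + W["fx"] @ x_t + b["f"])   # forget gate
    c_t = np.tanh(W["ch"] @ h_prev + W["cx"] @ x_t + b["c"])   # candidate state
    i_t = sigmoid(W["ih"] @ h_prev + W["ix"] @ x_t + b["i"])   # input gate
    C_t = C_prev * f_t + i_t * c_t                             # new cell state
    o_t = sigmoid(W["oh"] @ h_prev + W["ox"] @ x_t + b["o"])   # output gate
    h_t = np.tanh(C_t) * o_t                                   # new hidden state
    return h_t, C_t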

A significant hurdle in training any network in this instance is class imbalance. Here, the model is trained using standard back-propagation with a weighted cross-entropy loss function used to avoid over-fitting to the null class.

4 Curriculum Learning

Sentences vary in difficulty due to structure, context, vocabulary, and more. As part of our experiments, we employed curriculum learning to potentially facilitate the learning process. We compare the curriculum training to standard batch processing.

We divide the training samples into three difficulty groups based on sentence length. We train the model with the easiest set first for 100 epochs before advancing to the medium and hard difficulty training samples, training for 100 epochs each. This results in 300 training epochs total, although the model is only exposed to a third of the dataset for 100 epochs at a time. We compare this to models where the training process exposes the model to the entire corpus for 300 epochs. We use sentence length, assuming that shorter sentences are easier as they contain fewer descriptive words, but other structural and semantic metrics could be used.
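A sketch of the length-based curriculum loop; train_epoch and the (sentence, labels) containers are placeholders for the actual training code, and the length boundaries are the ones given in Section 5.

def length_bucket(sentence):
    # Difficulty by word count: easy <= 8, medium 9-12, hard >= 13 (Section 5).
    n = len(sentence.split())
    if n <= 8:
        return "easy"
    return "medium" if n <= 12 else "hard"

def curriculum_train(model, data, train_epoch, epochs_per_stage=100):
    # Train on easy, then medium, then hard examples, 100 epochs each (300 total).
    buckets = {"easy": [], "medium": [], "hard": []}
    for sentence, labels in data:
        buckets[length_bucket(sentence)].append((sentence, labels))
    for stage in ("easy", "medium", "hard"):
        for _ in range(epochs_per_stage):
            train_epoch(model, buckets[stage])
    return model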

5 Experiments

MovieQA Dataset (Tapaswi et al., 2016): We use the descriptive video service (DVS) text from MovieQA. The DVS sentences tend to be simpler and describe the scene explicitly compared to the plot synopsis sentences. We generate an initial training corpus of extracted events using dependency parsers and information extraction annotators from CoreNLP. A total of 36,898 events are generated. Analyzing the corpus of extracted events, we found the longest sentence contained 62 words. However, by limiting our dataset to sentences with 25 words or fewer, we retained 97% of the data (35,791 sentences). Figure 3 shows the sentence length distribution of the DVS data and the plot synopsis data. We did not experiment with the plot synopsis data; rather, we wish to highlight the difference in sentence lengths between the two sets of data. The DVS data is heavily skewed towards shorter sentences, most likely due to requiring concise descriptions of what is happening on screen at that time. Plot synopsis sentences tend to be longer as they tend to summarize multiple actions and plot points. Sentences with multiple predicates generated multiple events, and this manifested itself as duplicate sentences in our dataset with multiple label sequences. Sentences with multiple events accounted for about 24% of the data. This does lead to complications for training, as we did not distinguish between events.


Figure 2: (Top, left to right) Easy, Medium, Hard Accuracy vs Learning Rate. (Bottom, left to right) Easy, Medium,Hard Accuracy vs Embedding Dimension.

Figure 3: Sentence Length Distribution from DVS (blue) and Plot Synopsis (orange) Data and Cumulative Distribution (solid lines and right axis)

The difficulty classes are computed by sentence length and are as follows: easy sentences are 8 words or fewer, with an average sentence length of 6.5 words; medium sentences are 9-12 words, with an average length of 10.4 words; and hard sentences are 13 words or longer, with an average length of 16.8 words. Each difficulty class contains one third of the original data set. For training with and without a curriculum, we used an 80-20 train-test split. For curriculum learning, each difficulty was trained on 80% of each of the respective difficulty sets and tested on the remaining 20%.

The extracted events provide weak labels generated by the CoreNLP-based approach.

Pre-processing: We tokenized the text, assigning an integer to each word after removing capitalization and apostrophes. Sentences are vectorized using this index. The output format assigns integers between 1-5 to parts of the sentence based on which elements of the sentence are part of the subject (1), predicate (2), object (3), location (4), or time (5) phrases. Articles, prepositions, conjunctions, adjectives, and adverbs were often assigned the null class (0), although some may be included as parts of event phrases. Sentences are left-padded with zeros to make all sentence vectors the same length.
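A sketch of this preprocessing step; the label inventory (0 = null, 1 = subject, 2 = predicate, 3 = object, 4 = location, 5 = time) and the left-padding follow the description above, while the tokenizer itself is a simplification.

def build_vocab(sentences):
    # Map each lower-cased, apostrophe-stripped word to an integer id (0 is reserved for padding).
    vocab = {"<PAD>": 0}
    for sent in sentences:
        for word in sent.lower().replace("'", "").split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, labels, vocab, max_len=25):
    # Vectorize a sentence and its per-word labels (assumes one label per token),
    # left-padding both with zeros to a fixed length.
    words = sentence.lower().replace("'", "").split()[:max_len]
    ids = [vocab.get(w, 0) for w in words]
    labs = list(labels)[:max_len]
    return [0] * (max_len - len(ids)) + ids, [0] * (max_len - len(labs)) + labs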

Implementation Details: The trained model is a basic LSTM model. We employ two different training approaches. In the first approach, we ignore the sentence length and use random batches of training data. We train for 300 epochs. Second, we use a training curriculum based on sentence length, starting training with shorter sentences and progressing to longer sentences. The sentences are divided into easy, medium, and hard difficulty sets with each set containing roughly one-third of the total data set.


Figure 4: Successful cases.
(a) Easy: "He gently strokes her cheek near the scar" (He: who; gently strokes: did what; her cheek: to whom/what; near the scar: where).
(b) Medium: "In the hospital, Caroline stares down at the diary" (in the hospital: where; Caroline: who; stares down at: did what; the diary: to whom/what).
(c) Hard: "The recruits are sitting at the base of their bunks assembling their rifles" (the recruits: who; are sitting: did what; at the base of their bunks: to whom/what; assembling their rifles: ignored).
(d) Challenging: "While she is distracted Noah swims away" (While: ignored; she: who; is distracted: did what; Noah: who; swims away: did what).

Figure 5: Failure cases.
(a) Easy: "He quickly shuts the glass door."
(b) Medium: "As they're walking away from the lake, Harry's friends are cheering him."
(c) Hard: "Soldiers climb into a truck and King George walks out of the prison."
The labels the model assigned to each example are discussed in Section 5.1.

We hold out a set of data from each difficulty level for validation. We train the model for 100 epochs on each level of difficulty to match the total of 300 epochs of the random training curriculum. The final approach is to start with shorter sentence lengths and train with longer sentences towards the end.

We examine four parameters: the learning rate, the embedding dimension, the hidden dimension, and the number of training epochs. A grid search was employed to examine the effects of these parameters on the validation accuracy. We varied the learning rate in powers of 10 from 1e-5 to 1e-2. The embedding dimension was varied from 50-500 in increments of 50. The hidden dimension was varied from 48-512 in increments of 16. Lastly, we varied the number of training epochs from 300-2000 in increments of 200. Below we include a sample of figures from this search where we fixed the number of epochs (300) and the hidden layer dimension (512) while adjusting the learning rate and the embedding dimension. For these parameters, we found an embedding dimension of 350 performs best on the easy and medium difficulty levels. Additional training is in progress.

We used ADAM (Kingma and Ba, 2014) for gradient optimization. Our loss function was a weighted categorical cross-entropy function using the class distribution from the training data as class weights. Accuracy is calculated by the frequency with which the predicted class matches the labels. The accuracy is then the total count of actual matches divided by the total potential matches.

5.1 Qualitative Results

In this work, we used the symbolic algorithm as a weak label for the neural network approach. The symbolic approach appears to work well for shorter sentences with simple sentence structure. However, as the sentences become longer and contain additional descriptive phrases, the dependency parsing becomes more complex. We present three examples (one from each difficulty level) of simple sentences on which the symbolic algorithm does well.

Figure 4 shows examples where our approach was successful. Figure 4(a) illustrates an easy difficulty sentence. The parser correctly identifies he as the subject, gently strokes as the verb phrase, her cheek as the object, and near the scar as the location. Figure 4(b) illustrates a medium difficulty sentence. The parser identifies Caroline as the subject again, the verb phrase stares down at, the object diary, and the location in the hospital.


Figure 4(c) illustrates a hard difficulty sentence. The parser identifies recruits as the subject. The verb phrase identified here is are sitting at and the object is base of their bunks. Another verb phrase that could be identified is assembling with the object their rifles. These examples also show how the model ignored articles in the sentences. Finally, Figure 4(d) illustrates a challenging sentence, one pleasantly surprising example of the model learning multiple events in a single short sentence. The model correctly identifies two subjects (she, Noah) and two verb phrases (is distracted, swims away).

Figure 5 shows examples where our approach failed. Figure 5(a) illustrates an easy difficulty sentence. The model incorrectly identifies glass as a predicate while correctly identifying shuts, suggesting the model does not anchor events around a predicate phrase. Figure 5(b) illustrates a medium difficulty sentence. The model identifies the verb phrase are cheering and the object him, but fails to recognize the subject Harry's friends. This is odd, as it would suggest the model does not recognize possessive apostrophes as part of a noun phrase, but may be confusing them with contractions. However, other situations show the model does not recognize the contraction either. Figure 5(c) illustrates a hard difficulty sentence. The model fails to identify soldiers as a subject of any predicate, yet correctly identifies climb into and walks out of as predicate phrases. It does, however, identify George as a subject, but not his descriptor of King.

As sentences get longer, the model begins to break down. This may be due to the weak labels provided by the symbolic algorithm. It may also be due to the non-linear relationship between subject, predicate, and object in the sentence. The neural model also fails when the location is a generic place such as a cafe or garage. One improvement could be to incorporate the WordNet meaning of each word in the sentence.

5.2 Quantitative Results

We show results from curriculum learning using the sentence length as a basis for the curriculum. The accuracy is shown in Figure 6 for both the non-curriculum learning and the length-based curriculum. We first train with easy data, with the easy validation data closely tracking it. After 200 epochs, we begin training with medium difficulty data. At this point, the easy validation data changes only a little, while medium and hard difficulty validation data continue to increase slightly. At 400 epochs, we begin training with the most difficult data: data containing the longest set of sentences in our dataset. Introducing the hard training data affects the easy and medium validation accuracy. The hard difficulty validation accuracy continues to increase while easy and medium drop. Due to the semi-supervised method of labeling the data using symbolic methods, we believe longer sentences tend to be noisier and less accurately labeled compared to shorter sentences. This introduces noisy labels for the network, confusing it on previously learned examples and leading to degraded performance. Descriptive phrases containing nouns can complicate the network and hinder identification of the subject or object.

6 Conclusions and Future Work

This work presents an initial study of neural event extraction. We intend to study bidirectional LSTMs and encoder-decoder models in future work. We anticipate bidirectional models and encoder-decoder models will enable the network to capture longer-term dependencies between object and predicate. We also plan to extend the dataset to additional sources with human annotations for more accurate ground truth labels.

An additional direction for future work is to incorporate graphs as a mechanism to enforce structure. In addition, we can extract events from visual information and use them to guide the events extracted from textual information. Using graphs generated from both visual and textual information will result in a more complete, and less noisy, event representation.

This experiment is a component of our work towards developing a mixed-initiative system for visual storytelling. Here, we take preliminary steps towards extracting events from movie descriptions with the intention of then instantiating events in an animation module. The simplified event structure facilitates the mixed system where either the computer or human can suggest events or render the event in a particular style or genre. Our vision for the system is to have a human and a computer take turns suggesting new events in a story, or suggest a story arc and generate pertinent, relevant events to justify the conclusion.


Figure 6: (Left) No curriculum learning, with all difficulties mixed into the training data. Validated against all difficulty levels at each epoch. (Right) Curriculum learning based on sentence length. Each difficulty trained for 100 epochs in easy, medium, hard order. Validated against all difficulty levels at each epoch.

Acknowledgments

This work is funded by DARPA W911NF-15-C-0246. The views, opinions, and/or conclusions contained in this paper are those of the author and should not be interpreted as representing the official views or policies, either expressed or implied, of DARPA or the DoD. Special thanks to Karine Megerdoomian for the helpful discussions.

References

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In ACL.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In IJCNLP.

N. Chomsky. 1957. Syntactic Structures.

Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. 2017. Automated curriculum learning for neural networks. In ICML.

Brent Harrison, Christopher Purdy, and Mark O. Riedl. 2017. Toward automated story generation with Markov chain Monte Carlo methods and deep neural networks. In Workshop on Intelligent Narrative Technologies.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations.

Lara J. Martin, Prithviraj Ammanabrolu, Xinyu Wang, Shruti Singh, Brent Harrison, Pradyumna Tambwekar, Murtaza Dhuliawala, Animesh Mehta, Richa Arora, Nathan Dass, Chris Purdy, and Mark O. Riedl. 2017. Improvisational storytelling agents. In NIPS Workshops.

Lara J. Martin, Prithviraj Ammanabrolu, Xinyu Wang, Shruti Singh, Brent Harrison, and Mark O. Riedl. 2018. Event representations for automated story generation with deep neural nets. In AAAI.

David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event extraction as dependency parsing. In Proceedings of the BioNLP Shared Task 2011 Workshop.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Michael Ringgaard, Rahul Gupta, and Fernando C. N. Pereira. 2017. SLING: A framework for frame semantic parsing. CoRR, abs/1710.07032.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215.

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR, abs/1502.05698.


Author Index

Amer, Mohamed, 60
Bamman, David, 33
Ghazvininejad, Marjan, 43
Gillick, Jon, 33
Gordon, Andrew, 14, 50
Hobbs, Reginald, 20
Jovanovic, Dejan, 60
Kasunic, Anna, 1
Kaufman, Geoff, 1
Knight, Kevin, 43
Lukin, Stephanie, 20
May, Jonathan, 43
Peng, Nanyun, 43
Roemmele, Melissa, 14, 50
Tozzo, Alex, 60
Voss, Clare, 20
