
RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text

Liam Dugan*, Daphne Ippolito*, Arun Kirubarajan*, Chris Callison-Burch
University of Pennsylvania

{ldugan, daphnei, kiruba, ccb}@seas.upenn.edu

Abstract

In recent years, large neural networks for natural language generation (NLG) have made leaps and bounds in their ability to generate fluent text. However, the tasks of evaluating quality differences between NLG systems and understanding how humans perceive the generated text remain both crucial and difficult. In this system demonstration, we present Real or Fake Text (RoFT), a website that tackles both of these challenges by inviting users to try their hand at detecting machine-generated text in a variety of domains. We introduce a novel evaluation task based on detecting the boundary at which a text passage that starts off human-written transitions to being machine-generated. We show preliminary results of using RoFT to evaluate detection of machine-generated news articles.

1 Introduction

Despite considerable advancements in building natural language generation (NLG) systems that can output extremely fluent English text, there is still not very much understanding of how humans perceive machine-generated text. Such an understanding is crucial both to evaluate improvements in NLG systems and to analyze the societal ramifications of machine-generated text becoming increasingly easy to produce.

When evaluating NLG systems, it is considered standard practice to ask evaluators to rate generated text on criteria such as fluency, naturalness, or relevance to a prompt on a Likert scale (van der Lee et al., 2019). Preference studies, where a rater is shown two generated excerpts and asked which one they prefer, are also common. Some recent work has focused on the detection problem: how capable are humans of distinguishing textual excerpts generated by a system from those written by another human (Ippolito et al., 2020; Zellers et al., 2019)?

∗ Authors listed alphabetically contributed equally.

Figure 1: A word cloud of common words that annotators used to describe why they thought sentences were machine-generated.

However, due to the prohibitive cost of running human evaluation studies, most prior work in this area has been rather limited in scope. For example, analyses usually show results on only a single category of text (news articles, stories, webtext, etc.). This could be problematic since different domains have different levels of named entities, world facts, narrative coherence, and other properties that impact the success of NLG systems. In addition, most papers only evaluate on a very limited selection of decoding strategy hyperparameters. Holtzman et al. (2019) and Ippolito et al. (2020) both show that the decoding strategy chosen at inference time can have a significant impact on the quality of generated text.

In this work, we introduce the Real or Fake Text (RoFT) system, a novel application for simultaneously collecting quality annotations of machine-generated text while allowing the public to assess and improve their skill at detecting machine-generated text.


In RoFT, we propose to use the task of detecting when text is machine-generated as a quality criterion for comparing NLG systems. Following Ippolito et al. (2020), we make the counter-intuitive assumption that the worse annotators are at detecting that text is machine-generated, the better we can say that NLG system is at generating text.

In RoFT’s detection task, annotators are shown a passage of text one sentence at a time. The first several sentences are from a real human-written text source and the next several sentences are a machine-generated continuation. The user’s goal is to guess where the boundary is. When they think that a sentence is machine-generated, they are asked to give an explanation for their choice, and afterwards the true boundary is revealed.

In the remainder of this paper, we discuss why we think this task is interesting from a research perspective and describe the technical details behind our implementation. We show preliminary results that showcase the types of analyses that are possible with the collected data, and finally we discuss plans for future work.

The RoFT website is located at http://www.roft.io/. The source code is available under an MIT License at https://github.com/kirubarajan/roft.

2 Research Motivations

The purpose behind RoFT is to collect annotations on the scale needed to probe the quality of text generated under a variety of NLG conditions and systems. In this section, we describe several research questions we aim to answer using RoFT data.

2.1 Length Threshold for Detection

State-of-the-art generative models tend to produce text that is locally fluent but lacking in long-term structure or coherence. Intuition suggests that fluent NLG systems ought to produce text that is high quality for long durations (measured in number of sentences). As such, we are interested in using the boundary detection task—whether annotators can detect the boundary between human-written text and a machine-generated continuation—as a comparison method for NLG systems. We hypothesize that for better quality systems, the generated text will be able to fool humans for more sentences.

2.2 Text Genre/Style

Generative language models have now been trained and fine-tuned on a great diversity of genres and styles of text, from Reddit posts (Keskar et al., 2019) and short stories (Fan et al., 2018) to Wikipedia (Liu et al., 2018) and news articles (Zellers et al., 2019). Each of these datasets has its own distinct challenges for generation; for example, in the story domain it is acceptable for a generator to make up facts, while this would be unacceptable in a Wikipedia article. We are interested in how these differences might impact the ability of humans to detect machine-generated text.

2.3 Reasons Text is Low Quality

A study by van der Lee et al. (2019) found that less than 3% of recent papers on NLG ask for free-text comments when performing human evaluations. And yet, understanding why humans think text is low quality can be very important for diagnosing problems in NLG systems (Reiter and Belz, 2009). Therefore, the RoFT platform collects free-form textual explanations from our annotators on their decisions. Such data, though inevitably noisy, could provide insights into the types of errors that NLG systems introduce, the types of errors humans are sensitive to, and even the types of errors human-written corpora contain (when a rater inadvertently predicts that a human-written sentence is machine-generated).

2.4 Human Factor

The boundary detection task posed by RoFT is an artificial one. We do not expect that real-world uses of machine-generated text would involve a tidy split of prompt sentences followed by a machine-generated continuation. However, we believe that even an artificial framing such as RoFT’s has the potential both to educate the public on what to look for in machine-generated text and to give researchers insights into how humans perceive and react to such text. We are in particular interested in whether annotators improve over time and in how annotator demographic (for example, paid crowd worker vs. university student) impacts detection skill.

3 System Overview

This section gives an overview of RoFT’s design, including the task that annotators are asked to complete and methods for encouraging organic traffic.


3.1 Task Definition

The RoFT annotation task is posed as a game. Users first choose which category they would like to play in, where different categories correspond to different text domains or NLG systems. The “game” then consists of a series of rounds. Each round starts with the user being presented a single sentence that is guaranteed to be human-written. For example, this might be the first sentence of a New York Times article. Afterwards, users may select to display more sentences, one at a time. At each step, they must decide if they believe that the most recent sentence is still written by a human. When the user decides they are confident that a machine has written the most recent sentence (i.e., they have found the “boundary sentence”), the round ends. The user is then asked to provide a natural language explanation of what prompted their decision. In essence, the annotators’ goal is to identify the exact sentence where a machine “takes over” and the text is no longer human-written. Figure 2 gives screenshots of the flow of a single round.
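
To make the round structure concrete, here is a minimal sketch of one round in Python; the names (run_round, sentences, true_boundary, ask_user) are hypothetical and not taken from the actual RoFT implementation.

```python
# Hypothetical sketch of a single RoFT round, not the actual implementation.
# `sentences` is the full passage (human prompt + machine continuation) and
# `true_boundary` is the index of the first machine-generated sentence.
def run_round(sentences, true_boundary, ask_user):
    for i, sentence in enumerate(sentences):
        print(sentence)  # reveal one more sentence of the passage
        verdict = ask_user("Is this sentence human-written or machine-generated?")
        if verdict == "machine":
            explanation = ask_user("What prompted your decision?")
            print(f"The machine actually took over at sentence {true_boundary}.")
            return {"guessed_boundary": i, "explanation": explanation}
    # The user believed every sentence was human-written.
    return {"guessed_boundary": None, "explanation": None}
```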

3.2 Implementation

The RoFT annotation website is designed to collect data needed to answer a variety of research questions, including those posed in Section 2. In particular, our system stores detailed metadata for each annotation. These include the order in which a user completed annotations, the type of user account associated with each annotation (e.g., paid worker or organic traffic), the NLG system used to produce each generation, and the amount of time each annotation takes. The system was developed in Python using the Django framework and a SQL database. The use of a relational database enables sophisticated queries to be made on the collected annotations for analysis. We plan to make dumps of the database available to other researchers to further promote research into the evaluation of generated text.
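
For readers unfamiliar with Django, the metadata described above could be modeled roughly as follows. This is only an illustrative sketch; the field and model names are hypothetical, and the actual schema lives in the RoFT repository.

```python
# Illustrative sketch of per-annotation metadata as a Django model.
# Hypothetical names; see the RoFT repository for the actual schema.
from django.db import models


class Annotation(models.Model):
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    account_type = models.CharField(max_length=32)   # e.g. paid worker vs. organic traffic
    nlg_system = models.CharField(max_length=64)     # system that produced the generation
    decoding_p = models.FloatField()                  # nucleus sampling p used for decoding
    guessed_boundary = models.IntegerField()          # sentence index the user selected
    explanation = models.TextField(blank=True)        # free-form explanation of the decision
    annotation_index = models.IntegerField()          # order within the user's session
    seconds_taken = models.FloatField()               # time spent on this annotation
    created_at = models.DateTimeField(auto_now_add=True)
```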

3.3 Gamification

Since the cost of collecting human annotations via a crowd platform such as Amazon Mechanical Turk can be prohibitively expensive for large studies, we aimed to build the RoFT website in a manner that would encourage sustained participation without the need for a financial incentive.

Each user has a Profile page (shown in Figure 3) where they can see statistics on the total number of annotations they’ve done, how many points they have earned, and how many questions they have answered perfectly. There is also a leaderboard where users can check how their point count compares to other raters. The leaderboard encourages users to do more annotations, since this is the only way to move up in the rankings.

Figure 2: The user interface for annotation. (a) The user is shown an initial sentence and then one sentence of continuation at a time. At each step, the user decides if the latest sentence is human-written or machine-generated and presses the appropriate button. (b) When the user decides that the most recent sentence is machine-generated, they are asked to provide an explanation for their decision. (c) The true boundary is then revealed. In this case, the user would be alerted that they received 5 points since they guessed the boundary correctly.

Figure 3: A user’s profile page.

We received unsolicited compliments from our initial annotators such as “Interesting, fun task” and “Really convincing passages.” We intend to add further gamification elements, including leaderboards broken down by text domain, comprehensive statistics on user progress and skill, and the ability to see and up-vote the free-text comments of other users.

3.4 Generations

We ultimately plan to use RoFT to study differences in detection performance across a variety of NLG systems and text domains. The initial version of RoFT includes two complementary categories of text: news and fictional stories. Users have the option to choose which category they would like to annotate.

For the news category, prompts are drawn from the New York Times Annotated Corpus (Sandhaus, 2008) and are truncated to between 1 and 10 sentences long. GROVER (Zellers et al., 2019) is then conditioned on these starting sentences and asked to complete the article. Finally, the outputs from GROVER are truncated so that the total number of sentences for each example is 10.

The data on fictional stories was prepared similarly, except that the Reddit Writing Prompts dataset (Fan et al., 2018) was used for the prompts, and the GPT-2 XL model (Radford et al., 2019) was used for generation.

Each category contains over 1,500 examples, where for each example the number of human-written context sentences as well as the values of the decoding strategy hyperparameters were chosen randomly. For our initial seeding of data, nucleus sampling (Holtzman et al., 2019) was used for all decoding, where the p hyperparameter, which controls the diversity of the generated text, was randomly selected to be anywhere from p = 0 (argmax) to p = 1.0 (full random sampling).
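
As a rough illustration of this seeding procedure (not the authors' actual pipeline), the sketch below shows how a truncated prompt could be continued with nucleus sampling via the Hugging Face transformers library; the model choice, sentence splitting, and function name are assumptions made for illustration only.

```python
# Rough sketch of seeding one example with nucleus sampling (illustrative only;
# the authors used GROVER for news and GPT-2 XL for stories, and their exact
# truncation and sentence-splitting code is not shown here).
import random
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

def make_example(article_sentences, total_sentences=10):
    # Randomly truncate the human-written prompt to 1..total_sentences sentences.
    num_human = random.randint(1, total_sentences)
    prompt = " ".join(article_sentences[:num_human])

    # Randomly choose the nucleus sampling hyperparameter p.
    p = random.choice([i / 10 for i in range(11)])  # 0.0, 0.1, ..., 1.0

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    gen_kwargs = {"max_new_tokens": 256, "pad_token_id": tokenizer.eos_token_id}
    if p > 0:
        gen_kwargs.update(do_sample=True, top_p=p)
    else:
        gen_kwargs.update(do_sample=False)  # p = 0 corresponds to greedy (argmax) decoding
    output_ids = model.generate(input_ids, **gen_kwargs)
    continuation = tokenizer.decode(
        output_ids[0, input_ids.shape[1]:], skip_special_tokens=True
    )

    # Naive sentence split; keep only enough generated sentences so the full
    # example contains total_sentences sentences.
    generated = continuation.split(". ")[: total_sentences - num_human]
    return prompt, generated, num_human  # num_human is the true boundary index
```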

4 Case Study

To show the efficacy of RoFT as an evaluation tool, we present a case study from our initial pilot of over 3000 annotations of generations from the news article domain.

4.1 Data Collection

While our eventual hope is for the RoFT website to have enough organic traffic for useful data to be collected, for the purposes of this study, two hundred Amazon Mechanical Turk workers were paid to complete 10 annotations each on the website. In total, we collected 3244 annotations (7.9% of annotators continued past the minimum of 10 questions they were required to do to get paid). 10% of the examples the crowd workers saw were designated attention check questions, in which the prompt explicitly stated they should select “human-written” at every step. About 25% of crowd workers failed this check, and after filtering out these annotators, we were left with a total of 1848 high-quality annotations, which we will refer to as the filtered annotation set.
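
A minimal sketch of this filtering step might look as follows, assuming each annotation record stores its worker ID, whether it belongs to an attention-check example, and whether that check was passed; the record layout and function name are hypothetical, and dropping the attention-check questions themselves from the filtered set is an assumption.

```python
# Hypothetical sketch of building the filtered annotation set: drop every
# annotation from any worker who failed an attention-check question, and
# (assumed) drop the attention-check questions themselves.
def filter_annotations(annotations):
    failed_workers = {
        a["worker_id"]
        for a in annotations
        if a["is_attention_check"] and not a["passed_check"]
    }
    return [
        a for a in annotations
        if not a["is_attention_check"] and a["worker_id"] not in failed_workers
    ]
```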

4.2 Inter-Annotator Agreement

There were 768 examples which had at least two crowd workers provide annotations for them (645 of which had at least three annotations provided). This led to 6,115 instances of pairs of annotations on the same examples. Of these, 18.3% predicted the exact same sentence as the boundary, and 28.4% predicted boundaries at most one sentence apart from each other. When considering only the filtered annotation set, there were 2,064 pairs of annotations. Of these, 18.6% predicted the exact same sentence as the boundary, and 28.3% predicted boundaries at most one sentence apart from each other.
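
For concreteness, pairwise agreement of this kind could be computed roughly as below; the grouping of annotations by example and the variable names are assumptions, not the authors' analysis code.

```python
# Rough sketch of the pairwise agreement numbers reported above.
# `by_example` maps an example ID to the list of boundary guesses (sentence
# indices) made on that example. Names are illustrative assumptions.
from itertools import combinations

def pairwise_agreement(by_example):
    pairs = [
        (a, b)
        for guesses in by_example.values()
        for a, b in combinations(guesses, 2)
    ]
    exact = sum(a == b for a, b in pairs) / len(pairs)
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, within_one
```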


Figure 4: A histogram of the filtered annotation set grouped by the distance (in number of sentences) between the sentence selected by the annotator and the true boundary sentence.

4.3 Evaluation Measures

We consider three methods for evaluating annotator ability.

4.3.1 Accuracy

Among annotators that passed our attention check, 15.8% of the filtered annotations correctly identified the exact boundary between human-written and machine-generated text. Additionally, the average annotation from our filtered set was 1.989 sentences after the true boundary. This is consistent with our intuition, namely that current state-of-the-art NLG systems are capable of fooling humans, but typically only for one or two sentences.

4.3.2 Distance from Boundary

In Figure 4, we show a histogram of the number of annotations at each possible distance from the true boundary. As a note, values closer to zero are more likely by construction because there are more opportunities for these distances to be selected. For example, a distance of -9 is only possible if the generation boundary is at the 10th sentence, while a distance of 0 is possible in every configuration. However, the observed distribution of distances is far from symmetric, with the left tail (composed of annotators picking human-written sentences) dropping off precipitously while the right tail (composed of machine-generated sentences) decreases more linearly. This indicates that our annotators are successfully picking up on clues in the generated text, and thus the sentence-by-sentence structure of the RoFT experiment is an effective way to evaluate text. These preliminary results bode well for future large-scale use of the tool.

4.3.3 Points Awarded

While accuracy may be a simple and intuitive metric for assessing performance, it is sub-optimal for our purposes as it does not give partial credit for guesses that are after the boundary, despite such guesses being successful identifications of generated text. Average distance (in sentences) from the boundary is not sufficient either, as it does not weight all guesses before the boundary equally negatively and thus over-penalizes too-early annotations on examples with late-occurring boundaries.

To combat these issues, we developed a point system to better capture annotator ability. After each annotation, a user is assigned points based on their performance: 5 points for guessing exactly on the boundary and a linearly decreasing number of points for each sentence beyond the boundary. No points are awarded for guesses that appear before the boundary. We use the average points per annotation as our metric for the experiments shown in Figure 5.
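
Concretely, this scoring rule can be written as a small function. The exact decrement per sentence is not stated above, so the sketch below assumes one point is lost per sentence past the boundary, floored at zero; that rate is an assumption, not a detail confirmed by the paper. Under these assumptions, a guess on the exact boundary earns 5 points, a guess one sentence late earns 4, and a guess before the boundary earns nothing.

```python
# Sketch of the RoFT point system described above. The one-point-per-sentence
# decay after the boundary is an assumed rate, not confirmed by the paper.
def points(guessed_boundary, true_boundary, max_points=5):
    if guessed_boundary < true_boundary:
        return 0                          # guessed before the boundary: no credit
    late_by = guessed_boundary - true_boundary
    return max(0, max_points - late_by)   # 5 on the boundary, fewer when late
```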

4.4 Skill Range of Annotators

There was a significant range in detection ability across the crowd workers. The top 5% of the filtered worker pool earned an average of 3.34 points per annotation, while the bottom 5% earned an average of 0.35. Since it is difficult to separate out the influence of inherent skill from that of misaligned incentives (AMT workers were paid for completion, not correctness), more research is necessary to understand differences in annotator ability.

4.5 Impact of Decoding Strategy

During our small-scale case study, we did not see a noticeable correlation between the values of the nucleus sampling (Holtzman et al., 2019) hyperparameter p and the detection accuracy of humans, as reported in Figure 5b. This is likely due to the low number of annotations per value of p (n=180), and we hope to run a more comprehensive version of this experiment with more data in the future.

4.6 Impact of Revealing the Boundary

As part of the gamification aspect of the RoFT platform, we reveal the true boundary to our annotators after every annotation they complete. This feature adds a level of interactivity to the process and is crucial for ensuring that the RoFT experiment is enjoyable and appeals to the general public. To better understand how this decision affected annotator skill, we analyzed whether our annotators got more accurate as they did more annotations. Figure 5a shows that over a session of 10 annotations, annotators exhibit little to no improvement at the annotation task over time. Future studies using the RoFT platform will further investigate whether human annotators can be trained to detect generated text over long periods of time and multiple gameplay sessions.

Figure 5: In (a) we show the average number of points (Section 4.3) received per annotation in the filtered annotation set, grouped by the temporal order in which the annotations were shown to the annotators, from 0 (first) to 9 (last). In (b) we show the average number of points received per item in the filtered annotation set for each value of p used for decoding. Error bars are standard deviation. No statistically significant trends were observed in this preliminary study.

4.7 Free-form Comments

Our proposed annotation system allows annotators to provide a natural language explanation of why they made a particular decision (e.g., classifying a sentence as human-written or machine-generated). Due to minimal oversight, many annotators re-used or copy/pasted their comments across annotations. Filtering for duplicates, we collected over 1200 unique comments out of around 3000 annotations. Manual inspection shows that many annotations relied on similar clues, such as problems with entailment, formatting (i.e., punctuation), and repetition. These responses can be used to inform future improvements to existing NLG systems and decoding strategies. Additionally, it is possible to use data mining techniques to extract an error taxonomy from the provided natural language descriptions of errors.

Sample Annotation

Seems like a conversational statement that doesnt logically follow from a book title reference

not relevant to preceding sentences

I don’t think that a human would write about tarot cards in an obituary and it says obituaries plural.

The sentence is too short and simple, sweating computerized.

First time I heard of dinosaur-eating mammals

The sentence is left hanging.

Repeated the second line again and To is written as TO

Table 1: Examples of explanations crowd workers gave for why they thought a sentence was machine-generated.

5 Related Work

Nearly all papers in NLG do some form of human evaluation, usually using Amazon Mechanical Turk (van der Lee et al., 2019). Typically the interfaces for these evaluations are simple web forms. van der Lee et al. (2019) offers a survey of many of these methods. Custom-designed websites for collecting or displaying human evaluations of generated text have become increasingly prominent in the open-ended dialog domain, with ChatEval (Sedoc et al., 2018) and ConvAI (Pavlopoulos et al., 2019) being two examples.

However, RoFT was primarily influenced by other “real or fake” websites that attempt to gamify the detection task, such as http://www.whichfaceisreal.com/ for generated face images and https://faketrump.ai/ for generated Tweets. Our task is similar to the one used for human evaluation in Ippolito et al. (2020), except in their task the text shown to raters was either entirely human-written or entirely machine-generated.

The boundary detection task we propose was inspired by the Dialogue Breakdown Detection Challenge (Higashinaka et al., 2016), in which the goal is to automatically detect the first system utterance in a conversation between a human and a chatbot system that causes a dialogue breakdown.

6 Conclusion and Future Work

In this work, we have introduced RoFT and shown how it can be used to collect annotations on how well human raters can tell when an article transitions from being human-written to being machine-generated.

Ultimately, we plan to use RoFT to conduct a large-scale systematic study of the impact of decoding strategy, fine-tuning dataset, prompt genre, and other factors on the detectability of machine-generated text. We also intend to collect and release a large dataset of natural language explanations for why humans think text is machine-generated. We hope that these will provide insights both into problems with the human-written text we use as prompts and into the types of errors that NLG systems make.

Such a study will require tens of thousands of human annotations. We hope that by gamifying the annotation process and encouraging organic traffic to the website, we can ultimately bypass the need for crowd workers who, since they are paid by the annotation, are disincentivized from taking the time to provide high quality annotations.

We believe that RoFT provides a powerful tool for understanding the strengths and limitations of a great variety of NLG systems, and we look forward to working with researchers interested in testing out their own model outputs within the RoFT evaluation framework.

Acknowledgements

This research is based upon work supported in part by the DARPA KAIROS Program (contract FA8750-19-2-1004), the DARPA LwLL Program (contract FA8750-19-2-0201), and the IARPA BETTER Program (contract 2019-19051600004). Approved for Public Release, Distribution Unlimited. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, IARPA, or the U.S. Government. The RoFT website is also supported by a grant from the Google Cloud Platform research credits program.

We thank the members of our lab for their feedback on the design of the RoFT user interface.

References

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. 2016. The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3146–3150.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.

Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1808–1822.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. Salesforce Einstein.ai blog.

Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations.

John Pavlopoulos, Nithum Thain, Lucas Dixon, and Ion Androutsopoulos. 2019. ConvAI at SemEval-2019 Task 6: Offensive language identification and categorization with perspective and BERT. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 571–576.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12):e26752.

João Sedoc, Daphne Ippolito, Arun Kirubarajan, Jai Thirani, Lyle Ungar, and Chris Callison-Burch. 2018. ChatEval: A tool for the systematic evaluation of chatbots. In Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG), pages 42–44.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9054–9065.