document-level text quality: models for organization and

48
Document-level Text Quality: Models for Organization and Reader Interest Annie Louis October 16, 2015 Université Catholique de Louvain Joint work with Ani Nenkova

Upload: others

Post on 09-Dec-2021

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Document-level Text Quality: Models for Organization and

Document-levelTextQuality:ModelsforOrganizationandReaderInterest

AnnieLouisOctober16,2015

Université Catholique deLouvain

JointworkwithAni Nenkova

Page 2: Document-level Text Quality: Models for Organization and

Peoplespontaneouslyrespondtodifferencesinwriting

2

Page 3: Document-level Text Quality: Models for Organization and

3

“Finnegans Wake islong,dense,andlinguisticallyknotty,yethugelyrewarding,ifyou'rewillingtolearnhowtoreadit...”

http://www.publishersweekly.com

Page 4: Document-level Text Quality: Models for Organization and

“MyFaith:WhyIdon'tsingthe'StarSpangledBanner’”

“What a poorly written article. Strays off topic and hardly even addresses the point of the article.

The only brief mention of why they don’t play the national anthem is that they believe in church and state. This just was one long rant about his religion.”

4

http://www.cnn.com

Page 5: Document-level Text Quality: Models for Organization and

5

http://www.vocabula.com

Page 6: Document-level Text Quality: Models for Organization and

TextQualityPrediction

Thisarticleiswell-written.Nextone..

Canweteachcomputerstomakesimilarjudgements?

6

o Howtoformulatethetask?

o Getsuitabledatawithdistinctions

o Findcorrelatesintext

Page 7: Document-level Text Quality: Models for Organization and

Whydowecare?• Informationretrieval,articlerecommendation

– Allarticlesarenotofthesamequality– Canfilterbyqualityinadditiontorelevance

• Authoringsupport,educationalassessment– Automaticassessmentischeap,consistentandquick– Spellingandgrammarcorrectionarecommerciallysuccessful

• Textgenerationsystems– Systemscanunderstandhowtogeneratecoherenttext– Canevaluatesystemoutput

7

Page 8: Document-level Text Quality: Models for Organization and

Thistalk

• Definingtextqualityandcreatingacorpusofoverallarticleratings– Largescalerealisticsampleofwritingdifferences

• Twomodels– Amodelfororganizationusingsyntaxpatterns– Amodelforreaderinterest

• Document-levelqualityprediction– Incontrasttospellingandgrammar– Oftennotabinary,correct/in-correctdistinction

8

Page 9: Document-level Text Quality: Models for Organization and

>>DefiningTextQuality

9

• Aspectsofquality• Whoistheaudience?

Page 10: Document-level Text Quality: Models for Organization and

Aspectsofquality

• Weadoptadefinitionfromtheeducationfield

10

Page 11: Document-level Text Quality: Models for Organization and

Ideasanddevelopment

Organization(Smooth

transitions)

Voice(Personaltouch)

Wordchoice(vivid, lively)

Sentencefluency(Rhythm)

Conventions(Mechanics)

SixTraits[Spandel 2004]Detailsandtheirpresentation

Flowbetweensentences

Interestingnature,beautifulwriting

Spelling,grammar,layout

11

Page 12: Document-level Text Quality: Models for Organization and

Audiencefortextquality– Anexpert

lowcompetency highcompetency

ExperiencedNLPresearcher

Readerofmachine-generatedtext

Adult readerofnewspaper

• Increasedfocusonlinguisticpropertiesofthetext

12

Page 13: Document-level Text Quality: Models for Organization and

Relationshiptoreadability• Readabilityhasastrongfocusoncomprehension

Gradelevel1

Gradelevel2..

Gradelevel12

13

• Audiencedistinctions– childvs.adult,novicevs.expert,cognitivedisabilityornot

Page 14: Document-level Text Quality: Models for Organization and

>>ACorpusforDocument-levelQuality

14

Louis&Nenkova,DiscourseandDialogue,2013

Page 15: Document-level Text Quality: Models for Organization and

Sciencejournalism:examplesnippet

Sarah Lewis is fluent in firefly.

On this night she walks through a farm field in eastern Massachusetts, watching the first fireflies of the evening rise into the air and begin to blink on and off.

Dr. Lewis, an evolutionary ecologist at Tufts University...

15

Page 16: Document-level Text Quality: Models for Organization and

Category1:VERYGOODarticles• Seedset=63NewYorkTimesarticlesthatappearedintheBestAmericanScienceWritingseries

• WechooseonlytheNYTarticles– WeusetheNYTCorpustoexpandourcategory– Normalizefordifferencesinwritingduetosource

16

Page 17: Document-level Text Quality: Models for Organization and

Topicsintheseedset

17

Tag AppearanceMedicineandHealth 22

Space 14Physics 10Biology andBiochemistry 8

GeneticsandHeredity 8

Archaeology andAnthropology 7

ComputersandtheInternet 4

Page 18: Document-level Text Quality: Models for Organization and

ExpandingtheVERYGOODset

• Assume:~40authorsoftheseedsetareexcellentwriters

• OtherarticlesfromtheNYTwrittenbythesameauthors– whichareresearchrelated– duringthesame10yearperiod– onsimilartopics– similarlengths

18

Page 19: Document-level Text Quality: Models for Organization and

Category2:TYPICALwritingintheNYT

• Othersciencearticlesaroundthesametime,butnotwrittenbythepopularauthors

Category TotalArticles

VERY GOOD 3,530

TYPICAL 20,242

Thegeneralcorpus:

19

Page 20: Document-level Text Quality: Models for Organization and

Atopic-pairedcorpus

• Thegeneralcategoriesmixdifferenttopics– geography,biology,astronomy,linguistics…

• ButanIRsystemcomparesarticlesonthesametopic

• ForeachVERYGOODarticle,get10mostsimilarTYPICALarticles(basedonthecontent)

• Enumerateallpairsof(VERYGOOD,TYPICAL)

• 35,300pairs

20

Page 21: Document-level Text Quality: Models for Organization and

Twoqualitypredictiontasks

`Same-topic’– whicharticleinthepairistheVERYGOODone?

2categoriesGOOD(~3500)TYPICAL(~3500)

Topicallysimilarpairs<VERYGOOD,TYPICAL>

~35,000pairs

`Any-topic’– isthisarticleVERYGOODorTYPICAL?

21

Page 22: Document-level Text Quality: Models for Organization and

Propertiesofthedataset

• Distinguishesaveragewritingfromverygood

• Allowtofocusonaspectssuchasbeautifulwriting– Lesslikelytohavespellingandgrammarerrors

• Largescaleandrealisticsampleofwritingdifferences– Previousworkoftenusedmachinegeneratedtextorartificiallymanipulatedtext

22

Page 23: Document-level Text Quality: Models for Organization and

>>Predictingorganizationquality

23

Louis&Nenkova,EMNLP2012

Page 24: Document-level Text Quality: Models for Organization and

Somesequencesofsentencetypesconveytheoverallpurposebetter

24

Solving X is useful for many applications.

We present a new approach to address X.

Results show that our method works well.

Motivation

Introduceapproach

Results

Page 25: Document-level Text Quality: Models for Organization and

Intentionalstructureofanarticle• Everytexthasapurposethattheauthorwishestoconvey

• Influentialearlytheoriesdiscussitatlength

[Grosz&Sidner 1986]

• Particularlyforacademicwriting,itispopulartoseearticlesasasequenceofintentions

[Swales1990,Teufel 2000]

Narrative

Explanation Critique

25

Page 26: Document-level Text Quality: Models for Organization and

Oraclemodelofintentionalstructure• UsingmanualannotationsofintentionsonACLarticles

26

STARTBackground

AimOwn

Contrast

TextualEND0.7

0.1

0.3

MarkovChainonIntroductionsections

[corpusbyTeufel,2000]

0.8

0.4

Otherswork0.1

Page 27: Document-level Text Quality: Models for Organization and

Mainideaofthework• Annotatingsentencetypesishard.Pre-definingthesetof

sentencetypesisharder

• Assume

27

Syntax~roughproxyforsentencetype

Page 28: Document-level Text Quality: Models for Organization and

28

Syntacticpatternsinexplanations

• An aqueduct is awatersupplyornavigablechannelconstructedtoconveywater.Inmodernengineering, thetermisusedforanysystemofpipes,canals,tunnels,andotherstructuresusedforthispurpose.

• A cytokinereceptoris areceptor thatbindscytokines.Inrecentyears, thecytokinereceptorshavecometodemandmoreattentionbecausetheirdeficiencyhasnowbeendirectlylinkedtocertaindebilitatingimmunodeficiencystates.

Definitionslooklikethis

Descriptivearticleslooklike

this

indefinitearticletermtodefine

Relativeclause

is/are NPMorespecific:topicalizedPP

Page 29: Document-level Text Quality: Models for Organization and

Syntax-basedHMMmodel

START END

0.5

0.3

0.2

0.3

0.2

VPà VBZNP

NPà DTADJP

NPà NPPP

….

“Definitions”

NPà NNPCCNNP

NPà CD

NPà NP,NP

“Citations”

VPàMDVP

VPà VBVP

VPà VBPP

“Speculations”

29*Usesgrammaticalproductions

Page 30: Document-level Text Quality: Models for Organization and

30

• Moreinformationaboutadjacentconstituents• APOStagsequencelosesallabstraction

• D-sequence– controlabstractionusingaparameter“depth”(d)

Asecondmodel:basedond-sequences

S

NP”,S“ VP .

NP VP

DT VBZ NP

NN

NNPNNP VBD

JJ

[“DTVBZJJNN,”NNPNNPVBD.]

Page 31: Document-level Text Quality: Models for Organization and

31

Step1– depthcutoffROOT

S

NP”,S“ VP .

NP VP

DT VBZ NP

JJ NN

NNPNNP VBD

Chooseadepthd

Terminatetreeatd

Readoffnewleavesfrom lefttoright

d=2

“S,”NPVP.

d=3“NPVP,”NNPNNPVBD.

“That’sgoodnews,”Dr.Leaksaid.

Page 32: Document-level Text Quality: Models for Organization and

32

Step2:NodeaugmentationROOT

S

NP”,S“ VP .

NP VP

DT VBZ NP

JJ NN

NNPNNP VBD

Forphrasalnodes ind-sequence,

- Annotatewithleftmostleafinfulltree

d=2

“ SDT , ” NPNNP VPVBD .

d=3

“ NPDTVPVBZ ,” NNPNNPVBD.

DT NNP VBD

DT VBZ

Page 33: Document-level Text Quality: Models for Organization and

Evaluationtaskonacademicwriting

• ACLanthologycorpus– abstract,introduction,relatedwork

• Approximatedistinction fororganizationquality– Originalarticleà well-organized– Randompermutationoforiginalà poorly-organized– Createpairs<original,permutation?

• Task:identifytheoriginalversioninthepair– Baseline50%accuracy

33

Page 34: Document-level Text Quality: Models for Organization and

Summaryofresultsonacademicwriting

• Correct=higherlikelihoodfororiginalarticle– versuspermutedarticle

• D-seq model

34

ACLconference Accuracy

Abstract 62.9

Introduction 68.8

Relatedwork 72.7

Baseline=50%

Page 35: Document-level Text Quality: Models for Organization and

DosentencetypesdistinguishVERYGOODandTYPICALsciencenews?

• CreatetheHMMonVERYGOODtrainingarticles

• Getlikelihoodandmostlikelystatesequenceforanewarticle– Computefeaturesbasedonthese

• AclassifieristrainedtopredicttheVERYGOODarticle

35

Page 36: Document-level Text Quality: Models for Organization and

Resultsonourcorpus

36

AnyTopic:Givenanarticle,isit“VERYGOOD”or“TYPICAL”?

System Accuracy

Baseline(random) 50%

HMM-productions 61%

§ 10foldcrossvalidationresults§ SVMclassifier

SameTopic:Givenapairofarticlesonthesametopic,whichoneis“VERYGOOD”?

System Accuracy

Baseline(random) 50%

HMM-productions 63%

Page 37: Document-level Text Quality: Models for Organization and

>>Predictingreaderinterest

37

Louis&Nenkova,TACL2013

Page 38: Document-level Text Quality: Models for Organization and

Predictinginterest:Anewtask

• Alotofworkonidentifyingwhatiswrongwithatext– Spellingmistakes,grammarerrors,incoherentwriting

• Itisnotknownhowtocharacterizewritingthatisengaging,interestingandnice

38

Page 39: Document-level Text Quality: Models for Organization and

Approachtofeaturedevelopment

• Focusoninterpretablefeatures– Only41features– Eachfeatureisacompositeone:indicatesanaspectdirectly– Linguisticallyinteresting

• Confirmthatfeaturesrepresenttheintendedaspect– Tunebycheckingfeaturevaluesonrandomsnippetsoftext

39

Page 40: Document-level Text Quality: Models for Organization and

1.UnusualwordsandphrasesIsthephrasingandlanguageuseunique?

• Word-based– highperplexityunderaphonemen-grammodel– Eg:‘undersheriff’,‘powwow’,‘chihuahua’,‘qipao’

• Wordpairs--based– adjective-noun,noun-noun,adverb-verb,subject-verbpairs– perplexityunderalanguagemodel– Eg:‘plastickywoman’,‘so-calledsuperkids’

40

Page 41: Document-level Text Quality: Models for Organization and

2.VisualnatureIstherescenesetting?

• Creatingalargelexiconofvisualterms– Source:animage-taggedcorpus– Largesourceofpotentiallyvisualwords,butnoisy

• CreateLDA-basedtopicsonthetagset– UsethemanualMRCtermstofilteroutnon-visualtopics

41

grass,mountain,green,hill,blue,field,sand...round,ball,circles,logo,dots,square,sphere...silver,white,diamond,gold,necklace,chain...

Page 42: Document-level Text Quality: Models for Organization and

Humaninterestandtextstructure

3. UseofpeopleinthestoryDoesthestoryrevolvearoundaperson?

– animacy informationfromNEs,pronouns,ngram patterns

4. Sub-genreIsthearticleisanarrative,interviewordialog

– Eg:narrativescore~pasttenseverbs,pronouns,propernames

42

Page 43: Document-level Text Quality: Models for Organization and

SentimentandResearch

5. AffectIsthereanemotionalangletothestory?

– usingsentimentworddictionaries

6. ResearchcontentHowmuchexplicitresearchdescriptionispresent?

– usingahand-builtdictionaryofresearchwords

43

Page 44: Document-level Text Quality: Models for Organization and

Howthefeaturesvaryinarandomsampleofverygoodandtypicalarticles(t-test)

Higher valuesinVERYGOODset

ü Visualwordsinbeginningandendofarticles

☓ Totalvisual words

☓Animacy countsü Unusualwordsandphrases

☓Narrative,interviewordialogformat

ü Sentimentwords,negativepolarity

ü Researchwords44

Page 45: Document-level Text Quality: Models for Organization and

AccuraciesonthetwotasksAnyTopic:Givenanarticle,isit“VERYGOOD”or“TYPICAL”?

System Accuracy

Baseline(random) 50%

Interesting-sciencefeatures

75%

§ 10foldcrossvalidationresults§ SVMclassifier

SameTopic:Givenapairofarticlesonthesametopic,whichoneis“VERYGOOD”?

System Accuracy

Baseline(random) 50%

Interesting-sciencefeatures

68%

45

Page 46: Document-level Text Quality: Models for Organization and

Combininginterestwithotheraspects

46

Featureset anytopic sametopic

Interesting science 75.3 68.0

PreviousmethodsforpredictingotheraspectsReadable(article length,language

model, cohesion,syntax)16features

65.5 63.0

Well-written(entitygrid[BL08],discourserelations[PN08])

23features

59.1 59.9

Interesting-fiction[ML09]22features

67.9 62.8

Combination ofallfeaturesAllwriting aspects 76.7 74.7

Differentaspectsofwritinghave

complementarystrengths

Genre-specificmeasuresarestrongerthangenericones

Page 47: Document-level Text Quality: Models for Organization and

Conclusions

• Textqualityisaninterestingandchallengingtask

• Moresuccessonthetopicrecently– applicationtonovels,tweets,essays

• Futurework– Alottobedoneintermsofformalizingthetasks,collectingdata,modelsandevaluation

– Transferringtheknowledgetogeneratingtexts

47

Page 48: Document-level Text Quality: Models for Organization and

Thankyou!

48