
PART THREE

Our Social World

Communicating with Others
  Deciding Authorship
  Children’s Recall of Pictorial Information
  The Meanings of Words
  The Sizes of Things

People at Work
  How Accountants Save Money by Sampling
  Preliminary Evaluation of a New Food Product
  Statistical Determination of Numerical Color Tolerances

People at School and Play
  Making Test Scores Fairer With Statistics
  Statistics, Sports, and Some Other Things

Counting People and Their Goods
  The Consumer Price Index
  How to Count Better: Using Statistics to Improve the Census
  How the Nation’s Employment and Unemployment Estimates Are Made
  The Development and Analysis of Economic Indicators

Deciding Authorship

Frederick Mosteller, Harvard University

David L. Wallace, University of Chicago

Art, music, literature, and the social, biological, and physical sciences share a common need to classify things: What artist painted the picture? Who composed the piece? Who wrote the document? If paroled, will the prisoner repeat the crime? What disease does the patient have? What trace chemical is damaging the process? In the field of statistics, we call these questions classification or discrimination problems.

Questions of authorship are frequent and sometimes important. Most people have heard of the Shakespeare-Bacon-Marlowe controversy over who wrote the great plays usually attributed to Shakespeare. A less well known but carefully studied question deals with the authorship of a number of Christian religious writings called the Paulines, some being books in the New Testament: Which ones were written by Paul and which by others? In many authorship questions the solution is easy once we set about counting something systematically. But we treat here an especially difficult problem from American history, the controversy over the authorship of the 12 Federalist papers claimed by both Alexander Hamilton and James Madison, and we show how a statistical analysis can contribute to the resolution of historical questions.

The Federalist papers were published anonymously in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of the state of New York to ratify the Constitution. Seventy-seven papers appeared as letters in New York newspapers over the pseudonym Publius. Together with eight more essays, they were published in book form in 1788 and have been republished repeatedly both in the United States and abroad. The Federalist remains today an important work in political philosophy. It is also the leading source of information for studying the intent of the framers of the Constitution, as, for example, in decisions on congressional reapportionment, since Madison had taken copious notes at the Constitutional Convention.

It was generally known who had written The Federalist, but no public assignment of specific papers to authors occurred until 1807, three years after Hamilton’s death as a result of his duel with Aaron Burr. Madison made his listing of authors only in 1818 after he had retired from the presidency. A variety of lists with conflicting claims have been disputed for a century and a half. There is general agreement on the authorship of 70 papers: 5 by Jay, 14 by Madison, and 51 by Hamilton. Of the remaining 15, 12 are in dispute between Hamilton and Madison, and 3 are joint works to a disputed extent. No doubt the primary reason the dispute exists is that Madison and Hamilton did not hurry to enter their claims. Within a few years after writing the essays, they had become bitter political enemies and each occasionally took positions opposing some of his own Federalist writings.

The political content of the essays has never provided convincing evidence for authorship. Since Hamilton and Madison were writing a brief in favor of ratification, they were like lawyers working for a client; they did not need to believe or endorse every argument they put forward favoring the new Constitution. While this does not mean that they would go out of their way to misrepresent their personal positions, it does mean that we cannot argue, "Hamilton wouldn’t have said that because he believed otherwise." And, as we have often seen, personal political positions change. Thus the political content of a disputed essay cannot give strong evidence in favor of Hamilton’s or of Madison’s having written it.

The acceptance of the various claims by historians has tended to change with political climate. Hamilton’s claims were favored during the last half of the nineteenth century, Madison’s since then. While the thorough historical studies of the historian Douglass Adair over several decades support the Madison claims, the total historical evidence is today not much different from that which historians like the elder Henry Cabot Lodge interpreted as favoring Hamilton. New evidence was needed to obtain definite attributions, and internal statistical stylistic evidence provides one possibility; developing that evidence and the methodology for interpreting it is the heart of our work.

The writings of Hamilton and Madison are difficult to tell apart because both authors were masters of the popular Spectator style of writing, complicated and oratorical. To illustrate the difficulty, in 1941 Frederick Williams and Frederick Mosteller counted sentence lengths for the undisputed papers and got averages of 34.5 and 34.6 words for Hamilton and Madison, respectively. For sentence length, a measure used successfully to distinguish other authors, Hamilton and Madison are practically twins.
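The sentence-length measure is easy to compute by machine. The sketch below is only an illustration in Python: the function name, the rule that a sentence ends at a period, exclamation point, or question mark, and the sample text are all assumptions made here, not the procedure Williams and Mosteller followed in 1941.

```python
import re


def mean_sentence_length(text: str) -> float:
    """Average number of words per sentence in `text`.

    Assumption: a sentence ends at '.', '!' or '?'. Real essays contain
    abbreviations and other cases that this simple rule would mishandle.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text contains no sentences")
    word_counts = [len(s.split()) for s in sentences]
    return sum(word_counts) / len(word_counts)


# Hypothetical two-sentence sample (4 words and 10 words).
sample = "The plan is sound. It will serve the people well, and it deserves support."
print(mean_sentence_length(sample))  # 7.0
```

Two authors whose averages agree as closely as 34.5 and 34.6 cannot be separated by a summary of this kind, which is why the study turns to word rates.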

MARKER WORDS

Although sentence length does measure complexity and an average of 35 words shows that the material is very complex, sentence length is not sensitive enough to distinguish reliably between authors writing in similar styles. The variables used in several studies of disputed authorship are the rates of occurrence of specific individual words. Our study was stimulated by Adair’s discovery, or rediscovery as it turned out, that Madison and Hamilton differ consistently in their choice between the alternative words while and whilst. In the 14 Federalist essays acknowledged to be written by Madison, while never occurs whereas whilst occurs in 8 of them. While occurs in 15 of 48 Hamilton essays, but never a whilst. We have here an instance of what are called markers: items whose presence provides a strong indication of authorship for one of the men. Thus the presence of whilst in 5 of the disputed papers points toward Madison’s authorship of those 5.

Markers contribute a lot to discrimination when they can be found, but they also present difficulties. First, while or whilst occurs in less than half of the papers. They are absent from the other half and hence give no evidence either way. We might hope to surmount this by finding enough different marker words or constructions so that one or more will always be present. A second and more serious difficulty is that from the evidence in 14 essays by Madison, we cannot be sure that he would never use while. Other writings of Madison were examined and, indeed, he did lapse on two occasions. The presence of while then is a good but not sure indication of Hamilton’s authorship; the presence of whilst is a better, but still imperfect, indicator of Madison’s authorship, for Hamilton too might lapse.

A central task of statistics is making inferences in the presence of uncertainty. Giving up the notion of perfect markers leads us to a statistical problem. We must find evidence, assess its strength, and combine it into a composite conclusion. Although the theoretical and practical problems may be difficult, the opportunity exists to assemble far more compelling evidence than even a few nearly perfect markers could provide.

RATES OF WORD USE

Instead of thinking of a word as a marker whose presence or absence settles the authorship of an essay, we can take the rate or relative frequency of the use of each word as a measure pointing toward one or the other author. Of course, most words won’t help because they were used at about the same rate by both authors. But since we have thousands of words available, some may help. Words form a huge pool of possible discriminators. From a systematic exploration of this pool of words, we found no more pairs like while-whilst, but we did find single words used by one author regularly but rarely by the other.

Table 1 shows the behavior of three words: commonly, innovation, and war. The table summarizes data from 48 political essays known to be written by Hamilton and 50 known to be by Madison. Some political essays from outside The Federalist, but known to be by Hamilton or Madison, have been included in this study to give a broader base for the inference. Not all of Hamilton’s later Federalist papers have been included. We gathered more papers from outside The Federalist for Madison.

Neither Hamilton nor Madison used commonly much, but Hamilton’s use is much more frequent than Madison’s. The table shows that in 31 of 48 Hamilton papers, the word commonly never occurs, but that in the other 17 it occurs one or more times. Madison used it only once in the 50 papers in our study. The papers vary in length from 900 to 3,500 words, with 2,000 about average. Even one occurrence in 900 words is a heavier usage than two occurrences in 3,500 words, so instead of working with the number of occurrences in a paper, we use the rate of occurrence, with 1,000 words as a convenient base. Thus, for example, the paper with the highest rate, 1.33 per 1,000 words, for commonly is a paper of 1,500 words with two occurrences. Innovation behaves similarly, but it is a marker for Madison. For each of these two words, the highest rates are a little over 1 per 1,000.
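The arithmetic behind a rate per 1,000 words can be checked with a few lines of Python. This is a minimal sketch; the function name is an assumption made for illustration, and the three calls reuse the paper lengths and counts quoted in the paragraph above.

```python
def rate_per_thousand(occurrences: int, paper_length: int) -> float:
    """Occurrences of a word per 1,000 words of text."""
    if paper_length <= 0:
        raise ValueError("paper_length must be positive")
    return 1000 * occurrences / paper_length


# Two occurrences in a 1,500-word paper: the highest rate for "commonly".
print(round(rate_per_thousand(2, 1500), 2))  # 1.33
# One occurrence in 900 words is heavier usage than two in 3,500 words.
print(round(rate_per_thousand(1, 900), 2))   # 1.11
print(round(rate_per_thousand(2, 3500), 2))  # 0.57
```

Working with rates rather than raw counts puts short and long papers on the same footing.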

Table 1 Frequency distributions of rate per 1,000 words in 48 Hamilton and 50 Madison papers for commonly, innovation, and war

Commonly
Rate per 1,000 words      H     M
0 exactly*               31    49
0-0.2                    Cannot occur†
0.2-0.4                   3     1
0.4-0.6                   6
0.6-0.8                   3
0.8-1.0                   2
1.0-1.2                   2
1.2-1.4                   1
Totals                   48    50

Innovation
Rate per 1,000 words      H     M
0 exactly*               47    34
0-0.2                    Cannot occur†
0.2-0.4                         6
0.4-0.6                   1     6
0.6-0.8                         1
0.8-1.0                         2
1.0-1.2                         1
Totals                   48    50

War
Rate per 1,000 words      H     M
0 exactly*               23    15
0-2                      16    13
2-4                       4     5
4-6                       2     4
6-8                       1     3
8-10                      1     3
10-12                           3
12-14                           2
14-16                     1     2
Totals                   48    50

*Each interval, except 0 exactly, excludes its upper endpoint. Thus a 2,000-word paper in which commonly appears twice gives rise to a rate of 1.0 per 1,000 exactly, and the paper appears in the count for the 1.0-1.2 interval.

†With the given lengths of the papers used, it accidentally happens that a rate in this interval cannot occur. For example, if a paper has 2,000 words, a rate of 1 per 1,000 means 2 words, and a single occurrence means a rate of 0.5 per 1,000. Hence a 2,000-word paper cannot lead to a rate per thousand greater than 0 and less than 0.5.

Source: Mosteller and Wallace 1984.
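The two footnote conventions of Table 1 (intervals that exclude their upper endpoint, and an interval that cannot occur) can be verified with a short sketch. The function name and the use of integer arithmetic to locate the interval are choices made here for illustration; only the interval width of 0.2 and the paper lengths come from the text.

```python
def rate_interval(occurrences: int, paper_length: int, bins_per_unit: int = 5) -> str:
    """Label of the Table 1 interval holding a paper's rate per 1,000 words.

    Each interval has width 1 / bins_per_unit (0.2 by default) and excludes
    its upper endpoint. Integer division avoids floating-point trouble at
    the interval boundaries.
    """
    if occurrences == 0:
        return "0 exactly"
    index = (1000 * occurrences * bins_per_unit) // paper_length
    width = 1 / bins_per_unit
    return f"{index * width:.1f}-{(index + 1) * width:.1f}"


print(rate_interval(2, 2000))  # 1.0-1.2, as in the first footnote
print(rate_interval(2, 1500))  # 1.2-1.4, the highest rate for "commonly"
# The longest paper has 3,500 words, so the smallest positive rate is
# 1000 / 3500, about 0.29, and no paper can land in the 0-0.2 interval.
print(rate_interval(1, 3500))  # 0.2-0.4
```

A paper would need more than 5,000 words before a single occurrence could give a rate below 0.2 per 1,000.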


The word war has spectacularly different behavior. Although absent from half of Hamilton’s papers, when present it is used frequently, in one paper at a rate of 14 per 1,000 words. The Federalist papers deal with specific topics in the Constitution and huge variations in the rates of such words as war, law, executive, liberty, and trade can be expected according to the context of the paper. Even though Madison uses war considerably more often than Hamilton in the undisputed papers, we explain this more by the division of tasks than by Madison’s predilection for using war. Data from use of a word like war would give the same troublesome sort of evidence that historians have disagreed about over the last hundred years. Indeed, the dispute has continued because evidence from subject and content has been hopelessly inconclusive.

USE OF NONCONTEXTUAL WORDS

For the statistical arguments to be valid, information from meaningful, contextual words must be largely discarded. Such a study of authorship will not then contribute directly to any understanding of the greatness of the papers, but the evidence of authorship can be both strengthened and made independent of evidence provided by historical analysis.

Avoidance of judgments about meaningfulness or importance is common in classification and identification procedures. When art critics try to authenticate a picture, in addition to the historical record, they consider little things: how fingernails and ears are painted and what kind of paint and canvas were used. Relatively little of the final judgment is based upon the painting’s artistic excellence. In the same way, police often identify people by their fingerprints, dental records, and scars, without reference to their personality, occupation, or position in society. For literary identification, we need not necessarily be clever about the appraisal of literary style, although it helps in some problems. To identify an object, we need not appreciate its full value or meaning.

What noncontextual words are good candidates for discriminating between authors? Most attractive are the filler words of the language: prepositions, conjunctions, articles. Many other more meaningful words also seem relatively free from context: adverbs such as commonly, consequently, particularly, or even abstract nouns like vigor or innovation. We want words whose use is unrelated to the topic and may be regarded as reflecting minor or perhaps unconscious preferences of the author.

Consider what can be done with filler words. Some of these are the most used words in the language: the, and, of, to, and so on. No one writes without them, but we may find that their rates of use differ from author to author. Table 2 shows the distribution of rates for three prepositions: by, from, and to. First, note the variation from paper to paper. Madison uses by typically about 12 times per 1,000 words, but sometimes he has rates as high as 18 or as low as 6. Even on inspection though, the variation does not obscure Madison’s systematic tendency to use by more often than Hamilton does. Thus low rates for by suggest Hamilton’s authorship, and high rates Madison’s. Rates for to run in the opposite direction. Very high rates for from point to Madison but low rates give practically no information. The more widely the distributions are separated, the stronger the discriminating power of the word. Here, by discriminates better than to, which in turn is better than from.

Table 2 Frequency distribution of rate per 1,000 words in 48 Hamilton and 50 Madison papers for by, from, and to

By
Rate per 1,000 words      H     M
1-3*                      2
3-5                       7
5-7                      12     5
7-9                      18     7
9-11                      4     8
11-13                     5    16
13-15                           6
15-17                           5
17-19                           3
Totals                   48    50

From
Rate per 1,000 words      H     M
1-3*                      3     3
3-5                      15    19
5-7                      21    17
7-9                       9     6
9-11                            1
11-13                           3
13-15                           1
Totals                   48    50

To
Rate per 1,000 words      H     M
20-25                           3
25-30                     2     5
30-35                     6    19
35-40                    14    12
40-45                    15     9
45-50                     8     2
50-55                     2
55-60                     1
Totals                   48    50

*Each interval excludes its upper endpoint. Thus a paper with a rate of exactly 3 per 1,000 words would appear in the count for the 3-5 interval.

Source: Mosteller and Wallace 1984.
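The separation between the two authors on by can be summarized from the grouped counts of Table 2. The sketch below treats every paper in an interval as if its rate sat at the interval midpoint, which is an approximation introduced here, not a step taken in the study; the variable and function names are likewise illustrative.

```python
# Intervals [low, high) and paper counts for the word "by" from Table 2.
# Empty table cells are entered as 0.
BY_BINS = [(1, 3), (3, 5), (5, 7), (7, 9), (9, 11), (11, 13), (13, 15), (15, 17), (17, 19)]
BY_HAMILTON = [2, 7, 12, 18, 4, 5, 0, 0, 0]
BY_MADISON = [0, 0, 5, 7, 8, 16, 6, 5, 3]


def grouped_mean(bins, counts):
    """Approximate mean rate, placing each paper at its interval midpoint."""
    midpoints = [(low + high) / 2 for low, high in bins]
    return sum(m * c for m, c in zip(midpoints, counts)) / sum(counts)


print(grouped_mean(BY_BINS, BY_HAMILTON))            # 7.25
print(round(grouped_mean(BY_BINS, BY_MADISON), 2))   # 11.52
```

The gap of roughly 4 occurrences per 1,000 words is large compared with the spread within either author, which is what makes by a useful discriminator.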

PROBABILITY MODELS

To apply any of the theory of statistical inference to evidence from word rates, we must construct an acceptable probability model to represent the variability in word rate from paper to paper. Setting up a complete model for the occurrence of even a single word would be a hopeless task, for the fine structure within a sentence is determined in large measure by nonrandom elements of grammar, meaning, and style. But if our interest is restricted to the rates of use of one or more words in blocks of text of at least 100 or 200 words, we expect that detailed structure of phrases and sentences ought not to be very important. The simplest model can be described in the language of balls in an urn, so common in classical probability. To represent Madison’s usage of the word by, we suppose there is a typical Madison rate, which would be somewhere near 12 per 1,000, and we imagine an urn filled with many thousands of red and black balls, with the red occurring in the proportion 12 per 1,000. Our probability model for the occurrence of by is the same as the probability model for successive draws from the urn, with a red ball corresponding to by and a black ball corresponding to all other words. To extend the model to the simultaneous study of two or more words, we would need balls of three or more colors. No grammatical structure or meaning is a part of this model, and it is not intended to represent behavior within sentences. What is desired is that it explain the variation in rates, in counts of occurrences in long blocks of words, corresponding to the essays.
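The urn model is simple enough to simulate. The sketch below is an illustration only: it assumes every word is an independent draw with the same chance of being red (so the urn is effectively sampled with replacement), and the paper length of 2,000 words, the 50 simulated papers, and the random seed are arbitrary choices made here.

```python
import random


def simulate_paper_count(rate_per_thousand: float, paper_length: int,
                         rng: random.Random) -> int:
    """Number of red balls ("by") in `paper_length` independent draws."""
    p = rate_per_thousand / 1000
    return sum(1 for _ in range(paper_length) if rng.random() < p)


rng = random.Random(0)  # fixed seed so the run can be repeated
counts = [simulate_paper_count(12, 2000, rng) for _ in range(50)]
rates = [1000 * c / 2000 for c in counts]
print(min(rates), sum(rates) / len(rates), max(rates))
```

Comparing the spread of simulated rates with the spread in Table 2 is one informal way to see whether chance variation alone could account for the paper-to-paper differences.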


We tested the model by comparing its predictions with actual counts of word frequencies in the papers. We found that while this urn scheme reproduced variability well for many words, for other words additional variability was required. The random variation of the urn scheme represented most of the variation in counts from one essay to another, but in some essays the authors changed their basic rates a bit. We had to complicate the theoretical model to allow for this, and the model we used is called the negative binomial distribution.
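The extra variability can be made concrete by comparing two count distributions with the same mean. For long papers the urn scheme is well approximated by a Poisson distribution, whose variance equals its mean; a negative binomial with mean m and shape k has variance m + m²/k. The sketch below checks this numerically. The mean of 24 (a rate of 12 per 1,000 in a 2,000-word paper) and the shape value 20 are hypothetical numbers chosen for illustration, not parameters fitted in the study.

```python
import math


def poisson_pmf(x: int, mean: float) -> float:
    """P(X = x) for a Poisson count with the given mean."""
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))


def neg_binomial_pmf(x: int, mean: float, shape: float) -> float:
    """P(X = x) for a negative binomial with the given mean and shape k."""
    p = shape / (shape + mean)
    log_pmf = (math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
               + shape * math.log(p) + x * math.log(1 - p))
    return math.exp(log_pmf)


def mean_and_variance(pmf, upper: int = 400):
    """Mean and variance computed by summing the pmf over 0..upper-1."""
    probs = [pmf(x) for x in range(upper)]
    mean = sum(x * p for x, p in enumerate(probs))
    variance = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, variance


MEAN_COUNT = 24   # hypothetical: 12 per 1,000 in a 2,000-word paper
SHAPE = 20        # hypothetical shape parameter
print(mean_and_variance(lambda x: poisson_pmf(x, MEAN_COUNT)))
print(mean_and_variance(lambda x: neg_binomial_pmf(x, MEAN_COUNT, SHAPE)))
```

Both distributions have mean 24, but the negative binomial variance is 24 + 24²/20 = 52.8 against 24 for the Poisson, which is the kind of added spread that arises when an author's basic rate shifts a little from essay to essay.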

The test showed also that pronouns like his and her are exceedingly unreliable authorship indicators, worse even than words like war.

INFERENCE AND RESULTS

Each possible route from construction of models to quantitative assessment of, say, Madison’s authorship of some disputed paper, required solutions of serious theoretical statistical problems, and new mathematics had to be developed. A chief motivation for us was to use the Federalist problem as a case study for comparing several different statistical approaches, with special attention to one, called the Bayesian method, that expresses its final results in terms of probabilities, or odds, of authorship.

By whatever methods are used, the results are the same: overwhelming evidence for Madison’s authorship of the disputed papers. For only one paper is the evidence more modest, and even there the most thorough study leads to odds of 80 to 1 in favor of Madison.
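Odds and probabilities are two ways of stating the same degree of belief, and Bayes’ rule in its odds form says that the final odds equal the initial odds multiplied by the likelihood ratio supplied by the data. The sketch below shows only this bookkeeping. The function names are illustrative, and the even initial odds and likelihood ratio of 80 in the last line are hypothetical inputs chosen so that the result matches the 80 to 1 figure; the study’s actual prior and likelihood calculations are far more elaborate.

```python
def posterior_odds(prior_odds: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: final odds = initial odds x likelihood ratio."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds: float) -> float:
    """Convert odds in favor (for example 80 for '80 to 1') to a probability."""
    return odds / (1 + odds)


print(round(odds_to_probability(80), 4))   # 0.9877
print(round(odds_to_probability(800), 4))  # 0.9988
# Hypothetical: even initial odds combined with a likelihood ratio of 80.
print(posterior_odds(1.0, 80.0))           # 80.0
```

Odds of 80 to 1 thus correspond to a probability just under 0.99 for Madison.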

Figures 1 and 2 illustrate how the 12 disputed papers fit the distributions of Hamilton’s and Madison’s rates for two of the words finally chosen as discriminators. In Figure 1 the top two histograms portray the data for by that was given earlier in Table 2. Madison’s rate runs higher on the average. Compare the bottom histogram for the disputed papers first with the top histogram for Hamilton papers, then with the second one for Madison papers. The rates in the disputed papers are, taken as a whole, very Madisonian, although 3 of the 12 papers by themselves are slightly on the Hamilton side of the typical rates. Figure 2 shows the corresponding facts for to. Here again the disputed papers are consistent with Madison’s distribution, but further away from the Hamilton behavior than are the known Madison papers.

Table 3 shows the 30 words used in the final inference, along with the estimated mean rates per thousand in Hamilton’s and Madison’s writings. The groups are based upon the degree of contextuality anticipated by Mosteller and Wallace 1984 prior to the analysis.

The combined evidence from nine common filler words shown as group B was huge, much more important than the combined evidence from 20 low-frequency marker words like while-whilst. These 20 are shown as groups C, D, and E.

There remains one word that showed up early as a powerful discriminator, sufficient almost by itself. When should one write upon instead of on? Even authoritative books on English usage don’t provide good rules. Hamilton and Madison differ tremendously. Hamilton writes on and upon almost equally, about 3 times per 1,000 words. Madison, on the other hand, rarely uses upon. Table 4 shows the distributions for upon. In 48 papers Hamilton never failed to use upon; indeed, he never used it less than twice. Madison used it in only 9 of 50 papers, and then only with low rates. The disputed papers are clearly Madisonian with upon occurring in only 1 paper. That paper, fortunately, is strongly classified by the other words. It is not the paper with modest overall odds.

Figure 1 Distribution of rates of occurrence of by in 48 Hamilton papers, 50 Madison papers, 12 disputed papers. [Three histograms, labeled Hamilton, Madison, and Disputed; horizontal axis: rate per 1,000 words.]

Figure 2 Distribution of rates of occurrence of to in 48 Hamilton papers, 50 Madison papers, 12 disputed papers. [Three histograms, labeled Hamilton, Madison, and Disputed; horizontal axis: rate per 1,000 words.]
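To get a feeling for how strongly a single word such as upon can speak, consider a deliberately simplified calculation. The sketch below assumes that word counts follow a Poisson distribution with the mean rates listed in Table 3, that different words are independent, and that the two authors start on even terms. These are simplifications adopted here for illustration; the study itself relied on the negative binomial model and a much fuller analysis. The function name, the 2,000-word paper, and the counts are hypothetical.

```python
import math

# Mean rates per 1,000 words from Table 3: (Hamilton, Madison).
RATES = {"upon": (3.24, 0.23), "by": (7.32, 11.43), "to": (40.79, 35.21)}


def log_odds_for_madison(counts: dict, paper_length: int) -> float:
    """Sum of Poisson log likelihood ratios, Madison versus Hamilton.

    Positive values favor Madison, negative values favor Hamilton.
    """
    thousands = paper_length / 1000
    total = 0.0
    for word, count in counts.items():
        hamilton_rate, madison_rate = RATES[word]
        mu_h = hamilton_rate * thousands
        mu_m = madison_rate * thousands
        # log[Poisson(count; mu_m) / Poisson(count; mu_h)]; count! cancels.
        total += count * math.log(mu_m / mu_h) - (mu_m - mu_h)
    return total


# Hypothetical 2,000-word paper in which "upon" never appears.
print(round(math.exp(log_odds_for_madison({"upon": 0}, 2000))))  # 412
# Hypothetical 2,000-word paper with six occurrences of "upon".
print(log_odds_for_madison({"upon": 6}, 2000) < 0)               # True
```

Under these assumptions the mere absence of upon from a 2,000-word paper already gives odds of roughly 400 to 1 for Madison, which conveys why the word is described as almost sufficient by itself.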


Table 3 Words used in final discrimination and adjusted rates of use in text by Madison and Hamilton

                     Rate per 1,000 words
Word                 Hamilton    Madison

Group A
Upon                     3.24       0.23

Group B
Also                     0.32       0.67
An                       5.95       4.58
By                       7.32      11.43
Of                      64.51      57.89
On                       3.38       7.75
There                    3.20       1.33
This                     7.77       6.00
To                      40.79      35.21

Group C
Although                 0.06       0.17
Both                     0.52       1.04
Enough                   0.25       0.10
While                    0.21       0.07
Whilst                   0.08       0.42
Always                   0.58       0.20
Though                   0.91       0.51

Group D
Commonly                 0.17       0.05
Consequently             0.10       0.42
Considerable(ly)         0.37       0.17
According                0.17       0.54
Apt                      0.27       0.08

Group E
Direction                0.17       0.08
Innovation(s)            0.06       0.15
Language                 0.08       0.18
Vigor(ous)               0.18       0.08
Kind                     0.69       0.17
Matter(s)                0.36       0.09
Particularly             0.15       0.37
Probability              0.27       0.09
Work(s)                  0.13       0.27

Source: Mosteller and Wallace 1984.

Table 4 Frequency distribution of rate per 1,000 words in 48 Hamilton, 50 Madison, and 12 disputed papers for upon

Rate per 1,000 words   Hamilton   Madison   Disputed
0 exactly*                             41         11
0-0.4                                   2
0.4-0.8                                 4
0.8-1.2                       2         1          1
1.2-1.6                       3         2
1.6-2.0                       6
2.0-3.0                      11
3.0-4.0                      11
4.0-5.0                      10
5.0-6.0                       3
6.0-7.0                       1
7.0-8.0                       1
Totals                       48        50         12

*Each interval, except 0 exactly, excludes its upper endpoint. Thus a paper with a rate of exactly 3 per 1,000 words would appear in the count for the 3.0-4.0 interval.

Source: Mosteller and Wallace 1984.


Of course, combining and assessing the total evidence is a large statistical and computational task. High-speed computers were employed for many hours in making the calculations, both mathematical calculations for the theory and empirical ones for the data.

You may have wondered about John Jay. Might he not have had a hand in the disputed papers? Table 5 shows the rates per thousand for nine words of highest frequency in the English language measured in the writings of Hamilton, Madison, Jay, and, for a change of pace, in James Joyce’s Ulysses. The table supports the repeated assertion that Madison and Hamilton are similar. Joyce is much different, but so is John Jay. The words of and to with rate comparisons 65/58 and 41/35 were among the final discriminators between Hamilton and Madison. See how much more easily Jay could be discriminated from either Hamilton or Madison by using the, of, and, a, and that. The disputed papers are not at all consistent with Jay’s rates, and there is no reason to question his omission from the dispute.
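The remark about Jay can be checked directly against the rates in Table 5. The sketch below uses the raw difference in rates as a crude yardstick, which is a simplification chosen here (it ignores sampling variability and is not the measure used in the study); the names are illustrative.

```python
# Rates per 1,000 words from Table 5: (Hamilton, Madison, Jay, Joyce).
TABLE_5 = {
    "the": (91, 94, 67, 57),
    "of": (65, 58, 44, 30),
    "to": (41, 35, 36, 18),
    "and": (25, 28, 45, 28),
    "in": (24, 23, 21, 19),
    "a": (23, 20, 14, 25),
    "be": (20, 16, 19, 3),
    "that": (15, 14, 20, 12),
    "it": (14, 13, 17, 9),
}


def jay_gap(word: str) -> int:
    """Smaller of Jay's rate differences from Hamilton and from Madison."""
    hamilton, madison, jay, _ = TABLE_5[word]
    return min(abs(jay - hamilton), abs(jay - madison))


def hamilton_madison_gap(word: str) -> int:
    """Difference between Hamilton's and Madison's rates."""
    hamilton, madison, _, _ = TABLE_5[word]
    return abs(hamilton - madison)


ranked = sorted(TABLE_5, key=jay_gap, reverse=True)
print(ranked[:5])                                        # ['the', 'and', 'of', 'a', 'that']
print(max(hamilton_madison_gap(w) for w in TABLE_5))     # 7
```

The five words that separate Jay most clearly from both men are the ones named in the text, and each of their gaps is at least 5 per 1,000, while the largest gap between Hamilton and Madison on any of the nine words is 7.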

SUMMARY OF RESULTS

Our data independently supplement the evidence of the historians. Madison is extremely likely, in the sense of degree of belief, to have written the disputed Federalist papers, with the possible exception of paper 55, and there our evidence yields odds of 80 to 1 for Madison: strong, but not overwhelming. Paper 56, next weakest, is a very strong 800 to 1 for Madison. The data are overwhelming for all the rest, including the two papers historians feel weakest about, papers 62 and 63.

For a more extensive discussion of this problem, including historical details, discussion of actual techniques, and a variety of alternative analyses, as well as for a brief review of authorship studies since 1969, see Mosteller and Wallace 1984.

Table 5 Word rates for high-frequency words (rates per 1,000 words)

          Hamilton    Madison    Jay       Joyce (Ulysses)
          94,000*     114,000*   5,000*    260,000*
The          91          94        67         57
Of           65          58        44         30
To           41          35        36         18
And          25          28        45         28
In           24          23        21         19
A            23          20        14         25
Be           20          16        19          3
That         15          14        20         12
It           14          13        17          9

*The number of words of text counted to determine rates.

Sources: Hanley 1937; Mosteller and Wallace 1984.


PROBLEMS

1. Why can’t the authorship of the disputed papers be determined by literary style or political philosophy?

2. a. What is a discriminator?
   b. Distinguish at least two categories of discriminators.
   c. Why is by a good discriminator? Refer to Table 2.

3. What is a "noncontextual word"?

4. Why do the authors use word frequency per thousand words instead of just the number of occurrences?

5. Refer to Table 1. In how many of the Hamilton papers studied does the word commonly appear at least once?

6. Refer to Table 2. In what percentage of the Madison papers studied does from occur 3-7 times per 1,000 words? Note: The interval 3-7 uses the authors’ convention on intervals.

7. Consider the "balls in an urn" model. How many colors of balls would we need to extend the model to the simultaneous study of five words? Of n words?

8. Consider Figure 1. True or false: More than 1/3 of the Hamilton papers studied use by 3-7 times per 1,000 words.

9. Study Figure 2. Does the graph for the disputed papers look more like the graph for the Hamilton or the Madison papers?

10. Consider Table 3. Looking at group B, which word would you say was the best Hamilton/Madison discriminator? What was your word-selection criterion? Answer the same questions for group D.

11. Table 1 shows the relative frequency of war. Why doesn’t war appear in Table 3?

REFERENCES

Miles L. Hanley. 1937. Word Index to James Joyce’s "Ulysses." Madison, Wis.: University of Wisconsin.

F. Mosteller and D. L. Wallace. 1964. Inference and Disputed Authorship: The Federalist. Reading, Mass.: Addison-Wesley.

F. Mosteller and D. L. Wallace. 1984. Applied Bayesian and Classical Inference: The Case of "The Federalist Papers," 2nd ed. of Inference and Disputed Authorship: The Federalist. New York: Springer-Verlag.