Computational and Inferential Thinking


Table of Contents

Introduction
1  Data Science
  1.1  Introduction
    1.1.1  Computational Tools
    1.1.2  Statistical Techniques
  1.2  Why Data Science?
    1.2.1  Example: Plotting the Classics
  1.3  Causality and Experiments
    1.3.1  Observation and Visualization: John Snow and the Broad Street Pump
    1.3.2  Snow’s “Grand Experiment”
    1.3.3  Establishing Causality
    1.3.4  Randomization
    1.3.5  Endnote
  1.4  Expressions
    1.4.1  Names
    1.4.2  Example: Growth Rates
    1.4.3  Call Expressions
    1.4.4  Data Types
    1.4.5  Comparisons
  1.5  Sequences
    1.5.1  Arrays
    1.5.2  Ranges
  1.6  Tables
    1.6.1  Functions
    1.6.2  Functions and Tables
2  Visualization
  2.1  Bar Charts
  2.2  Histograms
3  Sampling
  3.1  Sampling
  3.2  Iteration
  3.3  Estimation
  3.4  Center
  3.5  Spread
  3.6  The Normal Distribution
  3.7  Exploration: Privacy
4  Prediction
  4.1  Correlation
  4.2  Regression
  4.3  Prediction
  4.4  Higher-Order Functions
  4.5  Errors
  4.6  Multiple Regression
  4.7  Classification
  4.8  Features
5  Inference
  5.1  Confidence Intervals
  5.2  Distance Between Distributions
  5.3  Hypothesis Testing
  5.4  Permutation
  5.5  A/B Testing
  5.6  Regression Inference
  5.7  Regression Diagnostics
6  Probability
  6.1  Conditional Probability

Computational and Inferential Thinking
The Foundations of Data Science

By Ani Adhikari and John DeNero
Contributions by David Wagner

This is the textbook for the Foundations of Data Science class at UC Berkeley.

View this textbook online on Gitbooks.


What is Data Science

Data Science is about drawing useful conclusions from large and diverse data sets through exploration, prediction, and inference. Exploration involves identifying patterns in information. Prediction involves using information we know to make informed guesses about values we wish we knew. Inference involves quantifying our degree of certainty: will those patterns we found also appear in new observations? How accurate are our predictions? Our primary tools for exploration are visualizations, for prediction are machine learning and optimization, and for inference are statistical tests and models.

Statistics is a central component of data science because statistics studies how to make robust conclusions with incomplete information. Computing is a central component because programming allows us to apply analysis techniques to the large and diverse data sets that arise in real-world applications: not just numbers, but text, images, videos, and sensor readings. Data science is all of these things, but it is more than the sum of its parts because of the applications. Through understanding a particular domain, data scientists learn to ask appropriate questions about their data and correctly interpret the answers provided by our inferential and computational tools.


Chapter 1: Introduction

Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference. This text develops a consistent approach to all three, introducing statistical ideas and fundamental ideas in computer science concurrently. We focus on a minimal set of core techniques that apply to a vast range of real-world applications. A foundation in data science requires not only understanding statistical and computational techniques, but also recognizing how they apply to real scenarios.

For whatever aspect of the world we wish to study---whether it's the Earth's weather, the world's markets, political polls, or the human mind---data we collect typically offer an incomplete description of the subject at hand. The central challenge of data science is to make reliable conclusions using this partial information.

In this endeavor, we will combine two essential tools: computation and randomization. For example, we may want to understand climate change trends using temperature observations. Computers will allow us to use all available information to draw conclusions. Rather than focusing only on the average temperature of a region, we will consider the whole range of temperatures together to construct a more nuanced analysis. Randomness will allow us to consider the many different ways in which incomplete information might be completed. Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a way to imagine many possible scenarios that are all consistent with the data we observe.

Applying this approach requires learning to program a computer, and so this text interleaves a complete introduction to programming that assumes no prior knowledge. Readers with programming experience will find that we cover several topics in computation that do not appear in a typical introductory computer science curriculum. Data science also requires careful reasoning about quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, techniques are described to readers in the same language in which they are described to the computers that execute them---a programming language.


Computational Tools

This text uses the Python 3 programming language, along with a standard set of numerical and data visualization tools that are used widely in commercial applications, scientific experiments, and open-source projects. Python has recruited enthusiasts from many professions that use data to draw conclusions. By learning the Python language, you will join a million-person-strong community of software developers and data scientists.

Getting Started. The easiest and recommended way to start writing programs in Python is to log into the companion site for this text, data8.berkeley.edu. If you are enrolled in a course offering that uses this text, you should have full access to the programming environment hosted on that site. The first time you log in, you will find a brief tutorial describing how to proceed.

You are not at all restricted to using this web-based programming environment. A Python program can be executed by any computer, regardless of its manufacturer or operating system, provided that support for the language is installed. If you wish to install the version of Python and its accompanying libraries that will match this text, we recommend the Anaconda distribution that packages together the Python 3 language interpreter, IPython libraries, and the Jupyter notebook environment.

This text includes a complete introduction to all of these computational tools. You will learn to write programs, generate images from data, and work with real-world data sets that are published online.


Statistical Techniques

The discipline of statistics has long addressed the same fundamental challenge as data science: how to draw conclusions about the world using incomplete information. One of the most important contributions of statistics is a consistent and precise vocabulary for describing the relationship between observations and conclusions. This text continues in the same tradition, focusing on a set of core inferential problems from statistics: testing hypotheses, estimating confidence, and predicting unknown quantities.

Data science extends the field of statistics by taking full advantage of computing, data visualization, and access to information. The combination of fast computers and the Internet gives anyone the ability to access and analyze vast datasets: millions of news articles, full encyclopedias, databases for any domain, and massive repositories of music, photos, and video.

Applications to real data sets motivate the statistical techniques that we describe throughout the text. Real data often do not follow regular patterns or match standard equations. The interesting variation in real data can be lost by focusing too much attention on simplistic summaries such as average values. Computers enable a family of methods based on resampling that apply to a wide range of different inference problems, take into account all available information, and require few assumptions or conditions. Although these techniques have often been reserved for graduate courses in statistics, their flexibility and simplicity are a natural fit for data science applications.


Why Data Science?

Most important decisions are made with only partial information and uncertain outcomes. However, the degree of uncertainty for many decisions can be reduced sharply by public access to large data sets and the computational tools required to analyze them effectively. Data-driven decision making has already transformed a tremendous breadth of industries, including finance, advertising, manufacturing, and real estate. At the same time, a wide range of academic disciplines are evolving rapidly to incorporate large-scale data analysis into their theory and practice.

Studying data science enables individuals to bring these techniques to bear on their work, their scientific endeavors, and their personal decisions. Critical thinking has long been a hallmark of a rigorous education, but critiques are often most effective when supported by data. A critical analysis of any aspect of the world, be it business or social science, involves inductive reasoning---conclusions can rarely be proven outright, only supported by the available evidence. Data science provides the means to make precise, reliable, and quantitative arguments about any set of observations. With unprecedented access to information and computing, critical thinking about any aspect of the world that can be measured would be incomplete without the inferential techniques that are core to data science.

The world has too many unanswered questions and difficult challenges to leave this critical reasoning to only a few specialists. However, all educated adults can build the capacity to reason about data. The tools, techniques, and data sets are all readily available; this text aims to make them accessible to everyone.


Example: Plotting the Classics

In this example, we will explore statistics for two classic novels: The Adventures of Huckleberry Finn by Mark Twain, and Little Women by Louisa May Alcott. The text of any book can be read by a computer at great speed. Books published before 1923 are currently in the public domain, meaning that everyone has the right to copy or use the text in any way. Project Gutenberg is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.

The features of Python used in this example will be explained in detail later in the course. This example is meant to illustrate some of the broad themes of this text. Don't worry if the details of the program don't yet make sense. Instead, focus on interpreting the images generated below. The "Expressions" section later in this chapter will describe most of the features of the Python programming language used below.

First, we read the text of both books into lists of chapters, called huck_finn_chapters and little_women_chapters. In Python, a name cannot contain any spaces, and so we will often use an underscore _ to stand in for a space. The = in the lines below gives a name on the left to the result of some computation described on the right. A uniform resource locator or URL is an address on the Internet for some content; in this case, the text of a book.

# Read two books, fast!
huck_finn_url = 'http://www.gutenberg.org/cache/epub/76/pg76.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER')[44:]

little_women_url = 'http://www.gutenberg.org/cache/epub/514/pg514.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER')[1:]
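
The read_url function above is a helper from the book's companion environment, not a Python built-in. A minimal stand-in, assuming plain urllib and that the whole text fits in memory, might look like this:

```python
from urllib.request import urlopen

# Hypothetical stand-in for the book's read_url helper (not the
# authors' actual implementation): fetch a URL and return its
# contents as decoded text.
def read_url(url):
    return urlopen(url).read().decode('utf-8', errors='ignore')
```

Splitting the returned string on the word 'CHAPTER', as above, then yields the list of chapters.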

While a computer cannot understand the text of a book, we can still use it to provide us with some insight into the structure of the text. The name huck_finn_chapters is currently bound to a list of all the chapters in the book. We can place those chapters into a table to see how each begins.


Table([huck_finn_chapters], ['Chapters'])

Chapters
I. YOU don't know about me without you have read a book...
II. WE went tiptoeing along a path amongst the trees bac...
III. WELL, I got a good going-over in the morning from o...
IV. WELL, three or four months run along, and it was wel...
V. I had shut the door to. Then I turned around and ther...
VI. WELL, pretty soon the old man was up and around agai...
VII. "GIT up! What you 'bout?" I opened my eyes and look...
VIII. THE sun was up so high when I waked that I judged...
IX. I wanted to go and look at a place right about the m...
X. AFTER breakfast I wanted to talk about the dead man a...
... (33 rows omitted)

Each chapter begins with a chapter number in Roman numerals, followed by the first sentence of the chapter. Project Gutenberg has printed the first word of each chapter in uppercase.

The book describes a journey that Huck and Jim take along the Mississippi River. Tom Sawyer joins them towards the end as the action heats up. We can quickly visualize how many times these characters have each been mentioned at any point in the book.

counts = Table([np.char.count(huck_finn_chapters, "Jim"),
                np.char.count(huck_finn_chapters, "Tom"),
                np.char.count(huck_finn_chapters, "Huck")],
               ["Jim", "Tom", "Huck"])
counts.cumsum().plot(overlay=True)
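
The counting step can be sketched on a toy array (invented "chapters", not the novel's text):

```python
import numpy as np

# Count occurrences of a name in each chapter, then accumulate the
# counts; the cumulative totals are what the plot above displays.
chapters = np.array(["Jim and Huck", "Jim, Jim, and Tom", "Tom alone"])
per_chapter = np.char.count(chapters, "Jim")
running_total = np.cumsum(per_chapter)
print(per_chapter)     # [1 2 0]
print(running_total)   # [1 3 3]
```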


In the plot above, the horizontal axis shows chapter numbers and the vertical axis shows how many times each character has been mentioned so far. You can see that Jim is a central character by the large number of times his name appears. Notice how Tom is hardly mentioned for much of the book until after Chapter 30. His curve and Jim's rise sharply at that point, as the action involving both of them intensifies. As for Huck, his name hardly appears at all, because he is the narrator.

Little Women is a story of four sisters growing up together during the Civil War. In this book, chapter numbers are spelled out and chapter titles are written in all capital letters.

Table([little_women_chapters], ['Chapters'])

Chapters
ONE PLAYING PILGRIMS "Christmas won't be Christmas witho...
TWO A MERRY CHRISTMAS Jo was the first to wake in the gr...
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me...
FOUR BURDENS "Oh, dear, how hard it does seem to take up...
FIVE BEING NEIGHBORLY "What in the world are you going t...
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr...
SEVEN AMY'S VALLEY OF HUMILIATION "That boy is a perfect...
EIGHT JO MEETS APOLLYON "Girls, where are you going?" as...
NINE MEG GOES TO VANITY FAIR "I do think it was the most...
TEN THE P.C. AND P.O. As spring came on, a new set of am...
... (37 rows omitted)


We can track the mentions of main characters to learn about the plot of this book as well. The protagonist Jo interacts with her sisters Meg, Beth, and Amy regularly, up until Chapter 27 when she moves to New York alone.

people = ["Meg", "Jo", "Beth", "Amy", "Laurie"]
people_counts = Table([np.char.count(little_women_chapters, pp) for pp in people], people)
people_counts.cumsum().plot(overlay=True)

Laurie is a young man who marries one of the girls in the end. See if you can use the plots to guess which one.

In addition to visualizing data to find patterns, data science is often concerned with whether two similar patterns are really the same or different. In the context of novels, the word "character" has a second meaning: a printed symbol such as a letter or number. Below, we compare the counts of this type of character in the first chapter of Little Women.

from collections import Counter

chapter_one = little_women_chapters[0]
counts = Counter(chapter_one.lower())
letters = Table([counts.keys(), counts.values()], ['Letters', 'Chapter 1 Count'])
letters.sort('Letters').show(20)
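
Counter itself can be sketched on a short string (a toy example, not the chapter text):

```python
from collections import Counter

# Tally every character in a lowercased string; Counter maps each
# character to the number of times it appears.
tally = Counter("Banana splits!".lower())
print(tally['a'], tally['n'], tally['s'], tally['!'])  # 3 2 2 1
```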


Letters   Chapter 1 Count
          4102
!         28
"         190
'         125
,         423
-         21
.         189
;         4
?         16
_         6
a         1352
b         290
c         366
d         788
e         1981
f         339
g         446
h         1069
i         1071
j         51
... (16 rows omitted)

How do these counts compare with the character counts in Chapter 2? By plotting both sets of counts together, we can see slight variations for every character. Is that difference meaningful? How sure can we be? Such questions will be addressed precisely in this text.

counts = Counter(little_women_chapters[1].lower())
two_letters = Table([counts.keys(), counts.values()], ['Letters', 'Chapter 2 Count'])
compare = letters.join('Letters', two_letters)
compare.barh('Letters', overlay=True)
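
The join step pairs the two sets of counts on their shared characters; a rough dictionary-based sketch (toy strings, not the chapters):

```python
from collections import Counter

# Compare character counts from two short texts by aligning them on
# the characters they share, much as the table join above does.
one = Counter("abca")
two = Counter("abb")
shared = sorted(set(one) & set(two))
side_by_side = {ch: (one[ch], two[ch]) for ch in shared}
print(side_by_side)  # {'a': (2, 1), 'b': (1, 2)}
```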


In some cases, the relationships between quantities allow us to make predictions. This text will explore how to make accurate predictions based on incomplete information and develop methods for combining multiple sources of uncertain information to make decisions.

As an example, let us first use the computer to get us some information that would be really tedious to acquire by hand: the number of characters and the number of periods in each chapter of both books.

chlen_per_hf = Table([[len(s) for s in huck_finn_chapters],
                      np.char.count(huck_finn_chapters, '.')],
                     ["HF Chapter Length", "Number of periods"])

chlen_per_lw = Table([[len(s) for s in little_women_chapters],
                      np.char.count(little_women_chapters, '.')],
                     ["LW Chapter Length", "Number of periods"])
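
The two quantities being tabulated can be sketched on toy chapters (invented strings, not the novels):

```python
# For each "chapter", record its length in characters and its
# number of periods: the two columns tabulated above.
chapters = ["One. Two.", "Only one sentence here."]
lengths = [len(s) for s in chapters]
periods = [s.count('.') for s in chapters]
print(lengths)  # [9, 23]
print(periods)  # [2, 1]
```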

Here are the data for Huckleberry Finn. Each row of the table below corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods, in general; the shorter the chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths and can involve other punctuation such as question marks.

chlen_per_hf

HF Chapter Length   Number of periods
7026                66
11982               117
8529                72
6799                84
8166                91
14550               125
13218               127
22208               249
8081                71
7036                70
... (33 rows omitted)


Here are the corresponding data for Little Women.

chlen_per_lw

LW Chapter Length   Number of periods
21759               189
22148               188
20558               231
25526               195
23395               255
14622               140
14431               131
22476               214
33767               337
18508               185
... (37 rows omitted)

You can see that the chapters of Little Women are in general longer than those of Huckleberry Finn. Let us see if these two simple variables – the length and number of periods in each chapter – can tell us anything more about the two authors. One way for us to do this is to plot both sets of data on the same axes.

In the plot below, there is a dot for each chapter in each book. Blue dots correspond to Huckleberry Finn and yellow dots to Little Women. The horizontal axis represents the number of characters and the vertical axis represents the number of periods.

plots.figure(figsize=(10, 10))
plots.scatter([len(s) for s in huck_finn_chapters],
              np.char.count(huck_finn_chapters, '.'))
plots.scatter([len(s) for s in little_women_chapters],
              np.char.count(little_women_chapters, '.'))

<matplotlib.collections.PathCollection at 0x109ebee10>


The plot shows us that many but not all of the chapters of Little Women are longer than those of Huckleberry Finn, as we had observed by just looking at the numbers. But it also shows us something more. Notice how the blue points are roughly clustered around a straight line, as are the yellow points. Moreover, it looks as though both colors of points might be clustered around the same straight line.

Indeed, it appears from looking at the plot that on average both books tend to have somewhere between 100 and 150 characters between periods. Perhaps these two great 19th century novels were signaling something so very familiar to us now: the 140-character limit of Twitter.
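
The "100 to 150 characters between periods" estimate is just chapter length divided by period count. Using the first rows of the two tables above:

```python
# Average characters per sentence for one chapter of each book,
# computed from figures in the tables above.
hf_chars_per_period = 7026 / 66
lw_chars_per_period = 21759 / 189
print(round(hf_chars_per_period))  # 106
print(round(lw_chars_per_period))  # 115
```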


Causality and Experiments

"These problems are, and will probably ever remain, among the inscrutable secrets of nature. They belong to a class of questions radically inaccessible to the human intelligence."
—The Times of London, September 1849, on how cholera is contracted and spread

Does the death penalty have a deterrent effect? Is chocolate good for you? What causes breast cancer?

All of these questions attempt to assign a cause to an effect. A careful examination of data can help shed light on questions like these. In this section you will learn some of the fundamental concepts involved in establishing causality.

Observation is a key to good science. An observational study is one in which scientists make conclusions based on data that they have observed but had no hand in generating. In data science, many such studies involve observations on a group of individuals, a factor of interest called a treatment, and an outcome measured on each individual.

It is easiest to think of the individuals as people. In a study of whether chocolate is good for the health, the individuals would indeed be people, the treatment would be eating chocolate, and the outcome might be a measure of blood pressure. But individuals in observational studies need not be people. In a study of whether the death penalty has a deterrent effect, the individuals could be the 50 states of the union. A state law allowing the death penalty would be the treatment, and an outcome could be the state’s murder rate.

The fundamental question is whether the treatment has an effect on the outcome. Any relation between the treatment and the outcome is called an association. If the treatment causes the outcome to occur, then the association is causal. Causality is at the heart of all three questions posed at the start of this section. For example, one of the questions was whether chocolate directly causes improvements in health, not just whether there is a relation between chocolate and health.

The establishment of causality often takes place in two stages. First, an association is observed. Next, a more careful analysis leads to a decision about causality.


Observation and Visualization: John Snow and the Broad Street Pump

One of the earliest examples of astute observation eventually leading to the establishment of causality dates back more than 150 years. To get your mind into the right time frame, try to imagine London in the 1850’s. It was the world’s wealthiest city but many of its people were desperately poor. Charles Dickens, then at the height of his fame, was writing about their plight. Disease was rife in the poorer parts of the city, and cholera was among the most feared. It was not yet known that germs cause disease; the leading theory was that “miasmas” were the main culprit. Miasmas manifested themselves as bad smells, and were thought to be invisible poisonous particles arising out of decaying matter. Parts of London did smell very bad, especially in hot weather. To protect themselves against infection, those who could afford to held sweet-smelling things to their noses.

For several years, a doctor by the name of John Snow had been following the devastating waves of cholera that hit England from time to time. The disease arrived suddenly and was almost immediately deadly: people died within a day or two of contracting it, hundreds could die in a week, and the total death toll in a single wave could reach tens of thousands. Snow was skeptical of the miasma theory. He had noticed that while entire households were wiped out by cholera, the people in neighboring houses sometimes remained completely unaffected. As they were breathing the same air—and miasmas—as their neighbors, there was no compelling association between bad smells and the incidence of cholera.

Snow had also noticed that the onset of the disease almost always involved vomiting and diarrhea. He therefore believed that the infection was carried by something people ate or drank, not by the air that they breathed. His prime suspect was water contaminated by sewage.

At the end of August 1854, cholera struck in the overcrowded Soho district of London. As the deaths mounted, Snow recorded them diligently, using a method that went on to become standard in the study of how diseases spread: he drew a map. On a street map of the district, he recorded the location of each death.

Here is Snow’s original map. Each black bar represents one death. The black discs mark the locations of water pumps. The map displays a striking revelation – the deaths are roughly clustered around the Broad Street pump.


Snow studied his map carefully and investigated the apparent anomalies. All of them implicated the Broad Street pump. For example:

- There were deaths in houses that were nearer the Rupert Street pump than the Broad Street pump. Though the Rupert Street pump was closer as the crow flies, it was less convenient to get to because of dead ends and the layout of the streets. The residents in those houses used the Broad Street pump instead.
- There were no deaths in two blocks just east of the pump. That was the location of the Lion Brewery, where the workers drank what they brewed. If they wanted water, the brewery had its own well.
- There were scattered deaths in houses several blocks away from the Broad Street pump. Those were children who drank from the Broad Street pump on their way to school. The pump’s water was known to be cool and refreshing.

The final piece of evidence in support of Snow’s theory was provided by two isolated deaths in the leafy and genteel Hampstead area, quite far from Soho. Snow was puzzled by these until he learned that the deceased were Mrs. Susannah Eley, who had once lived in Broad Street, and her niece. Mrs. Eley had water from the Broad Street pump delivered to her in Hampstead every day. She liked its taste.

Later it was discovered that a cesspit that was just a few feet away from the well of the Broad Street pump had been leaking into the well. Thus the pump’s water was contaminated by sewage from the houses of cholera victims.

Snow used his map to convince local authorities to remove the handle of the Broad Street pump. Though the cholera epidemic was already on the wane when he did so, it is possible that the disabling of the pump prevented many deaths from future waves of the disease.

The removal of the Broad Street pump handle has become the stuff of legend. At the Centers for Disease Control (CDC) in Atlanta, when scientists look for simple answers to questions about epidemics, they sometimes ask each other, “Where is the handle to this pump?”

Snow’s map is one of the earliest and most powerful uses of data visualization. Disease maps of various kinds are now a standard tool for tracking epidemics.

Towards Causality

Though the map gave Snow a strong indication that the cleanliness of the water supply was the key to controlling cholera, he was still a long way from a convincing scientific argument that contaminated water was causing the spread of the disease. To make a more compelling case, he had to use the method of comparison.

Scientists use comparison to identify an association between a treatment and an outcome. They compare the outcomes of a group of individuals who got the treatment (the treatment group) to the outcomes of a group who did not (the control group). For example, researchers today might compare the average murder rate in states that have the death penalty with the average murder rate in states that don’t.

If the results are different, that is evidence for an association. To determine causation, however, even more care is needed.


Snow’s “Grand Experiment”

Encouraged by what he had learned in Soho, Snow completed a more thorough analysis of cholera deaths. For some time, he had been gathering data on cholera deaths in an area of London that was served by two water companies. The Lambeth water company drew its water upriver from where sewage was discharged into the River Thames. Its water was relatively clean. But the Southwark and Vauxhall (S&V) company drew its water below the sewage discharge, and thus its supply was contaminated.

Snow noticed that there was no systematic difference between the people who were supplied by S&V and those supplied by Lambeth. “Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies … there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded …”

The only difference was in the water supply, “one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from impurity.”

Confident that he would be able to arrive at a clear conclusion, Snow summarized his data in the table below.

Supply Area      Number of houses   cholera deaths   deaths per 10,000 houses
S&V              40,046             1,263            315
Lambeth          26,107             98               37
Rest of London   256,423            1,422            59

The numbers pointed accusingly at S&V. The death rate from cholera in the S&V houses was almost ten times the rate in the houses supplied by Lambeth.
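
The "almost ten times" claim can be checked directly from the rate column of Snow's table:

```python
# Ratio of S&V's cholera death rate to Lambeth's, using the
# per-10,000-house rates from Snow's table above.
ratio = 315 / 37
print(round(ratio, 1))  # 8.5
```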


Establishing Causality

In the language developed earlier in the section, you can think of the people in the S&V houses as the treatment group, and those in the Lambeth houses as the control group. A crucial element in Snow’s analysis was that the people in the two groups were comparable to each other, apart from the treatment.

In order to establish whether it was the water supply that was causing cholera, Snow had to compare two groups that were similar to each other in all but one aspect – their water supply. Only then would he be able to ascribe the differences in their outcomes to the water supply. If the two groups had been different in some other way as well, it would have been difficult to point the finger at the water supply as the source of the disease. For example, if the treatment group consisted of factory workers and the control group did not, then differences between the outcomes in the two groups could have been due to the water supply, or to factory work, or both, or to any other characteristic that made the groups different from each other. The final picture would have been much more fuzzy.

Snow’s brilliance lay in identifying two groups that would make his comparison clear. He had set out to establish a causal relation between contaminated water and cholera infection, and to a great extent he succeeded, even though the miasmatists ignored and even ridiculed him. Of course, Snow did not understand the detailed mechanism by which humans contract cholera. That discovery was made in 1883, when the German scientist Robert Koch isolated the Vibrio cholerae, the bacterium that enters the human small intestine and causes cholera.

In fact the Vibrio cholerae had been identified in 1854 by Filippo Pacini in Italy, just about when Snow was analyzing his data in London. Because of the dominance of the miasmatists in Italy, Pacini’s discovery languished unknown. But by the end of the 1800’s, the miasma brigade was in retreat. Subsequent history has vindicated Pacini and John Snow. Snow’s methods led to the development of the field of epidemiology, which is the study of the spread of diseases.

Confounding

Let us now return to more modern times, armed with an important lesson that we have learned along the way:

In an observational study, if the treatment and control groups differ in ways other than the treatment, it is difficult to make conclusions about causality.

An underlying difference between the two groups (other than the treatment) is called a confounding factor, because it might confound you (that is, mess you up) when you try to reach a conclusion.


Example: Coffee and lung cancer. Studies in the 1960’s showed that coffee drinkers had higher rates of lung cancer than those who did not drink coffee. Because of this, some people identified coffee as a cause of lung cancer. But coffee does not cause lung cancer. The analysis contained a confounding factor – smoking. In those days, coffee drinkers were also likely to have been smokers, and smoking does cause lung cancer. Coffee drinking was associated with lung cancer, but it did not cause the disease.

Confounding factors are common in observational studies. Good studies take great care to reduce confounding.


Randomization

An excellent way to avoid confounding is to assign individuals to the treatment and control groups at random, and then administer the treatment to those who were assigned to the treatment group. Randomization keeps the two groups similar apart from the treatment.

If you are able to randomize individuals into the treatment and control groups, you are running a randomized controlled experiment, also known as a randomized controlled trial (RCT). Sometimes, people’s responses in an experiment are influenced by their knowing which group they are in. So you might want to run a blind experiment in which individuals do not know whether they are in the treatment group or the control group. To make this work, you will have to give the control group a placebo, which is something that looks exactly like the treatment but in fact has no effect.
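
Random assignment itself is simple to sketch; a minimal version (hypothetical individual IDs, not data from any study) using Python's random module:

```python
import random

# Randomly split eight hypothetical individuals into equal-sized
# treatment and control groups: the core step of an RCT.
random.seed(8)  # fixed seed so the split is reproducible
individuals = list(range(8))
random.shuffle(individuals)
treatment = individuals[:4]
control = individuals[4:]
```

Because the assignment is random, any systematic difference between the two groups, other than the treatment, is due only to chance.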

Randomized controlled experiments have long been a gold standard in the medical field, for example in establishing whether a new drug works. They are also becoming more commonly used in other fields such as economics.

Example: Welfare subsidies in Mexico. In Mexican villages in the 1990’s, children in poor families were often not enrolled in school. One of the reasons was that the older children could go to work and thus help support the family. Santiago Levy, a minister in the Mexican Ministry of Finance, set out to investigate whether welfare programs could be used to increase school enrollment and improve health conditions. He conducted an RCT on a set of villages, selecting some of them at random to receive a new welfare program called PROGRESA. The program gave money to poor families if their children went to school regularly and the family used preventive health care. More money was given if the children were in secondary school than in primary school, to compensate for the children’s lost wages, and more money was given for girls attending school than for boys. The remaining villages did not get this treatment, and formed the control group. Because of the randomization, there were no confounding factors and it was possible to establish that PROGRESA increased school enrollment. For boys, the enrollment increased from 73% in the control group to 77% in the PROGRESA group. For girls, the increase was even greater, from 67% in the control group to almost 75% in the PROGRESA group. Due to the success of this experiment, the Mexican government supported the program under the new name OPORTUNIDADES, as an investment in a healthy and well educated population.

In some situations it might not be possible to carry out a randomized controlled experiment, even when the aim is to investigate causality. For example, suppose you want to study the effects of alcohol consumption during pregnancy, and you randomly assign some pregnant women to your “alcohol” group. You should not expect cooperation from them if you present them with a drink. In such situations you will almost invariably be conducting an observational study, not an experiment. Be alert for confounding factors.


Endnote

In the terminology that we have developed, John Snow conducted an observational study, not a randomized experiment. But he called his study a "grand experiment" because, as he wrote, "No fewer than three hundred thousand people … were divided into two groups without their choice, and in most cases, without their knowledge …"

Studies such as Snow's are sometimes called "natural experiments." However, true randomization does not simply mean that the treatment and control groups are selected "without their choice."

The method of randomization can be as simple as tossing a coin. It may also be quite a bit more complex. But every method of randomization consists of a sequence of carefully defined steps that allow chances to be specified mathematically. This has two important consequences.

1. It allows us to account, mathematically, for the possibility that randomization produces treatment and control groups that are quite different from each other.

2. It allows us to make precise mathematical statements about differences between the treatment and control groups. This in turn helps us make justifiable conclusions about whether the treatment has any effect.

In this course, you will learn how to conduct and analyze your own randomized experiments. That will involve more detail than has been presented in this section. For now, just focus on the main idea: to try to establish causality, run a randomized controlled experiment if possible. If you are conducting an observational study, you might be able to establish association but not causation. Be extremely careful about confounding factors before making conclusions about causality based on an observational study.

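The coin-toss idea can be sketched in code. This is an illustration of our own (the participant labels and group sizes are made up), using the numpy package that is introduced later in the text:

```python
import numpy as np

# Sketch: randomize 20 hypothetical participants into two groups of 10.
# The seed is fixed only so that the example is reproducible.
rng = np.random.default_rng(0)

participants = np.arange(20)
shuffled = rng.permutation(participants)

treatment = shuffled[:10]
control = shuffled[10:]

print(sorted(treatment))
print(sorted(control))
```

Every participant lands in exactly one group, and chance alone decides which one; that is what makes the mathematical accounting described above possible.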
Terminology

observational study
treatment
outcome
association
causal association
causality
comparison
treatment group
control group
epidemiology
confounding
randomization
randomized controlled experiment
randomized controlled trial (RCT)
blind
placebo

Fun facts

1. John Snow is sometimes called the father of epidemiology, but he was an anesthesiologist by profession. One of his patients was Queen Victoria, who was an early recipient of anesthetics during childbirth.

2. Florence Nightingale, the originator of modern nursing practices and famous for her work in the Crimean War, was a die-hard miasmatist. She had no time for theories about contagion and germs, and was not one for mincing her words. "There is no end to the absurdities connected with this doctrine," she said. "Suffice it to say that in the ordinary sense of the word, there is no proof such as would be admitted in any scientific enquiry that there is any such thing as contagion."

3. A later RCT established that the conditions on which PROGRESA insisted (children going to school, preventive health care) were not necessary to achieve increased enrollment. Just the financial boost of the welfare payments was sufficient.

Good reads

The Strange Case of the Broad Street Pump: John Snow and the Mystery of Cholera by Sandra Hempel, published by our own University of California Press, reads like a whodunit. It was one of the main sources for this section's account of John Snow and his work. A word of warning: some of the contents of the book are stomach-churning.

Poor Economics, the bestseller by Abhijit V. Banerjee and Esther Duflo of MIT, is an accessible and lively account of ways to fight global poverty. It includes numerous examples of RCTs, including the PROGRESA example in this section.


Expressions

Programming can dramatically improve our ability to collect and analyze information about the world, which in turn can lead to discoveries through the kind of careful reasoning demonstrated in the previous section. In data science, the purpose of writing a program is to instruct a computer to carry out the steps of an analysis. Computers cannot study the world on their own. People must describe precisely what steps the computer should take in order to collect and analyze data, and those steps are expressed through programs.

To learn to program, we will begin by learning some of the grammar rules of a programming language. Programs are made up of expressions, which describe to the computer how to combine pieces of data. For example, a multiplication expression consists of a * symbol between two numerical expressions. Expressions, such as 3 * 4, are evaluated by the computer. The value (the result of evaluation) of the last expression in each cell, 12 in this case, is displayed below the cell.

3 * 4

12

The rules of a programming language are rigid. In Python, the * symbol cannot appear twice in a row. The computer will not try to interpret an expression that differs from its prescribed expression structures. Instead, it will show a SyntaxError error. The Syntax of a language is its set of grammar rules, and a SyntaxError indicates that some rule has been violated.

3 * * 4

  File "<ipython-input-4-d90564f70db7>", line 1

    3 * * 4

        ^

SyntaxError: invalid syntax


Small changes to an expression can change its meaning entirely. Below, the space between the *'s has been removed. Because ** appears between two numerical expressions, the expression is a well-formed exponentiation expression (the first number raised to the power of the second: 3 times 3 times 3 times 3). The symbols * and ** are called operators, and the values they combine are called operands.

3 ** 4

81

Common Operators. Data science often involves combining numerical values, and the set of operators in a programming language are designed so that expressions can be used to express any sort of arithmetic. In Python, the following operators are essential.

Expression Type    Operator    Example     Value

Addition           +           2 + 3       5
Subtraction        -           2 - 3       -1
Multiplication     *           2 * 3       6
Division           /           7 / 3       2.66667
Remainder          %           7 % 3       1
Exponentiation     **          2 ** 0.5    1.41421

Python expressions obey the same familiar rules of precedence as in algebra: multiplication and division occur before addition and subtraction. Parentheses can be used to group together smaller expressions within a larger expression.

1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8

16.555555555555557

1 + 2 * ((3 * 4 * 5 / 6) ** 3) + 7 + 8

2016.0


This chapter introduces many types of expressions. Learning to program involves trying out everything you learn in combination, investigating the behavior of the computer. What happens if you divide by zero? What happens if you divide twice in a row? You don't always need to ask an expert (or the Internet); many of these details can be discovered by trying them out yourself.

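For instance, trying the first question directly shows that Python answers with an error rather than a value (a quick illustration of our own, not from the text):

```python
# Dividing by zero raises a ZeroDivisionError instead of producing a value.
try:
    1 / 0
except ZeroDivisionError as err:
    print(err)  # division by zero
```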

Names

Names are given to values in Python using an assignment expression. In an assignment, a name is followed by =, which is followed by another expression. The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions.

a = 10

b = 20

a + b

30

A previously assigned name can be used in the expression to the right of =.

quarter = 2

whole = 4 * quarter

whole

8

However, only the current value of an expression is assigned to a name. If that value changes later, names that were defined in terms of that value will not change automatically.

quarter = 4

whole

8

Names must start with a letter, but can contain both letters and numbers. A name cannot contain a space; instead, it is common to use an underscore character _ to replace each space. Names are only as useful as you make them; it's up to the programmer to choose names that are easy to interpret. Typically, more meaningful names can be invented than a and b. For example, to describe the sales tax on a $5 purchase in Berkeley, CA, the following names clarify the meaning of the various quantities involved.

purchase_price = 5

state_tax_rate = 0.075

county_tax_rate = 0.02

city_tax_rate = 0

sales_tax_rate = state_tax_rate + county_tax_rate + city_tax_rate

sales_tax = purchase_price * sales_tax_rate

sales_tax

0.475


Example: Growth Rates

The relationship between two measurements of the same quantity taken at different times is often expressed as a growth rate. For example, the United States federal government employed 2,766,000 people in 2002 and 2,814,000 people in 2012 (http://www.bls.gov/opub/mlr/2013/article/industry-employment-and-output-projections-to-2022-1.htm). To compute a growth rate, we must first decide which value to treat as the initial amount. For values over time, the earlier value is a natural choice. Then, we divide the difference between the changed and initial amounts by the initial amount.

initial = 2766000

changed = 2814000

(changed - initial) / initial

0.01735357917570499

It is more typical to subtract one from the ratio of the two measurements, which yields the same value.

(changed / initial) - 1

0.017353579175704903

This value is the growth rate over 10 years. A useful property of growth rates is that they don't change even if the values are expressed in different units. So, for example, we can express the same relationship between thousands of people in 2002 and 2012.

initial = 2766

changed = 2814

(changed / initial) - 1

0.017353579175704903


In 10 years, the number of employees of the US Federal Government has increased by only 1.74%. In that time, the total expenditures of the US Federal Government increased from $2.37 trillion to $3.38 trillion in 2012.

initial = 2.37

changed = 3.38

(changed / initial) - 1

0.4261603375527425

A 42.6% increase in the federal budget is much larger than the 1.74% increase in federal employees. In fact, the number of federal employees has grown much more slowly than the population of the United States, which increased 9.21% in the same time period from 287.6 million people in 2002 to 314.1 million in 2012.

initial = 287.6

changed = 314.1

(changed / initial) - 1

0.09214186369958277

A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3 million in 2002 to 11.9 million in 2012, a -22.2% growth rate.

initial = 15.3

changed = 11.9

(changed / initial) - 1

-0.2222222222222222

An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives a much larger ten-year growth rate of 0.41 (or 41%).


1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 - 1

0.410598760621121

This same computation can be expressed using names and exponents for clarity.

annual_growth_rate = 0.035

ten_year_growth_rate = (1 + annual_growth_rate) ** 10 - 1

ten_year_growth_rate

0.410598760621121

Likewise, a ten-year growth rate can be used to compute an equivalent annual growth rate. Below, t is the number of years that have passed between measurements. The following computes the annual growth rate of federal expenditures over the last 10 years.

initial = 2.37

changed = 3.38

t = 10

(changed / initial) ** (1/t) - 1

0.03613617208346853

The total growth over 10 years is equivalent to a 3.6% increase each year.

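The computation above follows a pattern that can be captured once and reused. The function below is a sketch of our own (functions are introduced later in the text), and the name annual_rate is our choice:

```python
def annual_rate(initial, changed, t):
    """The annual growth rate equivalent to the total growth
    from initial to changed over t years."""
    return (changed / initial) ** (1 / t) - 1

# Federal expenditures, as above: about a 3.6% increase each year.
print(annual_rate(2.37, 3.38, 10))
```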

Call Expressions

Call expressions invoke functions, which are named operations. The name of the function appears first, followed by expressions in parentheses.

abs(-12)

12

round(5 - 1.3)

4

max(2, 2 + 3, 4)

5

In this last example, the max function is called on three arguments: 2, 5 (the value of 2 + 3), and 4. The value of each expression within parentheses is passed to the function, and the function returns the final value of the full call expression. The max function can take any number of arguments and returns the maximum.

A few functions are available by default, such as abs and round, but most functions that are built into the Python language are stored in a collection of functions called a module. An import statement is used to provide access to a module, such as math or operator.

import math

import operator

math.sqrt(operator.add(4, 5))

3.0


An equivalent expression could be expressed using the + and ** operators instead.

(4 + 5) ** 0.5

3.0

Operators and call expressions can be used together in an expression. The percent difference between two values is used to compare values for which neither one is obviously initial or changed. For example, in 2014 Florida farms produced 2.72 billion eggs while Iowa farms produced 16.25 billion eggs (http://quickstats.nass.usda.gov/). The percent difference is 100 times the absolute value of the difference between the values, divided by their average. In this case, the difference is larger than the average, and so the percent difference is greater than 100.

florida = 2.72

iowa = 16.25

100 * abs(florida - iowa) / ((florida + iowa) / 2)

142.6462836056932

Learning how different functions behave is an important part of learning a programming language. A Jupyter notebook can assist in remembering the names and effects of different functions. When editing a code cell, press the tab key after typing the beginning of a name to bring up a list of ways to complete that name. For example, press tab after math. to see all of the functions available in the math module. Typing will narrow down the list of options. To learn more about a function, place a ? after its name. For example, typing math.log? will bring up a description of the log function in the math module.

math.log?

log(x[, base])

Return the logarithm of x to the given base.

If the base not specified, returns the natural logarithm (base e) of x.

The square brackets in the example call indicate that an argument is optional. That is, log can be called with either one or two arguments.


math.log(16, 2)

4.0

math.log(16) / math.log(2)

4.0


Data Types

Python includes several types of values that will allow us to compute about more than just numbers.

Type Name                Python Class    Example 1        Example 2

Integer                  int             1                -2
Floating-Point Number    float           1.5              -2.0
Boolean                  bool            True             False
String                   str             'hello world'    "False"

The first two types represent numbers. A "floating-point" number describes its representation by a computer, which is a topic for another text. Practically, the difference between an integer and a floating-point number is that the latter can represent fractions in addition to whole values.

Boolean values, named for the logician George Boole, represent truth and take only two possible values: True and False.

Strings

Strings are the most flexible type of all. They can contain arbitrary text. A string can describe a number or a truth value, as well as any word or sentence. Much of the world's data is text, and text is represented in computers as strings.

The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings together produces another string. This expression is still an addition expression, but it is combining a different type of value.

"data"+"science"

'datascience'

Addition is completely literal; it combines these two strings together without regard for their contents. It doesn't add a space because these are different words; that's up to the programmer (you) to specify.


"data"+"sc"+"ienc"+"e"

'datascience'

Single and double quotes can both be used to create strings: 'hi' and "hi" are identical expressions. Double quotes are often preferred because they allow you to include apostrophes inside of strings.

"Thiswon'tworkwithasingle-quotedstring!"

"Thiswon'tworkwithasingle-quotedstring!"

Why not? Try it out.

The str function returns a string representation of any value. Using this function, strings can be constructed that have embedded values.

"That's"+str(1+1)+''+str(True)

"That's2True"

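As a brief aside (an addition of ours, not covered in this section), the built-in int and float functions go the other direction, converting strings that describe numbers into numeric values:

```python
# str turns values into text; int and float turn text back into numbers.
print(int("12") + 3)       # 15
print(float("1.5") * 2)    # 3.0
print("12" + "3")          # 123 (concatenation, not addition)
```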
String Methods

From an existing string, related strings can be constructed using string methods, which are functions that operate on strings. These methods are called by placing a dot after the string, then calling the function.

For example, the following method generates an uppercased version of a string.

"loud".upper()

'LOUD'

Perhaps the most important method is replace, which replaces all instances of a substring within the string. The replace method takes two arguments, the text to be replaced and its replacement.


'hitchhiker'.replace('hi', 'ma')

'matchmaker'

String methods can also be invoked using variable names, as long as those names are bound to strings. So, for instance, the following two-step process generates the word "degrade" starting from "train" by first creating "ingrain" and then applying a second replacement.

s="train"

t=s.replace('t','ing')

u=t.replace('in','de')

u

'degrade'


Comparisons

Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, 3 is larger than 1 + 1.

3 > 1 + 1

True

The value True indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between 3 and 1 + 1. The full set of common comparison operators are listed below.

Comparison           Operator    True example    False example

Less than            <           2 < 3           2 < 2
Greater than         >           3 > 2           3 > 3
Less than or equal   <=          2 <= 2          3 <= 2
Greater or equal     >=          3 >= 3          2 >= 3
Equal                ==          3 == 3          3 == 2
Not equal            !=          3 != 2          2 != 2

An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express that 1 + 1 is between 1 and 3 using the following expression.

1 < 1 + 1 < 3

True

The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers x and y below. You can try different values of x and y to confirm this relationship.


x = 12

y = 5

min(x, y) <= (x + y) / 2 <= max(x, y)

True

Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.

"Dog">"Catastrophe">"Cat"

True

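One detail worth knowing (an addition of ours, not from the passage above): in this ordering, every uppercase letter comes before every lowercase letter, which can make mixed-case comparisons surprising.

```python
print("Dog" > "Cat")   # True: alphabetical order, as described above
print("dog" > "Dog")   # True: lowercase letters come after uppercase ones
print("Z" < "a")       # True: all uppercase letters precede all lowercase letters
```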

Sequences

Values can be grouped together into collections, which allows programmers to organize those values and refer to all of them with a single name. Separating expressions by commas within square brackets places the values of those expressions into a list, which is a built-in sequential collection. After a comma, it's okay to start a new line. Below, we collect four different temperatures into a list called highs. These are the estimated average daily high temperatures over all land on Earth (in degrees Celsius) for the decades surrounding 1850, 1900, 1950, and 2000, respectively, expressed as deviations from the average absolute high temperature between 1951 and 1980, which was 14.48 degrees. (http://berkeleyearth.lbl.gov/regions/global-land)

baseline_high = 14.48

highs = [baseline_high - 0.880, baseline_high - 0.093,
         baseline_high + 0.105, baseline_high + 0.684]

highs

[13.6, 14.387, 14.585, 15.164]

Lists allow us to pass multiple values into a function using a single name. For instance, the sum function computes the sum of all values in a list, and the len function computes the length of the list. Using them together, we can compute the average of the list.

sum(highs) / len(highs)

14.434000000000001

Tuples. Values can also be grouped together into tuples in addition to lists. A tuple is expressed by separating elements using commas within parentheses: (1, 2, 3). In many cases, lists and tuples are interchangeable. We will use a tuple to hold the average daily low temperatures for the same four time ranges. The only change in the code is the switch to parentheses instead of brackets.


baseline_low = 3.00

lows = (baseline_low - 0.872, baseline_low - 0.629,
        baseline_low - 0.126, baseline_low + 0.728)

lows

(2.128, 2.371, 2.874, 3.7279999999999998)

The complete chart of daily high and low temperatures appears below.

[Chart: Mean of Daily High Temperature]

[Chart: Mean of Daily Low Temperature]

Arrays

Many experiments and datasets involve multiple values of the same type. An array is a collection of values that all have the same type. The numpy package, abbreviated np in programs, provides Python programmers with convenient and powerful functions for creating and manipulating arrays.

import numpy as np

An array is created using the np.array function, which takes a list or tuple as an argument. Here, we create arrays of average daily high and low temperatures for the decades surrounding 1850, 1900, 1950, and 2000.

baseline_high = 14.48

highs = np.array([baseline_high - 0.880,
                  baseline_high - 0.093,
                  baseline_high + 0.105,
                  baseline_high + 0.684])

highs

array([13.6, 14.387, 14.585, 15.164])

baseline_low = 3.00

lows = np.array((baseline_low - 0.872, baseline_low - 0.629,
                 baseline_low - 0.126, baseline_low + 0.728))

lows

array([2.128, 2.371, 2.874, 3.728])

Arrays differ from lists and tuples because they can be used in arithmetic expressions to compute over their contents. When an array is combined with a single number, that number is combined with each element of the array. Therefore, we can convert all of these temperatures to Fahrenheit by writing the familiar conversion formula.


(9/5) * highs + 32

array([56.48, 57.8966, 58.253, 59.2952])

When two arrays are combined together using an arithmetic operator, their individual values are combined. For example, we can compute the daily temperature ranges by subtracting.

highs - lows

array([11.472, 12.016, 11.711, 11.436])

Arrays also have methods, which are functions that operate on the array values. The mean of a collection of numbers is its average value: the sum divided by the length. Each pair of parentheses in the example below is part of a call expression; it's calling a function with no arguments to perform a computation on the array called highs.

[highs.size, highs.sum(), highs.mean()]

[4, 57.736000000000004, 14.434000000000001]

It's common that small rounding errors appear in arithmetic on a computer. For instance, the mean (average) above is in fact 14.434, but an extra 0.000000000000001 has been added during the computation of the average. Arithmetic rounding errors are an important topic in scientific computing, but won't be a focus of this text. Everyone who works with data must be aware that some operations will be approximated so that values can be stored and manipulated efficiently.

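A classic stand-alone illustration of such rounding (a standard Python example, not specific to arrays):

```python
# 0.1 and 0.2 have no exact binary representation, so their sum
# is not exactly 0.3.
print(0.1 + 0.2)                        # 0.30000000000000004
print(0.1 + 0.2 == 0.3)                 # False
print(abs((0.1 + 0.2) - 0.3) < 1e-9)    # True: compare with a tolerance
```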
Functions on Arrays

In addition to basic arithmetic and methods such as min and max, manipulating arrays often involves calling functions that are part of the np package. For example, the diff function computes the difference between each adjacent pair of elements in an array. The first element of the diff is the second element minus the first.

np.diff(highs)

array([0.787, 0.198, 0.579])

The full Numpy reference lists these functions exhaustively, but only a small subset are used commonly for data processing applications. These are grouped into different packages within np. Learning this vocabulary is an important part of learning the Python language, so refer back to this list often as you work through examples and problems.

Each of these functions takes an array as an argument and returns a single value.

Function            Description

np.prod             Multiply all elements together
np.sum              Add all elements together
np.all              Test whether all elements are true values (non-zero numbers are true)
np.any              Test whether any elements are true values (non-zero numbers are true)
np.count_nonzero    Count the number of non-zero elements

Each of these functions takes an array as an argument and returns an array of values.

Function      Description

np.diff       Difference between adjacent elements
np.round      Round each number to the nearest integer (whole number)
np.cumprod    A cumulative product: for each element, multiply all elements so far
np.cumsum     A cumulative sum: for each element, add all elements so far
np.exp        Exponentiate each element
np.log        Take the natural logarithm of each element
np.sqrt       Take the square root of each element
np.sort       Sort the elements

Each of these functions takes an array of strings and returns an array.


Function             Description

np.char.lower        Lowercase each element
np.char.upper        Uppercase each element
np.char.strip        Remove spaces at the beginning or end of each element
np.char.isalpha      Whether each element is only letters (no numbers or symbols)
np.char.isnumeric    Whether each element is only numeric (no letters)

Each of these functions takes both an array of strings and a search string; each returns an array.

Function              Description

np.char.count         Count the number of times a search string appears among the elements of an array
np.char.find          The position within each element that a search string is found first
np.char.rfind         The position within each element that a search string is found last
np.char.startswith    Whether each element starts with the search string

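A few of these can be tried out on a small example (the array of words here is made up for illustration):

```python
import numpy as np

words = np.array(['spam', 'eggs', 'bacon'])

print(np.char.upper(words))             # ['SPAM' 'EGGS' 'BACON']
print(np.char.count(words, 'a'))        # [1 0 1]
print(np.char.startswith(words, 'b'))   # [False False  True]
```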

Ranges

A range is an array of numbers in increasing or decreasing order, each separated by a regular interval. Ranges are defined using the np.arange function, which takes either one, two, or three arguments.

np.arange(end): An array starting with 0 of increasing integers up to end

np.arange(start, end): An array of increasing integers from start up to end

np.arange(start, end, step): A range with step between each pair of consecutive values

A range always includes its start value, but does not include its end value. The step can be either positive or negative and may be a whole number or a fraction.

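Each of these rules can be checked with a quick example (the particular values are our own):

```python
import numpy as np

print(np.arange(4))           # starts at 0, excludes the end value: 0 1 2 3
print(np.arange(0, 1, 0.25))  # a fractional step: 0., 0.25, 0.5, 0.75
print(np.arange(5, 0, -2))    # a negative step counts down: 5 3 1
```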
by_four = np.arange(1, 100, 4)

by_four

array([ 1,  5,  9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, 65,
       69, 73, 77, 81, 85, 89, 93, 97])

Ranges have many uses. For instance, a range can be used to compute part of the Leibniz formula for π, which is typically written as

π = 4 × (1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + ...)

4 * sum(1/by_four - 1/(by_four + 2))

3.1215946525910097

The Birthday Problem

There are k students in a class. What is the chance that at least two of the students have the same birthday?

Assumptions


1. No leap years; every year has 365 days
2. Births are distributed evenly throughout the year
3. No student's birthday is affected by any other (e.g. twins)

Let's start with an easy case: k is 4. We'll first find the chance that all four people have different birthdays.

all_different = (364/365) * (363/365) * (362/365)

The chance that there is at least one pair with the same birthday is equivalent to the chance that the birthdays are not all different. Given the chance of some event occurring, the chance that the event does not occur is one minus the chance that it does. With only 4 people, the chance that any two have the same birthday is less than 2%.

1 - all_different

0.016355912466550215

Using a range, we can express this same computation compactly. We begin with the numerators of each factor.

k = 4

numerators = np.arange(364, 365 - k, -1)

numerators

array([364, 363, 362])

Then, we divide each numerator by 365 to form the factors, multiply them all together, and subtract from 1.

1 - np.prod(numerators / 365)

0.016355912466550215

If k is 40, the chance of two birthdays being the same is much higher: almost 90%!


k = 40

numerators = np.arange(364, 365 - k, -1)

1 - np.prod(numerators / 365)

0.89123180981794903

Using ranges, we can investigate how this chance changes as k increases. The np.cumprod function computes the cumulative product of an array. That is, it computes a new array that has the same length as its input, but the ith element is the product of the first i terms. Below, the fourth term of the result is 1 * 2 * 3 * 4.

ten = np.arange(1, 9)

np.cumprod(ten)

array([1, 2, 6, 24, 120, 720, 5040, 40320])

The following cell computes the chance of matching birthdays for every class size from 2 up to 365. Scrolling through the result, you will see that as k increases, the chance of matching birthdays reaches 1 long before the end of the array. In fact, for any k smaller than 365, there is a chance that all k students in a class can have different birthdays. The chance is so small, however, that the difference from 1 is rounded away by the computer.

numerators = np.arange(364, 0, -1)

chances = 1 - np.cumprod(numerators / 365)

chances

array([0.00273973,0.00820417,0.01635591,0.02713557,0.04046248,

0.0562357,0.07433529,0.09462383,0.11694818,0.14114138,

0.16702479,0.19441028,0.22310251,0.25290132,0.28360401,

0.31500767,0.34691142,0.37911853,0.41143838,0.44368834,

0.47569531,0.50729723,0.53834426,0.5686997,0.59824082,

0.62685928,0.65446147,0.68096854,0.70631624,0.73045463,

0.75334753,0.77497185,0.79531686,0.81438324,0.83218211,

0.84873401,0.86406782,0.87821966,0.89123181,0.90315161,

0.91403047,0.92392286,0.93288537,0.9409759,0.94825284,

0.9547744,0.96059797,0.96577961,0.97037358,0.97443199,


0.97800451,0.98113811,0.98387696,0.98626229,0.98833235,

0.99012246,0.99166498,0.99298945,0.99412266,0.9950888,

0.99590957,0.99660439,0.99719048,0.99768311,0.9980957,

0.99844004,0.99872639,0.99896367,0.99915958,0.99932075,

0.99945288,0.99956081,0.99964864,0.99971988,0.99977744,

0.99982378,0.99986095,0.99989067,0.99991433,0.99993311,

0.99994795,0.99995965,0.99996882,0.999976,0.99998159,

0.99998593,0.99998928,0.99999186,0.99999385,0.99999537,

0.99999652,0.9999974,0.99999806,0.99999856,0.99999893,

0.99999922,0.99999942,0.99999958,0.99999969,0.99999978,

0.99999984,0.99999988,0.99999992,0.99999994,0.99999996,

0.99999997,0.99999998,0.99999998,0.99999999,0.99999999,

0.99999999,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,


1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.])

The item method of an array allows us to select a particular element from the array by its position. The starting position of an array is item(0), so finding the chance of matching birthdays in a 40-person class involves extracting item(40 - 2) (because the starting item is for a 2-person class).

chances.item(40 - 2)

0.891231809817949

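A natural follow-up question (an aside of our own, using np.argmax, which returns the position of the first True value in an array): what is the smallest class in which a matching birthday is more likely than not? Recomputing the chances array and searching it gives the well-known answer of 23.

```python
import numpy as np

# Recompute the chances for class sizes 2 through 365, as above.
numerators = np.arange(364, 0, -1)
chances = 1 - np.cumprod(numerators / 365)

# Position 0 corresponds to a class of 2, so add 2 to the position.
smallest = int(np.argmax(chances > 0.5)) + 2
print(smallest)  # 23
```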

Tables

Tables are a fundamental object type for representing data sets. A table can be viewed in two ways. Tables are a sequence of named columns that each describe a single aspect of all entries in a data set. Tables are also a sequence of rows that each contain all information about a single entry in a data set.

In order to use tables, import all of the module called datascience, a module created for this text.

from datascience import *

Empty tables can be created using the Table function, which optionally takes a list of column labels. The with_row and with_rows methods of a table return new tables with one or more additional rows. For example, we could create a table of Odd and Even numbers.

Table(['Odd', 'Even']).with_row([3, 4])

Odd    Even
3      4

Table(['Odd', 'Even']).with_rows([[3, 4], [5, 6], [7, 8]])

Odd    Even
3      4
5      6
7      8

Tables can also be extended with additional columns. The with_column and with_columns methods return new tables with additional labeled columns. Below, we begin each example with an empty table that has no columns.

Table().with_column('Odd', [3, 5, 7])

Odd
3
5
7

Table().with_columns([
    'Odd', [3, 5, 7],
    'Even', [4, 6, 8]
])

Odd    Even
3      4
5      6
7      8

Tables are more typically created from files that contain comma-separated values, called CSV files. The file below contains "Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States."

census_url = 'http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.csv'

full_census_table = Table.read_table(census_url)

full_census_table

SEX    AGE    CENSUS2010POP    ESTIMATESBASE2010    POPESTIMATE2010
0      0      3944153          3944160              3951330
0      1      3978070          3978090              3957888
0      2      4096929          4096939              4090862
0      3      4119040          4119051              4111920
0      4      4063170          4063186              4077552
0      5      4056858          4056872              4064653
0      6      4066381          4066412              4073013
0      7      4030579          4030594              4043047
0      8      4046486          4046497              4025604
0      9      4148353          4148369              4125415

... (296 rows omitted)

A description of the table appears online. The SEX column contains numeric codes: 0 stands for the total, 1 for male, and 2 for female. The AGE column contains ages, but the special value 999 is a sum of the total population. The rest of the columns contain estimates of the US population.

Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. We can select only a subset of the columns using the select method.

partial_census_table = full_census_table.select(['SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2014'])

partial_census_table

SEX    AGE    POPESTIMATE2010    POPESTIMATE2014
0      0      3951330            3948350
0      1      3957888            3962123
0      2      4090862            3957772
0      3      4111920            4005190
0      4      4077552            4003448
0      5      4064653            4004858
0      6      4073013            4134352
0      7      4043047            4154000
0      8      4025604            4119524
0      9      4125415            4106832

... (296 rows omitted)

The relabeled method creates an alternative version of the table that replaces a column label. Using this method, we can simplify the labels of the selected columns.

simple = partial_census_table.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2014', '2014')

simple


SEX    AGE    2010       2014
0      0      3951330    3948350
0      1      3957888    3962123
0      2      4090862    3957772
0      3      4111920    4005190
0      4      4077552    4003448
0      5      4064653    4004858
0      6      4073013    4134352
0      7      4043047    4154000
0      8      4025604    4119524
0      9      4125415    4106832

... (296 rows omitted)

The column method returns an array of the values in a particular column. Each column of a table is an array of the same length, and so columns can be combined using arithmetic.

simple.column('2014') - simple.column('2010')

array([-2980, 4235, -133090, ..., 6717, 13410, 4662996])

A table with an additional column can be created from an existing table using the with_column method, which takes a label and a sequence of values as arguments. There must be either as many values as there are existing rows in the table or a single value (which fills the column with that value).

change = simple.column('2014') - simple.column('2010')

annual_growth_rate = (simple.column('2014') / simple.column('2010')) ** (1/4) - 1

census = simple.with_columns(['Change', change, 'Growth', annual_growth_rate])

census


SEX    AGE    2010       2014       Change     Growth
0      0      3951330    3948350    -2980      -0.000188597
0      1      3957888    3962123    4235       0.000267397
0      2      4090862    3957772    -133090    -0.00823453
0      3      4111920    4005190    -106730    -0.0065532
0      4      4077552    4003448    -74104     -0.00457471
0      5      4064653    4004858    -59795     -0.00369821
0      6      4073013    4134352    61339      0.00374389
0      7      4043047    4154000    110953     0.00679123
0      8      4025604    4119524    93920      0.00578232
0      9      4125415    4106832    -18583     -0.00112804

... (296 rows omitted)

Although the columns of this table are simply arrays of numbers, the format of those numbers can be changed to improve the interpretability of the table. The set_format method takes Formatter objects, which exist for dates (DateFormatter), currencies (CurrencyFormatter), numbers, and percentages.

census.set_format('Growth',PercentFormatter)

census.set_format(['2010','2014','Change'],NumberFormatter)

SEX AGE 2010 2014 Change Growth

0 0 3,951,330 3,948,350 -2,980 -0.02%

0 1 3,957,888 3,962,123 4,235 0.03%

0 2 4,090,862 3,957,772 -133,090 -0.82%

0 3 4,111,920 4,005,190 -106,730 -0.66%

0 4 4,077,552 4,003,448 -74,104 -0.46%

0 5 4,064,653 4,004,858 -59,795 -0.37%

0 6 4,073,013 4,134,352 61,339 0.37%

0 7 4,043,047 4,154,000 110,953 0.68%

0 8 4,025,604 4,119,524 93,920 0.58%

0 9 4,125,415 4,106,832 -18,583 -0.11%

...(296rowsomitted)


The information in a table can be accessed in many ways. The attributes demonstrated below can be used for any table.

census.labels

('SEX','AGE','2010','2014','Change','Growth')

census.num_columns

6

census.num_rows

306

census.row(5)

Row(SEX=0,AGE=5,2010=4064653,2014=4004858,Change=-59795,Growth=-0.0036982077956316806)

census.column(2)

array([3951330,3957888,4090862,...,26074,45058,

157257573])

An individual item in a table can be accessed via a row or a column.

census.row(0).item(2)

3951330


census.column(2).item(0)

3951330

Let's take a look at the growth rates of the total number of males and females by selecting only the rows that sum over all ages. This sum is expressed with the special value 999, according to this dataset description.

census.where('AGE',999)

SEX AGE 2010 2014 Change Growth

0 999 309,347,057 318,857,056 9,509,999 0.76%

1 999 152,089,484 156,936,487 4,847,003 0.79%

2 999 157,257,573 161,920,569 4,662,996 0.73%

What ages of males are driving this rapid growth? We can first filter the census table to keep only the male entries, then sort by growth rate in decreasing order.

males=census.where('SEX',1)

males.sort('Growth',descending=True)

SEX AGE 2010 2014 Change Growth

1 99 6,104 9,037 2,933 10.31%

1 100 9,351 13,729 4,378 10.08%

1 98 9,504 13,649 4,145 9.47%

1 93 60,182 85,980 25,798 9.33%

1 96 22,022 31,235 9,213 9.13%

1 94 43,828 62,130 18,302 9.12%

1 97 14,775 20,479 5,704 8.50%

1 95 31,736 42,824 11,088 7.78%

1 91 104,291 138,080 33,789 7.27%

1 92 83,462 109,873 26,411 7.12%

...(92rowsomitted)


The fact that there are more men with AGE of 100 than 99 looks suspicious; shouldn't there be fewer? A careful look at the description of the dataset reveals that the 100 category actually includes all men who are 100 or older. The growth rates in men at these very old ages could have several explanations, such as a large influx from another country, but the most natural explanation is that people are simply living longer in 2014 than in 2010.

The where method can also take an array of boolean values, constructed by applying some comparison operator to a column of the table. For example, we can find all of the age groups among males for which the absolute Change is substantial. The show method displays all rows without abbreviating.

males.where(males.column('Change')>200000).show()

SEX AGE 2010 2014 Change Growth

1 23 2,151,095 2,399,883 248,788 2.77%

1 24 2,161,380 2,391,398 230,018 2.56%

1 34 1,908,761 2,192,455 283,694 3.52%

1 57 1,910,028 2,110,149 200,121 2.52%

1 59 1,779,504 2,006,900 227,396 3.05%

1 64 1,291,843 1,661,474 369,631 6.49%

1 65 1,272,693 1,607,688 334,995 6.02%

1 66 1,239,805 1,589,127 349,322 6.40%

1 67 1,270,148 1,653,257 383,109 6.81%

1 71 903,263 1,169,356 266,093 6.67%

1 999 152,089,484 156,936,487 4,847,003 0.79%

By far the largest changes are clumped together in the 64-67 range. In 2014, these people would have been born between 1947 and 1951, the height of the post-WWII baby boom in the United States.

It is possible to specify multiple conditions using the functions np.logical_and and np.logical_or. When two conditions are combined with logical_and, both must be true for a row to be retained. When conditions are combined with logical_or, either one of them must be true for a row to be retained. Here are two different ways to select 18 and 19 year olds.
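As a minimal illustration of how these functions combine boolean arrays, consider a small hypothetical array of ages (separate from the census table):

```python
import numpy as np

ages = np.array([17, 18, 19, 20])

# True only where both comparisons hold for the same element
both_conditions = np.logical_and(ages >= 18, ages <= 19)
print(both_conditions)  # True for the entries 18 and 19 only

# True where at least one comparison holds; here the two selections agree
either_condition = np.logical_or(ages == 18, ages == 19)
print(either_condition)  # also True for 18 and 19 only
```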


both=census.where(census.column('SEX')!=0)

both.where(np.logical_or(both.column('AGE')==18,

both.column('AGE')==19))

SEX AGE 2010 2014 Change Growth

1 18 2,305,733 2,165,062 -140,671 -1.56%

1 19 2,334,906 2,220,790 -114,116 -1.24%

2 18 2,185,272 2,060,528 -124,744 -1.46%

2 19 2,236,479 2,105,604 -130,875 -1.50%

both.where(np.logical_and(both.column('AGE')>=18,

both.column('AGE')<=19))

SEX AGE 2010 2014 Change Growth

1 18 2,305,733 2,165,062 -140,671 -1.56%

1 19 2,334,906 2,220,790 -114,116 -1.24%

2 18 2,185,272 2,060,528 -124,744 -1.46%

2 19 2,236,479 2,105,604 -130,875 -1.50%

Here, we observe a rather dramatic decrease in the number of 18 and 19 year olds in the United States; the children of the baby boom generation are now in their 20's, leaving fewer teenagers in America.

Aggregation and Grouping. This particular dataset includes entries for sums across all ages and sexes, using the special values 999 and 0 respectively. However, if these rows did not exist, we would be able to aggregate the remaining entries in order to compute those same results.

both.where('AGE',999).select(['SEX','2014'])

SEX 2014

1 156,936,487

2 161,920,569


no_sums = both.select(['SEX', 'AGE', '2014']).where(both.column('AGE') != 999)

females=no_sums.where('SEX',2).select(['AGE','2014'])

females

AGE 2014

0 1,930,493

1 1,938,870

2 1,935,270

3 1,956,572

4 1,959,950

5 1,961,391

6 2,024,024

7 2,031,760

8 2,014,402

9 2,009,560

...(91rowsomitted)

sum(females['2014'])

161920569

Some columns express categories, such as the sex or age group in the case of the census table. Aggregation can also be performed on every value for a category using the group and groups methods. The group method takes a single column (or column name) as an argument and collects all values associated with each unique value in that column.

The second argument to group, the name of a function, specifies how to aggregate the resulting values. For example, the sum function can be used to add together the populations for each age. In this result, the SEX sum column is meaningless because the values were simply codes for different categories. However, the 2014 sum column does in fact contain the total number across all sexes for each age category.

no_sums.select(['AGE','2014']).group('AGE',sum)


AGE 2014sum

0 3948350

1 3962123

2 3957772

3 4005190

4 4003448

5 4004858

6 4134352

7 4154000

8 4119524

9 4106832

...(91rowsomitted)

Joining Tables. There are many situations in data analysis in which two different rows need to be considered together. Two tables can be joined into one, an operation that creates one long row out of two matching rows. These matching rows can be from the same table or different tables.

For example, we might want to estimate which age categories are expected to change significantly in the future. Someone who is 14 years old in 2014 will be 20 years old in 2020. Therefore, one estimate of the number of 20 year olds in 2020 is the number of 14 year olds in 2014. Between the ages of 1 and 50, annual mortality rates are very low (less than 0.5% for men and less than 0.3% for women [1]). Immigration also affects population changes, but for now we will ignore its influence. Let's consider just females in this analysis.

females['AGE+6']=females['AGE']+6

females


AGE 2014 AGE+6

0 1,930,493 6

1 1,938,870 7

2 1,935,270 8

3 1,956,572 9

4 1,959,950 10

5 1,961,391 11

6 2,024,024 12

7 2,031,760 13

8 2,014,402 14

9 2,009,560 15

...(91rowsomitted)

In order to relate the age in 2014 to the age in 2020, we will join this table with itself, matching values in the AGE column with values in the AGE+6 column (the age in 2020).

joined=females.join('AGE',females,'AGE+6')

joined

AGE 2014 AGE+6 AGE_2 2014_2

6 2024024 12 0 1930493

7 2031760 13 1 1938870

8 2014402 14 2 1935270

9 2009560 15 3 1956572

10 2015380 16 4 1959950

11 2001949 17 5 1961391

12 1993547 18 6 2024024

13 2041159 19 7 2031760

14 2068252 20 8 2014402

15 2035299 21 9 2009560

...(85rowsomitted)


The resulting table has five columns. The first three are the same as before. The two new columns are the values for AGE and 2014 that appeared in a different row, the one in which that AGE appeared in the AGE+6 column. For instance, the first row contains the number of 6 year olds in 2014 and an estimate of the number of 6 year olds in 2020 (who were 0 in 2014). Some relabeling and selecting makes this table clearer.

future = joined.select(['AGE', '2014', '2014_2']).relabeled('2014_2', '2020 (est)')

future

AGE 2014 2020(est)

6 2024024 1930493

7 2031760 1938870

8 2014402 1935270

9 2009560 1956572

10 2015380 1959950

11 2001949 1961391

12 1993547 2024024

13 2041159 2031760

14 2068252 2014402

15 2035299 2009560

...(85rowsomitted)

Adding a Change column and sorting by that change describes some of the major changes in age categories that we can expect in the United States for people under 50. According to this simplistic analysis, there will be substantially more people in their late 30's by 2020.

predictions=future.where(future['AGE']<50)

predictions['Change'] = predictions['2020 (est)'] - predictions['2014']

predictions.sort('Change',descending=True)


AGE 2014 2020(est) Change

40 1940627 2170440 229813

38 1936610 2154232 217622

30 2110672 2301237 190565

37 1995155 2148981 153826

39 1993913 2135416 141503

29 2169563 2298701 129138

35 2046713 2169563 122850

36 2009423 2110672 101249

28 2144666 2244480 99814

41 1977497 2046713 69216

...(34rowsomitted)


Functions

We are building up a useful inventory of techniques for identifying patterns and themes in a dataset. Sorting and grouping rows of a table can focus our attention. The next approach to analysis we will consider involves grouping rows of a table by arbitrary criteria. To do so, we will explore two core features of the Python programming language: function definition and conditional statements.

We have used functions extensively already in this text, but never defined a function of our own. The purpose of defining a function is to give a name to a computational process that may be applied multiple times. There are many situations in computing that require repeated computation. For example, it is often the case that we will want to perform the same manipulation on every value in a column.

A function is defined in Python using a def statement, which is a multi-line statement that begins with a header line giving the name of the function and names for the arguments of the function. The rest of the def statement, called the body, must be indented below the header.

A function expresses a relationship between its inputs (called arguments) and its outputs (called return values). The number of arguments required to call a function is the number of names that appear within parentheses in the def statement header. The values that are returned depend on the body. Whenever a function is called, its body is executed. Whenever a return statement within the body is executed, the call to the function completes, and the value of the return expression is returned as the value of the function call.
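These rules apply to functions of any number of arguments. As a small hypothetical example, here is a function with two argument names in its header:

```python
def difference(a, b):
    # Both argument names from the header are available in the body;
    # executing this return statement completes the call
    return a - b

difference(10, 4)  # evaluates to 6
```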

The definition of the percent function below multiplies a number by 100 and rounds the result to two decimal places.

def percent(x):
    return round(100 * x, 2)

The primary difference between defining a percent function and simply evaluating its return expression, round(100 * x, 2), is that when a function is defined, its return expression is not immediately evaluated. It cannot be, because the value for x is not yet defined. Instead, the return expression is evaluated whenever this percent function is called, by placing parentheses after the name percent and placing an expression to compute its argument in parentheses.


percent(1/6)

16.67

percent(1/6000)

0.02

percent(1/60000)

0.0

The three expressions above are all call expressions. In the first, the value of 1/6 is computed and then passed as the argument named x to the percent function. When the percent function is called in this way, its body is executed. The body of percent has only a single line: return round(100 * x, 2). Executing this return statement completes execution of the percent function's body and computes the value of the call expression.

The same result is computed by passing a named value as an argument. The percent function does not know or care how its argument is computed or stored; its only job is to execute its own body using the arguments passed to it.

sixth=1/6

percent(sixth)

16.67

Conditional Statements. The body of a function can have more than one line and more than one return statement. A conditional statement is a multi-line statement that allows Python to choose among different alternatives based on the truth value of an expression. While conditional statements can appear anywhere, they appear most often within the body of a function in order to express alternative behavior depending on argument values.


A conditional statement always begins with an if header, which is a single line followed by an indented body. The body is only executed if the expression directly following if (called the if expression) evaluates to a true value. If the if expression evaluates to a false value, then the body of the if is skipped.

For example, we can improve our percent function so that it doesn't round very small numbers to zero so readily. The behavior of percent(1/6) is unchanged, but percent(1/60000) provides a more useful result.

def percent(x):
    if x < 0.00005:
        return 100 * x
    return round(100 * x, 2)

percent(1/6)

16.67

percent(1/6000)

0.02

percent(1/60000)

0.0016666666666666668

A conditional statement can also have multiple clauses with multiple bodies, and only one of those bodies can ever be executed. The general format of a multi-clause conditional statement appears below.


if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>

There is always exactly one if clause, but there can be any number of elif clauses. Python will evaluate the if and elif expressions in the headers in order until one is found that is a true value, then execute the corresponding body. The else clause is optional. When an else header is provided, its else body is executed only if none of the header expressions of the previous clauses are true. The else clause must always come at the end (or not at all).
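As a small, self-contained illustration of this format (a hypothetical sign function, separate from percent), note that exactly one of the three bodies is executed for any argument:

```python
def sign(x):
    # The headers are checked in order; the first true one wins,
    # and the else body runs only if neither comparison is true
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

sign(-2)  # 'negative'
```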

Let us continue to refine our percent function. Perhaps for some analysis, any value below 1e-8 should be considered close enough to 0 that it can be ignored. The following function definition handles this case as well.

def percent(x):
    if x < 1e-8:
        return 0.0
    elif x < 0.00005:
        return 100 * x
    else:
        return round(100 * x, 2)

percent(1/6)

16.67

percent(1/6000)

0.02


percent(1/60000)

0.0016666666666666668

percent(1/60000000000)

0.0

A well-composed function has a name that evokes its behavior, as well as a docstring: a description of its behavior and expectations about its arguments. The docstring can also show example calls to the function, where the call is preceded by >>>.

A docstring can be any string that immediately follows the header line of a def statement. Docstrings are typically defined using triple quotation marks at the start and end, which allows the string to span multiple lines. The first line is conventionally a complete but short description of the function, while following lines provide further guidance to future users of the function.

A more complete definition of percent that includes a docstring appears below.


def percent(x):
    """Convert x to a percentage by multiplying by 100.

    Percentages are conventionally rounded to two decimal places,
    but precision is retained for any x above 1e-8 that would
    otherwise round to 0.

    >>> percent(1/6)
    16.67
    >>> percent(1/6000)
    0.02
    >>> percent(1/60000)
    0.0016666666666666668
    >>> percent(1/60000000000)
    0.0
    """
    if x < 1e-8:
        return 0.0
    elif x < 0.00005:
        return 100 * x
    else:
        return round(100 * x, 2)


Functions and Tables

In this section, we will continue to use the US Census dataset. We will focus only on the 2014 population estimate.

census_2014 = census.select(['SEX', 'AGE', '2014']).where(census.column('AGE') != 999)

census_2014

SEX AGE 2014

0 0 3,948,350

0 1 3,962,123

0 2 3,957,772

0 3 4,005,190

0 4 4,003,448

0 5 4,004,858

0 6 4,134,352

0 7 4,154,000

0 8 4,119,524

0 9 4,106,832

...(293rowsomitted)

Functions can be used to compute new columns in a table based on existing column values. For example, we can transform the codes in the SEX column to strings that are easier to interpret.

def male_female(code):
    if code == 0:
        return 'Total'
    elif code == 1:
        return 'Male'
    elif code == 2:
        return 'Female'


This function takes an individual code (0, 1, or 2) and returns a string that describes its meaning.

male_female(0)

'Total'

male_female(1)

'Male'

male_female(2)

'Female'

We could also transform ages into age categories.

def age_group(age):
    if age < 2:
        return 'Baby'
    elif age < 13:
        return 'Child'
    elif age < 20:
        return 'Teen'
    else:
        return 'Adult'

age_group(15)

'Teen'

Apply. The apply method of a table calls a function on each element of a column, forming a new array of return values. To indicate which function to call, just name it (without quotation marks). The name of the column of input values must still appear within quotation marks.

census_2014.apply(male_female,'SEX')

array(['Total','Total','Total',...,'Female','Female','Female'],

dtype='<U6')

This array, which has the same length as the original SEX column of the population table, can be used as the values in a new column called Male/Female alongside the existing AGE and Population columns.

population=Table().with_columns([

'Male/Female',census_2014.apply(male_female,'SEX'),

'AgeGroup',census_2014.apply(age_group,'AGE'),

'Population',census_2014.column('2014')

])

population

Male/Female AgeGroup Population

Total Baby 3948350

Total Baby 3962123

Total Child 3957772

Total Child 4005190

Total Child 4003448

Total Child 4004858

Total Child 4134352

Total Child 4154000

Total Child 4119524

Total Child 4106832

...(293rowsomitted)

Groups. The group method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.


population.group('AgeGroup')

AgeGroup count

Adult 243

Baby 6

Child 33

Teen 21

The optional second argument names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum will sum up the populations in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.

totals = population.where(0, 'Total').select(['AgeGroup', 'Population'])

totals

AgeGroup Population

Baby 3948350

Baby 3962123

Child 3957772

Child 4005190

Child 4003448

Child 4004858

Child 4134352

Child 4154000

Child 4119524

Child 4106832

...(91rowsomitted)

totals.group('AgeGroup',sum)


AgeGroup Populationsum

Adult 236721454

Baby 7910473

Child 44755656

Teen 29469473

The groups method behaves in the same way, but accepts a list of columns as its first argument. The resulting table has one row for every unique combination of values that appear together in the grouped columns. Again, a single argument (a list, in this case) gives row counts.

population.groups(['Male/Female','AgeGroup'])

Male/Female AgeGroup count

Female Adult 81

Female Baby 2

Female Child 11

Female Teen 7

Male Adult 81

Male Baby 2

Male Child 11

Male Teen 7

Total Adult 81

Total Baby 2

...(2rowsomitted)

A second argument to groups aggregates all other columns that do not appear in the list of grouped columns.

population.groups(['Male/Female','AgeGroup'],sum)


Male/Female AgeGroup Populationsum

Female Adult 121754366

Female Baby 3869363

Female Child 21903805

Female Teen 14393035

Male Adult 114967088

Male Baby 4041110

Male Child 22851851

Male Teen 15076438

Total Adult 236721454

Total Baby 7910473

...(2rowsomitted)

Pivot. The pivot method is closely related to the groups method: it groups together rows that share a combination of values. It differs from groups because it organizes the resulting values in a grid. The first argument to pivot is a column that contains the values that will be used to form new columns in the result. The second argument is a column used for grouping rows. The result gives the count of all rows that share the combination of column and row values.

population.pivot('Male/Female','AgeGroup')

AgeGroup Female Male Total

Adult 81 81 81

Baby 2 2 2

Child 11 11 11

Teen 7 7 7

An optional third argument indicates a column of values that will replace the counts in each cell of the grid. The fourth argument indicates how to aggregate all of the values that match the combination of column and row values.

pivoted = population.pivot('Male/Female', 'AgeGroup', 'Population', sum)

pivoted


AgeGroup Female Male Total

Adult 121754366 114967088 236721454

Baby 3869363 4041110 7910473

Child 21903805 22851851 44755656

Teen 14393035 15076438 29469473

The advantage of pivot is that it places grouped values into adjacent columns, so that they can be combined. For instance, this pivoted table allows us to compute the proportion of each age group that is male. We find the surprising result that younger age groups are predominantly male, but among adults there are substantially more females.

pivoted.with_column('Male Percentage', pivoted.column('Male') / pivoted.column('Total'))

AgeGroup Female Male Total MalePercentage

Adult 121754366 114967088 236721454 48.57%

Baby 3869363 4041110 7910473 51.09%

Child 21903805 22851851 44755656 51.06%

Teen 14393035 15076438 29469473 51.16%


Chapter 2

Visualizations

While we have been able to make several interesting observations about data by simply running our eyes down the numbers in a table, our task becomes much harder as tables become larger. In data science, a picture can be worth a thousand numbers.

We saw an example of this earlier in the text, when we examined John Snow's map of cholera deaths in London in 1854.

Snow showed each death as a black mark at the location where the death occurred. In doing so, he plotted three variables (the number of deaths, as well as two coordinates for each location) in a single graph without any ornaments or flourishes. The simplicity of his presentation focuses the viewer's attention on his main point, which was that the deaths were centered around the Broad Street pump.

In 1869, a French civil engineer named Charles Joseph Minard created what is still considered one of the greatest graphs of all time. It shows the decimation of Napoleon's army during its retreat from Moscow. In 1812, Napoleon had set out to conquer Russia, with over 400,000 men in his army. They did reach Moscow, but were plagued by losses along the way; the Russian army kept retreating farther and farther into Russia, deliberately burning fields and destroying villages as it retreated. This left the French army without food or shelter as the brutal Russian winter began to set in. The French army turned back without a decisive victory in Moscow. The weather got colder, and more men died. Only 10,000 returned.

The graph is drawn over a map of eastern Europe. It starts at the Polish-Russian border at the left end. The light brown band represents Napoleon's army marching towards Moscow, and the black band represents the army returning. At each point of the graph, the width of the band is proportional to the number of soldiers in the army. At the bottom of the graph, Minard includes the temperatures on the return journey.

Notice how narrow the black band becomes as the army heads back. The crossing of the Berezina river was particularly devastating; can you spot it on the graph?

The graph is remarkable for its simplicity and power. In a single graph, Minard shows six variables:

the number of soldiers
the direction of the march
the two coordinates of location
the temperature on the return journey
the location on specific dates in November and December


Edward Tufte, Professor at Yale and one of the world's experts on visualizing quantitative information, says that Minard's graph is "probably the best statistical graphic ever drawn."

The technology of our times allows us to include animation and color. Used judiciously, without excess, these can be extremely informative, as in this animation by Gapminder.org of annual carbon dioxide emissions in the world over the past two centuries.

A caption says, "Click Play to see how USA becomes the largest emitter of CO2 from 1900 onwards." The graph of total emissions shows that China has higher emissions than the United States. However, in the graph of per capita emissions, the US is higher than China, because China's population is much larger than that of the United States.

Technology can be a help as well as a hindrance. Sometimes, the ability to create a fancy picture leads to a lack of clarity in what is being displayed. Inaccurate representation of numerical information, in particular, can lead to misleading messages.

Here, from the Washington Post in the early 1980s, is a graph that attempts to compare the earnings of doctors with the earnings of other professionals over a few decades. Do we really need to see two heads (one with a stethoscope) on each bar? Tufte coined the term "chartjunk" for such unnecessary embellishments. He also deplores the "low data-to-ink ratio" which this graph unfortunately possesses.


Most importantly, the horizontal axis of the graph is not drawn to scale. This has a significant effect on the shape of the bar graphs. When drawn to scale and shorn of decoration, the graphs reveal trends that are quite different from the apparently linear growth in the original. The elegant graph below is due to Ross Ihaka, one of the originators of the statistical system R.


Bar Charts

Categorical Data and Bar Charts

Now that we have examined several graphics produced by others, it is time to produce some of our own. We will start with bar charts, a type of graph with which you might already be familiar. A bar chart shows the distribution of a categorical variable, that is, a variable whose values are categories. In human populations, examples of categorical variables include gender, ethnicity, marital status, country of citizenship, and so on.

A bar chart consists of a sequence of rectangular bars, one corresponding to each category. The length of each bar is proportional to the number of entries in the corresponding category.

The Kaiser Family Foundation has compiled Census data on the distribution of race and ethnicity in the United States.

http://kff.org/other/state-indicator/distribution-by-raceethnicity/

http://kff.org/other/state-indicator/children-by-raceethnicity/

Here are some of the data, arranged by state. The table children contains data for people who were younger than 18 years old in 2014; everyone contains the data for people of all ages in 2014. Some missing values in these tables are represented as 0.

everyone=Table.read_table('kaiser_ethnicity_everyone.csv')

children=Table.read_table('kaiser_ethnicity_children.csv')

children.set_format(2,NumberFormatter)

everyone.set_format(2,NumberFormatter)


State Race/Ethnicity Population

Alabama White 3,167,600

Alabama Black 1,269,200

Alabama Hispanic 191,000

Alabama Asian 77,300

Alabama AmericanIndian/AlaskaNative 0

Alabama TwoOrMoreRaces 56,100

Alabama Total 4,768,000

Alaska White 396,400

Alaska Black 17,000

Alaska Hispanic 59,200

...(347rowsomitted)

The method barh takes two arguments, a column containing categories and a column containing numeric quantities. It generates a horizontal bar chart. This chart focuses on California.

everyone.where('State', 'California').barh('Race/Ethnicity', 'Population')

The table of children also contains information about California, although there are fewer categories of race/ethnicity.

children.where('State','California')


State Race/Ethnicity Population

California White 2,786,000

California Black 484,200

California Hispanic 4,849,400

California Other 1,590,100

California Total 9,709,700

Again, we can display these data in a horizontal bar chart. The barh method can take column indices as well as labels. This chart looks similar, but involves fewer categories because the Kaiser Family Foundation only provides these data about children.

children.where('State','California').barh(1,2)

It's not obvious how to compare these two charts. To help us draw conclusions, we will combine these two tables into one, which contains both everyone and children as separate columns. This transformation will require several steps. First, to ensure that values are comparable, we will divide each population by the total for the state to compute state proportions.

totals = everyone.join('State', everyone.where(1, 'Total').relabeled(2, 'State Total').select([0, 2]))

totals


State Race/Ethnicity Population StateTotal

Alabama White 3167600 4768000

Alabama Black 1269200 4768000

Alabama Hispanic 191000 4768000

Alabama Asian 77300 4768000

Alabama AmericanIndian/AlaskaNative 0 4768000

Alabama TwoOrMoreRaces 56100 4768000

Alabama Total 4768000 4768000

Alaska White 396400 695700

Alaska Black 17000 695700

Alaska Hispanic 59200 695700

...(347rowsomitted)

proportions = everyone.select([0, 1]).with_column('Proportion', totals.column(2) / totals.column(3))

proportions.set_format(2,PercentFormatter)

State Race/Ethnicity Proportion

Alabama White 66.43%

Alabama Black 26.62%

Alabama Hispanic 4.01%

Alabama Asian 1.62%

Alabama AmericanIndian/AlaskaNative 0.00%

Alabama TwoOrMoreRaces 1.18%

Alabama Total 100.00%

Alaska White 56.98%

Alaska Black 2.44%

Alaska Hispanic 8.51%

...(347rowsomitted)

This same transformation can be applied to the children table. Below, we chain together several table methods in order to express the transformation in a single long line. Typically, it's best to express transformations in pieces (above), but programming languages are quite flexible in allowing programmers to combine expressions (below).

children_proportions = children.select([0, 1]).with_column(
    'Children Proportion',
    children.column(2) / children.join('State', children.where(1, 'Total').relabeled(2, 'State Total').select([0, 2])).column(3))

children_proportions.set_format(2,PercentFormatter)

State Race/Ethnicity ChildrenProportion

Alabama White 59.99%

Alabama Black 30.50%

Alabama Hispanic 6.41%

Alabama Other 3.09%

Alabama Total 100.00%

Alaska White 46.00%

Alaska Black 2.60%

Alaska Hispanic 11.99%

Alaska Other 39.36%

Alaska Total 100.00%

...(245rowsomitted)

The following function takes the name of a state. It joins together the proportions for everyone and children into a single table. When the join method combines two columns with different values, only the matching elements are retained. In this case, any race/ethnicity not represented in both tables will be excluded.

def compare(state):
    prop_for_state = proportions.where('State', state).drop(0)
    children_prop_for_state = children_proportions.where('State', state).drop(0)
    joined = prop_for_state.join('Race/Ethnicity', children_prop_for_state)
    joined.set_format([1, 2], PercentFormatter)
    return joined.sort('Proportion')

compare('California')


Race/Ethnicity Proportion ChildrenProportion

Black 5.34% 4.99%

Hispanic 38.21% 49.94%

White 39.20% 28.69%

Total 100.00% 100.00%

The barh method can display the data from multiple columns on a single chart. When called with only one argument indicating the column of categories, all other columns are displayed as sets of bars.

compare('California').barh('Race/Ethnicity')

This chart shows the changing composition of the state of California. In 2014, fewer than 40% of Californians were Hispanic. But 50% of the children were, which is the crucial ingredient in predictions that Hispanics will eventually be a majority in California.

What about other states? Our function allows quick visual comparisons for any state.

compare('Minnesota').barh('Race/Ethnicity')


compare('Vermont').barh('Race/Ethnicity')


Histograms

Quantitative Data and Histograms

Many of the variables that data scientists study are quantitative. For instance, we can study the amount of revenue earned by movies in recent decades. Our source is the Internet Movie Database, an online database that contains information about movies, television shows, video games, and so on.

The table top consists of U.S.A.'s top grossing movies (http://www.boxofficemojo.com/alltime/adjusted.htm) of all time. The first column contains the title of the movie; Star Wars: The Force Awakens has the top rank, with a box office gross amount of more than 900 million dollars in the United States. The second column contains the name of the studio; the third contains the U.S. box office gross in dollars; the fourth contains the gross amount that would have been earned from ticket sales at 2016 prices; and the fifth contains the release year of the movie.

There are 200 movies on the list. Here are the top ten.

top=Table.read_table('top_movies.csv')

top.set_format([2,3],NumberFormatter)


Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars: The Force Awakens | Buena Vista (Disney) | 906,723,418 | 906,723,400 | 2015
Avatar | Fox | 760,507,625 | 846,120,800 | 2009
Titanic | Paramount | 658,672,302 | 1,178,627,900 | 1997
Jurassic World | Universal | 652,270,625 | 687,728,000 | 2015
Marvel's The Avengers | Buena Vista (Disney) | 623,357,910 | 668,866,600 | 2012
The Dark Knight | Warner Bros. | 534,858,444 | 647,761,600 | 2008
Star Wars: Episode I - The Phantom Menace | Fox | 474,544,677 | 785,715,000 | 1999
Star Wars | Fox | 460,998,007 | 1,549,640,500 | 1977
Avengers: Age of Ultron | Buena Vista (Disney) | 459,005,868 | 465,684,200 | 2015
The Dark Knight Rises | Warner Bros. | 448,139,099 | 500,961,700 | 2012

... (190 rows omitted)

Three-digit numbers are easier to work with than nine-digit numbers. So, we will build a table called millions that contains the adjusted gross box office revenue in millions of dollars.

millions = top.select(0).with_column('Gross', np.round(top.column(3) / 1e6, 2))
millions


Title | Gross
Star Wars: The Force Awakens | 906.72
Avatar | 846.12
Titanic | 1178.63
Jurassic World | 687.73
Marvel's The Avengers | 668.87
The Dark Knight | 647.76
Star Wars: Episode I - The Phantom Menace | 785.72
Star Wars | 1549.64
Avengers: Age of Ultron | 465.68
The Dark Knight Rises | 500.96

... (190 rows omitted)

The hist method generates a histogram of the values in a column. The optional unit argument is used to label the axes.

millions.hist('Gross', unit="Million Dollars")

The figure above shows the distribution of the amounts grossed, in millions of dollars. The amounts have been grouped into contiguous intervals called bins. Although in this dataset no movie grossed an amount that is exactly on the edge between two bins, it is worth noting that hist has an endpoint convention: bins include the data at their left endpoint, but not the data at their right endpoint. Sometimes, adjustments have to be made in the first or last bin, to ensure that the smallest and largest values of the variable are included. You saw an example of such an adjustment in the Census data used in the Tables section, where an age of "100" years actually meant "100 years old or older."

We can see that there are 10 bins (some bars are so low that they are hard to see), and that they all have the same width. We can also see that the list contains no movie that grossed less than 300 million dollars; that is because we are considering only the top grossing movies of all time. It is a little harder to see exactly where the edges of the bins are placed. For example, it is not clear exactly where the value 500 lies on the horizontal axis, and so it is hard to judge exactly where the first bar ends and the second begins.

The optional argument bins can be used with hist to specify the edges of the bars. It must consist of a sequence of numbers that includes the left end of the first bar and the right end of the last bar. As the highest adjusted gross amount is somewhat over 1,750 on this scale, we will start by setting bins to be the array consisting of the numbers 300, 400, 500, and so on, ending with 2000.

millions.hist('Gross', unit="Million Dollars", bins=np.arange(300, 2001, 100))

This figure is easier to read. On the horizontal axis, the labels 200, 400, 600, and so on are centered at the corresponding values. The tallest bar is for movies that grossed between 300 million and 400 million (adjusted) dollars; the bar for 400 to 500 million is about 5/8ths as high. The height of each bar is proportional to the number of movies that fall within the x-axis range of the bar.

A very small number grossed more than 700 million dollars. This results in the figure being "skewed to the right," or, less formally, having "a long right hand tail." Distributions of variables like income or rent often have this kind of shape.


The counts of values in different ranges can be computed from a table using the bin method, which takes a column label or index and an optional sequence or number of bins. The result is a tabular form of a histogram; it lists the counts of all values in the 'Gross' column that are greater than or equal to the indicated bin, but less than the next bin's starting value.

bins=millions.bin('Gross',bins=np.arange(300,2001,100))

bins

bin  | Gross count
300  | 81
400  | 52
500  | 28
600  | 16
700  | 7
800  | 5
900  | 3
1000 | 1
1100 | 3
1200 | 2

... (8 rows omitted)

The vertical axis of a histogram is the density scale. The height of each bar is the proportion of elements that fall into the corresponding bin, divided by the width of the bin. For example, the height of the bar for the bin from 300 to 400, which contains 81 of the 200 movies, is

    (81/200) / 100 = 0.00405, that is, 0.405% per million dollars.

starts=bins.column(0)

counts=bins.column(1)

counts.item(0)/sum(counts)/(starts.item(1)-starts.item(0))

0.0040500000000000006


The total area of this bar is 0.405, 100 times greater than its height because its width is 100. This area is the proportion of all values in the Gross column that fall into the 300-400 bin. The areas of all bars in a histogram sum to 1.

A histogram may contain bins of unequal size, but the areas of all bars will still sum to 1. Below, the values in the Gross column are binned into unequal categories.

millions.hist('Gross', unit="Million Dollars", bins=[300, 400, 600, 1500])

Although the ranges 300-400 and 400-600 have nearly identical counts, the former is twice as tall as the latter because it is only half as wide. The density of values in that range is twice as high. Histograms display in visual form where on the number line the values are most concentrated.

millions.bin('Gross',bins=[300,400,600,1500])

bin  | Gross count
300  | 81
400  | 80
600  | 37
1500 | 0
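The density heights in this unequal-bin figure can be checked by hand. Here is a minimal sketch in plain Python (not part of the original text), using the counts from the table above and normalizing over the binned movies:

```python
# Density scale with unequal bins: height = proportion / width.
bin_edges = [300, 400, 600, 1500]   # bin edges, in millions of dollars
counts = [81, 80, 37]               # movies in each bin, from the table above

total = sum(counts)
widths = [bin_edges[i + 1] - bin_edges[i] for i in range(len(counts))]
heights = [c / total / w for c, w in zip(counts, widths)]

# [300, 400) is about twice as tall as [400, 600): similar counts, half the width.
print(heights)

# Each bar's area (height * width) is the bin's proportion; the areas sum to 1.
areas = [h * w for h, w in zip(heights, widths)]
assert abs(sum(areas) - 1) < 1e-12
```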

Histogram of Counts. It is possible to display counts directly in a chart, using the normed=False option of the hist method. The resulting chart has the same shape when bins all have equal widths, but the bar areas do not sum to 1.


millions.hist('Gross',bins=np.arange(300,2001,100),normed=False)

While the count scale is perhaps more natural to interpret than the density scale, the chart becomes highly misleading when bins have different widths. Below, it appears (due to the count scale) that high-grossing movies are quite common, when in fact we have seen that they are relatively rare.

millions.hist('Gross', bins=[300, 400, 500, 600, 1500], normed=False)

Even though the method used is called hist, the figure above is NOT A HISTOGRAM. It gives the impression that movies are evenly distributed in the 300-600 range, but they are not. Moreover, it exaggerates the density of movies grossing more than 600 million. The height of each bar is simply the number of movies in the bin, plotted without accounting for the difference in the widths of the bins. The picture becomes even more skewed if the last two bins are combined.

millions.hist('Gross',bins=[300,400,1500],normed=False)

In this count-based chart, the shape of the distribution of movies is lost entirely.

So what is a histogram?

The figure above shows that what the eye perceives as "big" is area, not just height. This is particularly important when the bins are of different widths.

That is why a histogram has two defining properties:

1. The bins are contiguous (though some might be empty) and are drawn to scale.
2. The area of each bar is proportional to the number of entries in the bin.

Property 2 is the key to drawing a histogram, and is usually achieved as follows:

    area of bar = proportion of entries in the bin

When drawn using this method, the histogram is said to be drawn on the density scale, and the total area of the bars is equal to 1.

To calculate the height of each bar, use the fact that the bar is a rectangle:

    area of bar = height of bar x width of bin

and so

    height of bar = area of bar / width of bin = proportion of entries in the bin / width of bin


The level of detail, and the flat tops of the bars

Even though the density scale correctly represents proportions using area, some detail is lost by placing different values into the same bins.

Take another look at the [300, 400) bin in the figure below. The flat top of the bar, at a level near 0.4%, hides the fact that the movies are somewhat unevenly distributed across the bin.

millions.hist('Gross',bins=[300,400,600,1500])

To see this, let us split the [300, 400) bin into ten narrower bins of width 10 million dollars each:

millions.hist('Gross', bins=[300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400])


Some of the skinny bars are taller than 0.4% and others are shorter. By putting a flat top at 0.4% over the whole bin, we are deciding to ignore the finer detail and use the flat level as a rough approximation. Often, though not always, this is sufficient for understanding the general shape of the distribution.

Notice that because we have the entire dataset, we can draw the histogram in as fine a level of detail as the data and our patience will allow. However, if you are looking at a histogram in a book or on a website, and you don't have access to the underlying dataset, then it becomes important to have a clear understanding of the "rough approximation" of the flat tops.

The density scale

The height of each bar is a proportion divided by a bin width. Thus, for this dataset, the values on the vertical axis are "proportions per million dollars." To understand this better, look again at the [300, 400) bin. The bin is 100 million dollars wide. So we can think of it as consisting of 100 narrow bins that are each 1 million dollars wide. The bar's height of roughly "0.004 per million dollars" means that in each of those 100 skinny bins of width 1 million dollars, the proportion of movies is roughly 0.004.

Thus the height of a histogram bar is a proportion per unit on the horizontal axis, and can be thought of as the density of entries per unit width.

millions.hist('Gross',bins=[300,350,400,450,1500])


Density Q&A

Look again at the histogram, and this time compare the [400, 450) bin with the [450, 1500) bin.

Q: Which has more movies in it?

A: The [450, 1500) bin. It has 92 movies, compared with 25 movies in the [400, 450) bin.

Q: Then why is the [450, 1500) bar so much shorter than the [400, 450) bar?

A: Because height represents density per unit width, not the number of movies in the bin. The [450, 1500) bin has more movies than the [400, 450) bin, but it is also a whole lot wider. So the density is much lower.

millions.bin('Gross',bins=[300,350,400,450,1500])

bin  | Gross count
300  | 32
350  | 49
400  | 25
450  | 92
1500 | 0

Bar chart or histogram?

Bar charts display the distributions of categorical variables. All the bars in a bar chart have the same width. The lengths (or heights, if the bars are drawn vertically) of the bars are proportional to the number of entries.


Histograms display the distributions of quantitative variables. The bars can have different widths. The areas of the bars are proportional to the number of entries.

Multiple histograms

In all the examples in this section, we have drawn a single histogram. However, if a data table contains several columns, then barh and hist can be used to draw several charts at once. We will cover this feature in a later section.


Chapter 3

Sampling

Sampling is a powerful computational technique in data analysis. It allows us to investigate both issues of probability and statistics. Probability is the study of the outcomes of random processes; processes that include chance. Statistics is just the opposite: it is the study of random processes given observations of their outcomes. Put another way, probability focuses on what will be observed given what we know about the world. Statistics focuses on what we can infer about the world given what we have observed. Random sampling is a central tool in both directions of inference.


Sampling

In this section, we will continue to use the top_movies.csv dataset.

top=Table.read_table('top_movies.csv')

top.set_format([2,3],NumberFormatter)

Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars: The Force Awakens | Buena Vista (Disney) | 906,723,418 | 906,723,400 | 2015
Avatar | Fox | 760,507,625 | 846,120,800 | 2009
Titanic | Paramount | 658,672,302 | 1,178,627,900 | 1997
Jurassic World | Universal | 652,270,625 | 687,728,000 | 2015
Marvel's The Avengers | Buena Vista (Disney) | 623,357,910 | 668,866,600 | 2012
The Dark Knight | Warner Bros. | 534,858,444 | 647,761,600 | 2008
Star Wars: Episode I - The Phantom Menace | Fox | 474,544,677 | 785,715,000 | 1999
Star Wars | Fox | 460,998,007 | 1,549,640,500 | 1977
Avengers: Age of Ultron | Buena Vista (Disney) | 459,005,868 | 465,684,200 | 2015
The Dark Knight Rises | Warner Bros. | 448,139,099 | 500,961,700 | 2012

... (190 rows omitted)

Sampling rows of a table

Deterministic Samples

When you simply specify which elements of a set you want to choose, without any chances involved, you create a deterministic sample.


A deterministic sample from the rows of a table can be constructed using the take method. Its argument is a sequence of integers, and it returns a table containing the corresponding rows of the original table.

The code below returns a table consisting of the rows indexed 3, 18, and 100 in the table top. Since the original table is sorted by rank, which begins at 1, the zero-indexed rows always have a rank that is one greater than the index. However, the take method does not inspect the information in the rows to construct its sample.

top.take([3,18,100])

Title | Studio | Gross | Gross (Adjusted) | Year
Jurassic World | Universal | 652,270,625 | 687,728,000 | 2015
Spider-Man | Sony | 403,706,375 | 604,517,300 | 2002
Gone with the Wind | MGM | 198,676,459 | 1,757,788,200 | 1939

We can select evenly spaced rows by calling np.arange and passing the result to take. In the example below, we start with the first row (index 0) of top, and choose every 40th row after that until we reach the end of the table.

top.take(np.arange(0,top.num_rows,40))

Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars: The Force Awakens | Buena Vista (Disney) | 906,723,418 | 906,723,400 | 2015
Shrek the Third | Paramount/Dreamworks | 322,719,944 | 408,090,600 | 2007
Bruce Almighty | Universal | 242,829,261 | 350,350,700 | 2003
Three Men and a Baby | Buena Vista (Disney) | 167,780,960 | 362,822,900 | 1987
Saturday Night Fever | Paramount | 94,213,184 | 353,261,200 | 1977

Probability Samples

Much of data science consists of making conclusions based on the data in random samples. Correctly interpreting analyses based on random samples requires data scientists to examine exactly what random samples are.


A population is the set of all elements from whom a sample will be drawn.

A probability sample is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.

In a probability sample, all elements need not have the same chance of being chosen. For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:

Person A is chosen with probability 1. One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.

This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:

A: 1
B: 1/2
C: 1/2
AB: 1/2
AC: 1/2
BC: 0
ABC: 0

Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.
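A short simulation can confirm these chances. This sketch (plain Python, not part of the original text) draws the sample many times and checks the frequency of the subset {A, B}:

```python
import random

# Simulate the scheme: A is always chosen, and a fair coin picks B or C.
random.seed(0)
trials = 100000
ab_count = 0
for _ in range(trials):
    sample = {'A', 'B'} if random.random() < 0.5 else {'A', 'C'}
    if sample == {'A', 'B'}:
        ab_count += 1

# The chance that the sample is exactly {A, B} is 1/2.
print(ab_count / trials)  # close to 0.5
```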

To draw a probability sample, we need to be able to choose elements according to a process that involves chance. A basic tool for this purpose is a random number generator. There are several in Python. Here we will use one that is part of the module random, which in turn is part of the module numpy.

The method randint, when given two arguments low and high, returns an integer picked uniformly at random between low and high, including low but excluding high. Run the code below several times to see the variability in the integers that are returned.

np.random.randint(3, 8)  # select once at random from 3, 4, 5, 6, 7

7

A Systematic Sample


Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a systematic sample.

Here we will choose a systematic sample of the rows of top. We will start by picking one of the first 10 rows at random, and then we will pick every 10th row after that.

"""Choosearandomstartamongrows0through9;

thentakeevery10throw."""

start=np.random.randint(0,10)

top.take(np.arange(start,top.num_rows,10))

Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars | Fox | 460,998,007 | 1,549,640,500 | 1977
The Hunger Games | Lionsgate | 408,010,692 | 442,510,400 | 2012
The Passion of the Christ | NM | 370,782,930 | 519,432,100 | 2004
Alice in Wonderland (2010) | Buena Vista (Disney) | 334,191,110 | 365,718,600 | 2010
Star Wars: Episode II - Attack of the Clones | Fox | 310,676,740 | 465,175,700 | 2002
The Sixth Sense | Buena Vista (Disney) | 293,506,292 | 500,938,400 | 1999
The Hangover | Warner Bros. | 277,322,503 | 323,343,300 | 2009
Raiders of the Lost Ark | Paramount | 248,159,971 | 770,183,000 | 1981
The Lost World: Jurassic Park | Universal | 229,086,679 | 434,216,600 | 1997
Austin Powers: The Spy Who Shagged Me | New Line | 206,040,086 | 352,863,900 | 1999

... (10 rows omitted)

Run the code a few times to see how the output varies.

This systematic sample is a probability sample. In this scheme, all rows do have the same chance of being chosen. But that is not true of other subsets of the rows. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen. The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance 1/10.
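This reasoning can be checked with a quick simulation. The sketch below (plain Python, not from the text) repeats the random-start scheme and tracks how often one particular row is selected:

```python
import random

# A systematic sample is determined entirely by the random start:
# row r is selected exactly when start == r % 10, so its chance is 1/10.
random.seed(0)
trials = 100000
row = 137                      # an arbitrary row to track
hits = 0
for _ in range(trials):
    start = random.randrange(10)
    if row % 10 == start:
        hits += 1

print(hits / trials)  # close to 0.1
```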

Random sample with replacement

Some of the simplest probability samples are formed by drawing repeatedly, uniformly at random, from the list of elements of the population. If the draws are made without changing the list between draws, the sample is called a random sample with replacement. You can imagine making the first draw at random, replacing the element drawn, and then drawing again.

In a random sample with replacement, each element in the population has the same chance of being drawn, and each can be drawn more than once in the sample.

If you want to draw a sample of people at random to get some information about a population, you might not want to sample with replacement – drawing the same person more than once can lead to a loss of information. But in data science, random samples with replacement arise in two major areas:

studying probabilities by simulating tosses of a coin, rolls of a die, or gambling games

creating new samples from a sample at hand

The second of these areas will be covered later in the course. For now, let us study some long-run properties of probabilities.

We will start with the table die, which contains the numbers of spots on the faces of a die. All the numbers appear exactly once, as we are assuming that the die is fair.

die = Table().with_column('Face', [1, 2, 3, 4, 5, 6])
die

Face

1

2

3

4

5

6

Drawing the histogram of this simple set of numbers yields an unsettling figure as hist makes a default choice of bins:


die.hist()

The numbers 1, 2, 3, 4, 5, 6 are integers, so the bins chosen by hist only have entries at the edges. In such a situation, it is a better idea to select bins so that they are centered on the integers. This is often true of histograms of data that are discrete, that is, variables whose successive values are separated by the same fixed amount. The real advantage of this method of bin selection will become more clear when we start imagining smooth curves drawn over histograms.

dice_bins = np.arange(0.5, 7, 1)
die.hist(bins=dice_bins)

Notice how each bin has width 1 and is centered on an integer. Notice also that because the width of each bin is 1, the height of each bar is 1/6, the chance that the corresponding face appears.


The histogram shows the probability with which each face appears. It is called a probability histogram of the result of one roll of a die.

The histogram was drawn without rolling any dice or generating any random numbers. We will now use the computer to mimic actually rolling a die. The process of using a computer program to produce the results of a chance experiment is called simulation.

To roll the die, we will use a method called sample. This method returns a new table consisting of rows selected uniformly at random from a table. Its first argument is the number of rows to be returned. Its second argument is whether or not the sampling should be done with replacement.

The code below simulates 10 rolls of the die. As with all simulations of chance experiments, you should run the code several times and notice the variability in what is returned.

die.sample(10,with_replacement=True)

Face

5

4

2

3

6

5

1

1

6

2

We will now roll the die several times and draw the histogram of the observed results. The histogram of observed results is called an empirical histogram.

The results of rolling the die will all be integers in the range 1 through 6, so we will want to use the same bins as we used for the probability histogram. To avoid writing out the same bin argument every time we draw a histogram, let us define a function called dice_hist that will perform the task for us. The function will take one argument: the number n of dice to roll.


def dice_hist(n):
    rolls = die.sample(n, with_replacement=True)
    rolls.hist(bins=dice_bins)

dice_hist(20)

As we increase the number of rolls in the simulation, the proportions get closer to 1/6.

dice_hist(10000)

The behavior we have observed is an instance of a general rule.

The Law of Averages


If a chance experiment is repeated independently and under identical conditions, then, in the long run, the proportion of times that an event occurs gets closer and closer to the theoretical probability of the event.

For example, in the long run, the proportion of times the face with four spots appears gets closer and closer to 1/6.

Here "independently and under identical conditions" means that every repetition is performed in the same way regardless of the results of all the other repetitions.
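The rule can be illustrated with a quick simulation that uses Python's built-in random module rather than the Table methods above (a sketch, not code from the text):

```python
import random

# Law of Averages sketch: the proportion of fours among n rolls of a
# fair die settles down near 1/6 as n grows.
random.seed(0)
for n in [100, 10000, 1000000]:
    fours = sum(1 for _ in range(n) if random.randint(1, 6) == 4)
    print(n, fours / n)
```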

Convergence of empirical histograms

We have also observed that a random quantity (such as the number of spots on one roll of a die) is associated with two histograms:

a probability histogram, that shows all the possible values of the quantity and all their chances

an empirical histogram, created by simulating the random quantity repeatedly and drawing a histogram of the observed results

We have seen an example of the long-run behavior of empirical histograms:

As the number of repetitions increases, the empirical histogram of a random quantity looks more and more like the probability histogram.

At the Roulette Table

Equipped with our new knowledge about the long-run behavior of chances, let us explore a gambling game. Betting on roulette is popular in gambling centers such as Las Vegas and Monte Carlo, and we will simulate one of the bets here.

The main randomizer in roulette in Nevada is a wheel that has 38 pockets on its rim. Two of the pockets are green, eighteen black, and eighteen red. The wheel is on a spindle, and there is a small ball on it. When the wheel is spun, the ball ricochets around and finally comes to rest in one of the pockets. That is declared to be the winning pocket.

You are allowed to bet on several pre-specified collections of pockets. If you bet on "red," you win if the ball comes to rest in one of the red pockets.

The bet on red pays even money. That is, it pays 1 to 1. To understand what that means, assume you are going to bet $1 on "red." The first thing that happens, even before the wheel is spun, is that you have to hand over your $1. If the ball lands in a green or black pocket, you never see that dollar again. If the ball lands in a red pocket, you get your dollar back (to bring you back to even), plus another $1 in winnings.

The table wheel represents the pockets of a Nevada roulette wheel. It has 38 rows labeled 1 through 38, one row per pocket.

pockets = np.arange(1, 39)
colors = (['red', 'black'] * 5 + ['black', 'red'] * 4) * 2 + ['green', 'green']
wheel = Table().with_columns([
    'pocket', pockets,
    'color', colors])
wheel

pocket color

1 red

2 black

3 red

4 black

5 red

6 black

7 red

8 black

9 red

10 black

...(28rowsomitted)

The function bet_on_red takes a numerical argument x and returns the net winnings on a $1 bet on "red," provided x is the number of a pocket.

def bet_on_red(x):
    """The net winnings of betting on red for outcome x."""
    pockets = wheel.where('pocket', x)
    if pockets.column('color').item(0) == 'red':
        return 1
    else:
        return -1


bet_on_red(17)

-1

The function spins takes a numerical argument n and returns a new table consisting of n rows of wheel sampled at random with replacement. In other words, it simulates the results of n spins of the roulette wheel.

def spins(n):
    return wheel.sample(n, with_replacement=True)

We will create a table of the results of 500 spins, and add a column that shows the net winnings on a $1 bet placed on "red." Recall that the apply method applies a function to each element in a column of a table.

outcomes = spins(500)
results = outcomes.with_column(
    'winnings', outcomes.apply(bet_on_red, 'pocket'))
results

pocket color winnings

13 black -1

22 black -1

38 green -1

11 black -1

29 black -1

19 red 1

38 green -1

37 green -1

19 red 1

31 black -1

...(490rowsomitted)

And here is the net gain on all 500 bets:


sum(results.column('winnings'))

-32

Betting $1 on red hundreds of times seems like a bad idea from a gambler's perspective. But from the casinos' perspective it is excellent. Casinos rely on large numbers of bets being placed. The payoff odds are set so that the more bets that are placed, the more money the casinos are likely to make, even though a few people are likely to go home with winnings.
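The casino's edge can also be computed directly from the pocket counts (18 red, 20 not red). This small check is not from the text, but it agrees with the simulated net loss above:

```python
from fractions import Fraction

# Expected net gain of a $1 bet on red: win $1 with chance 18/38,
# lose $1 with chance 20/38.
p_red = Fraction(18, 38)
expected = p_red * 1 + (1 - p_red) * (-1)

print(expected)  # -1/19: about 5.3 cents lost per dollar bet, on average
assert expected == Fraction(-1, 19)

# Over 500 bets, the expected net loss is about $26.
print(float(500 * expected))
```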

Simple Random Sample – a Random Sample without Replacement

A random sample without replacement is one in which elements are drawn from a list repeatedly, uniformly at random, at each stage deleting from the list the element that was drawn.

A random sample without replacement is also called a simple random sample. All elements of the population have the same chance of entering a simple random sample. All pairs have the same chance as each other, as do all triples, and so on.

The default action of sample is to draw without replacement. In card games, cards are almost always dealt without replacement. Let us use sample to deal cards from a deck.

A standard deck consists of 13 ranks of cards in each of four suits. The suits are called spades, clubs, diamonds, and hearts. The ranks are Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, and King. Spades and clubs are black; diamonds and hearts are red. The Jacks, Queens, and Kings are called 'face cards.'

The table deck contains all 52 cards, one row per card. The abbreviations for the ranks are:

Ace: A
Jack: J
Queen: Q
King: K

Below, we use the product function to produce all possible pairs of a suit and a rank. The suits are represented by special characters used to print bridge and poker articles in newspapers.


from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['♠', '♥', '♦', '♣']
cards = product(ranks, suits)
deck = Table(['rank', 'suit']).with_rows(cards)
deck

rank suit

A ♠

A ♥

A ♦

A ♣

2 ♠

2 ♥

2 ♦

2 ♣

3 ♠

3 ♥

...(42rowsomitted)

A poker hand is 5 cards dealt at random from the deck. The expression below deals a hand. Execute the cell a few times; how often are you dealt a pair?

deck.sample(5)

rank suit

5 ♥

4 ♣

9 ♥

3 ♣

10 ♣


A hand is a set of five cards, regardless of the order in which they appeared. For example, the hand 9♣ 9♥ Q♣ 7♣ 8♥ is the same as the hand 7♣ 9♣ 8♥ 9♥ Q♣.

This fact can be used to show that simple random sampling can be thought of in two equivalent ways:

drawing elements one by one at random without replacement

randomly permuting (that is, shuffling) the whole list, and then pulling out a set of elements at the same time
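The two views above can be sketched with Python's standard random module. This example uses a list of 52 numbers standing in for the deck; it is an illustration, not the text's Table code:

```python
import random

random.seed(0)
population = list(range(52))   # stand-ins for the 52 cards

# Way 1: draw 5 elements one by one at random, without replacement.
one_by_one = random.sample(population, 5)

# Way 2: shuffle the whole list, then pull out the first 5 at once.
shuffled = population[:]
random.shuffle(shuffled)
first_five = shuffled[:5]

# Either way, the result is 5 distinct elements of the population,
# and every set of 5 is equally likely.
assert len(set(one_by_one)) == 5
assert len(set(first_five)) == 5
```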


Iteration

Repetition

It is often the case when programming that you will wish to repeat the same operation multiple times, perhaps with slightly different behavior each time. You could copy-paste the code 10 times, but that's tedious and prone to typos, and if you wanted to do it a thousand times (or a million times), forget it.

A better solution is to use a for statement to loop over the contents of a sequence. A for statement begins with the word for, followed by a name for the item in the sequence, followed by the word in, and ending with an expression that evaluates to a sequence. The indented body of the for statement is executed once for each item in that sequence.

for i in np.arange(5):
    print(i)

0

1

2

3

4

A typical use of a for statement is to build up a table by repeating a random computation many times and storing each result as a new row. The append method of a table takes a sequence and adds a new row. It's different from with_row because a new table is not created; instead, the original table is extended. The cell below draws 100 cards, but keeps only the aces.


aces = Table(['Rank', 'Suit'])
for i in np.arange(100):
    card = deck.row(np.random.randint(deck.num_rows))
    if card.item(0) == 'A':
        aces.append(card)
aces

Rank Suit

A ♠

A ♣

A ♥

A ♦

A ♠

A ♦

This pattern can be used to track the results of repeated experiments. For example, perhaps we want to learn about the empirical properties of some randomly drawn poker hands. Below, we track whether the hand contains four-of-a-kind or five cards of the same suit (called a flush).

hands = Table(['Four-of-a-kind', 'Flush'])
for i in np.arange(10000):
    hand = deck.sample(5)
    four_of_a_kind = max(hand.group('rank').column('count')) == 4
    flush = max(hand.group('suit').column('count')) == 5
    hands.append([four_of_a_kind, flush])
hands


Four-of-a-kind Flush

False False

False False

False False

False False

False False

False False

False False

False False

False False

False False

...(9990rowsomitted)

A for statement can also iterate over a sequence of labels. We can use this feature to summarize the results of our experiment. These are rare hands indeed!

for label in hands.labels:
    success = np.count_nonzero(hands.column(label))
    print('A', label, 'was drawn', success, 'of', hands.num_rows, 'times')

A Four-of-a-kind was drawn 2 of 10000 times

A Flush was drawn 22 of 10000 times
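For comparison, the exact chances can be computed by counting hands. This sketch (using the standard-library math.comb, not code from the text) agrees well with the simulated counts above:

```python
from math import comb

total_hands = comb(52, 5)          # number of possible 5-card hands

# Four of a kind: pick the rank (13 ways), then 1 of the other 48 cards.
four_of_a_kind = 13 * 48 / total_hands

# Flush: pick the suit (4 ways), then 5 of its 13 cards.
# (This count includes straight flushes, matching the text's definition.)
flush = 4 * comb(13, 5) / total_hands

print(four_of_a_kind)  # about 0.00024
print(flush)           # about 0.0020
```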

Randomized response

Next, we'll look at a technique that was designed several decades ago to help conduct surveys of sensitive subjects. Researchers wanted to ask participants a few questions: Have you ever had an affair? Do you secretly think you are gay? Have you ever shoplifted? Have you ever sung a Justin Bieber song in the shower? They figured that some people might not respond honestly, because of the social stigma associated with answering "yes". So, they came up with a clever way to estimate the fraction of the population who are in the "yes" camp, without violating anyone's privacy.


Here's the idea. We'll instruct the respondent to roll a fair 6-sided die, secretly, where no one else can see it. If the die comes up 1, 2, 3, or 4, then the respondent is supposed to answer honestly. If it comes up 5 or 6, the respondent is supposed to answer the opposite of what their true answer would be. But they shouldn't reveal what came up on their die.

Notice how clever this is. Even if the person says "yes", that doesn't necessarily mean their true answer is "yes" -- they might very well have just rolled a 5 or 6. So the responses to the survey don't reveal any one individual's true answer. Yet, in aggregate, the responses give enough information that we can get a pretty good estimate of the fraction of people whose true answer is "yes".

Let's try a simulation, so we can see how this works. We'll write some code to perform this operation. First, a function to simulate rolling one die:

def roll_once():
    return np.random.randint(1, 7)

Now we'll use this to write a function to simulate how someone is supposed to respond to the survey. The argument to the function is their true answer (True or False); the function returns what they're supposed to tell the interviewer.

def respond(true_answer):
    if roll_once() >= 5:
        return not true_answer
    else:
        return true_answer

We can try it. Assume our true answer is 'no'; let's see what happens this time:

respond(False)

False

Of course, if you were to run it many times, you might get a different result each time. Below, we build a table of the responses for many respondents when the true answer is always False.


responses = Table(['Truth', 'Response'])
for i in np.arange(1000):
    responses.append([False, respond(False)])
responses

Truth Response

False False

False False

False False

False False

False False

False False

False True

False False

False True

False False

...(990rowsomitted)

Let's build a bar chart and look at how many True and False responses we get.

responses.group('Response').barh('Response')

responses.where('Response',False).num_rows


656

responses.where('Response',True).num_rows

344

Exercise for you: If N out of 1000 responses are True, approximately what fraction of the population has truly sung a Justin Bieber song in the shower?

Analysis

This method is called "randomized response". It is one way to poll people about sensitive subjects, while still protecting their privacy. You can see how it is a nice example of randomness at work.

It turns out that randomized response has beautiful generalizations. For instance, your Chrome web browser uses it to anonymously report feedback to Google, in a way that won't violate your privacy. That's all we'll say about it for this semester, but if you take an upper-division course, maybe you'll get to see some generalizations of this beautiful technique.

The steps in the randomized response survey can be visualized using a tree diagram. The diagram partitions all the survey respondents according to their true answer and the answer that they eventually give. It also displays the proportions of respondents whose true answers are 1 ("True") and 0 ("False"), as well as the chances that determine the answers that they give. We use p to denote the proportion whose true answer is 1.


The respondents who answer 1 split into two groups. The first group consists of the respondents whose true answer and given answer are both 1. If the number of respondents is large, the proportion in this group is likely to be about 2/3 of p. The second group consists of the respondents whose true answer is 0 and given answer is 1. The proportion in this group is likely to be about 1/3 of 1 - p.

We can observe p̂, the proportion of 1's among the given answers. Thus

    p̂ ≈ (2/3)p + (1/3)(1 - p)

This allows us to solve for an approximate value of p:

    p ≈ 3p̂ - 1

In this way we can use the observed proportion of 1's to "work backwards" and get an estimate of p, the proportion in which we are interested.
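As a quick sanity check of the formula, with a hypothetical true proportion (not data from the survey):

```python
# If the true proportion is p, the expected proportion of "1" answers is
# p_hat = (2/3) * p + (1/3) * (1 - p). Inverting gives p = 3 * p_hat - 1.
p_true = 0.2                                  # hypothetical value
p_hat = (2/3) * p_true + (1/3) * (1 - p_true)
estimate = 3 * p_hat - 1

assert abs(estimate - p_true) < 1e-12
```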

Technical note. It is worth noting the conditions under which this estimate is valid. The calculation of the proportions in the two groups whose given answer is 1 relies on each of the groups being large enough so that the Law of Averages allows us to make estimates about how their dice are going to land. This means that it is not only the total number of respondents that has to be large – the number of respondents whose true answer is 1 has to be large, as does the number whose true answer is 0. For this to happen, p must be neither close to 0 nor close to 1. If the characteristic of interest is either extremely rare or extremely common in the population, the method of randomized response described in this example might not work well.

Let's try out this method on some real data. The chance of drawing a poker hand with no aces is

np.product(np.arange(48, 43, -1) / np.arange(52, 47, -1))

0.65884199833779666

It is quite embarrassing indeed to draw a hand with no aces. The table below contains one column for the truth of whether a hand has no aces and another for the randomized response.

ace_responses = Table(['Truth', 'Response'])

for i in np.arange(10000):
    hand = deck.sample(5)
    no_aces = hand.where('rank', 'A').num_rows == 0
    ace_responses.append([no_aces, respond(no_aces)])

ace_responses


Truth  Response
False  True
True   True
False  True
True   False
True   False
True   True
False  False
True   False
False  False
True   True

... (9990 rows omitted)

Using our derived formula, we can estimate what fraction of hands have no aces.

3 * np.count_nonzero(ace_responses.column('Response')) / 10000 - 1

0.6644000000000001


Estimation

In the previous section we saw that two kinds of histograms can be associated with a quantity generated by a random process.

The probability histogram shows all the possible values of the quantity and the probabilities of all those values. The probabilities are calculated using math and the rules of chance.

The empirical histogram is based on running the random process repeatedly and keeping track of the value of the quantity each time. It shows all the values of the quantity and the proportion of times each value was observed among all the repetitions.

We noted that by the Law of Averages, the empirical histogram is likely to resemble the probability histogram if the random process is repeated a large number of times.

This means that simulating random processes repeatedly is a way of approximating probability distributions without figuring out the probabilities mathematically. Thus computer simulations become a powerful tool in data science. They can help data scientists understand the properties of random quantities that would be complicated to analyze using math alone.

Here is an example of such a simulation.

Estimating the number of enemy aircraft

In World War II, data analysts working for the Allies were tasked with estimating the number of German warplanes. The data included the serial numbers of the German planes that had been observed by Allied forces. These serial numbers gave the data analysts a way to come up with an answer.

To create an estimate of the total number of warplanes, the data analysts had to make some assumptions about the serial numbers. Here are two such assumptions, greatly simplified to make our calculations easier.

1. There are N planes, numbered 1, 2, ..., N.

2. The observed planes are drawn uniformly at random with replacement from the N planes.

The goal is to estimate the number N.


Suppose you observe some planes and note down their serial numbers. How might you use the data to guess the value of N? A natural and straightforward method would be to simply use the largest serial number observed.

Let us see how well this method of estimation works. First, another simplification: Some historians now estimate that the German aircraft industry produced almost 100,000 warplanes of many different kinds, but here we will imagine just one kind. That makes Assumption 1 above easier to justify.

Suppose there are in fact N = 300 planes of this kind, and that you observe 30 of them. We can construct a table called serialno that contains the serial numbers 1 through N. We can then sample 30 times with replacement (see Assumption 2) to get our sample of serial numbers. Our estimate is the maximum of these 30 numbers.

N = 300

serialno = Table().with_column('serial number', np.arange(1, N+1))

serialno

serial number
1
2
3
4
5
6
7
8
9
10

... (290 rows omitted)

max(serialno.sample(30, with_replacement=True).column(0))

296


As with all code involving random sampling, run it a few times to see the variation. You will observe that even with just 30 observations from among 300, the largest serial number is typically in the 250-300 range.

In principle, the largest serial number could be as small as 1, if you were unlucky enough to see Plane Number 1 all 30 times. And it could be as large as 300 if you observe Plane Number 300 at least once. But usually, it seems to be in the very high 200's. It appears that if you use the largest observed serial number as your estimate of the total, you will not be very far wrong.

Let us generate some data to see if we can confirm this. We will use iteration to repeat the sampling procedure numerous times, each time noting the largest serial number observed. These would be our estimates of N from all the numerous samples. We will then draw a histogram of all these estimates, and examine by how much they differ from N.

In the code below, we will run 750 repetitions of the following process: Sample 30 times at random with replacement from 1 through 300 and note the largest number observed.

To do this, we will use a for loop. As you have seen before, we will start by setting up an empty table that will eventually hold all the estimates that are generated. As each estimate is the largest number in its sample, we will call this table maxes.

For each integer (called i in the code) in the range 0 through 749 (750 total), the for loop executes the code in the body of the loop. In this example, it generates a random sample of 30 serial numbers, computes the maximum value, and augments the rows of maxes with that value.

sample_size = 30

repetitions = 750

maxes = Table(['i', 'max'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    maxes.append([i, max(sample.column(0))])

maxes


i  max
0  300
1  300
2  291
3  300
4  297
5  295
6  283
7  296
8  299
9  273

... (740 rows omitted)

Here is a histogram of the 750 estimates. As you can see, the estimates are all crowded up near 300, even though in theory they could be much smaller. The histogram indicates that as an estimate of the total number of planes, the largest serial number might be too low by about 10 to 25. But it is extremely unlikely to underestimate the true number of planes by more than about 50.

every_ten = np.arange(1, N+100, 10)

maxes.hist('max', bins=every_ten)

The empirical distribution of a statistic


In the example above, the largest serial number is called a statistic (singular!). A statistic is any number computed using the data in a sample.

The graph above is an empirical histogram. It displays the empirical or observed distribution of the statistic, based on 750 samples.

The statistic could have been different.

A fundamental consideration in using any statistic based on a random sample is that the sample could have come out differently, and therefore the statistic could have come out differently too.

Just how different could the statistic have been?

If you generate all of the possible samples, and compute the statistic for each of them, then you will have a clear picture of how different the statistic might have been. Indeed, you will have a complete enumeration of all the possible values of the statistic and all their probabilities. The resulting distribution is called the probability distribution of the statistic, and its histogram is the same as the probability histogram.

The probability distribution of a statistic is also called the sampling distribution of the statistic, because it is based on all of the possible samples.

The total number of possible samples is often very large. For example, the total number of possible sequences of 30 serial numbers that you could see if there were 300 aircraft is

300 ** 30

205891132094649000000000000000000000000000000000000000000000000000000000000

That's a lot of samples. Fortunately, we don't have to generate all of them. We know that the empirical histogram of the statistic, based on many but not all of the possible samples, is a good approximation to the probability histogram. The empirical distribution of a statistic gives a good idea of how different the statistic could be.

It is true that the probability distribution of a statistic contains more accurate information about the statistic than an empirical distribution does. But often, as in this example, the approximation provided by the empirical distribution is sufficient for data scientists to understand how much a statistic can vary. And empirical distributions are easier to compute. Therefore, data scientists often use empirical distributions instead of exact probability distributions when they are trying to understand random quantities.

Another Estimate.


Here is an example to illustrate this point. Thus far, we have used the largest observed serial number as an estimate of the total number of planes. But there are other possible estimates, and we will now consider one of them.

The idea underlying this estimate is that the average of the observed serial numbers is likely to be about halfway between 1 and N. Thus, if A is the average, then

$$A ~\approx~ \frac{N}{2}, ~~~ \mbox{that is,} ~~~ N ~\approx~ 2A$$

Thus a new statistic can be used to estimate the total number of planes: take the average of the observed serial numbers and double it.

How does this method of estimation compare with using the largest number observed? It is not easy to calculate the probability distribution of the new statistic. But as before, we can simulate it to get the probabilities approximately. Let's take a look at the empirical distribution based on repeated sampling. The number of repetitions is chosen to be the same as it was for the earlier estimate. This will allow us to compare the two empirical distributions.

new_est = Table(['i', '2*avg'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, 2 * np.average(sample.column(0))])

new_est.hist('2*avg', bins=every_ten)

Notice that unlike the earlier estimate, this one can overestimate the number of planes. This will happen when the average of the observed serial numbers is closer to N than to 1.


For ease of comparison, let us run another set of 750 repetitions of the sampling process and calculate both estimates for each sample.

new_est = Table(['i', 'max', '2*avg'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, max(sample.column(0)), 2 * np.average(sample.column(0))])

new_est.select(['max', '2*avg']).hist(bins=every_ten)

You can see that the old method almost always underestimates; formally, we say that it is biased. But it has low variability, and is highly likely to be close to the true total number of planes.

The new method overestimates about as often as it underestimates, and thus is roughly unbiased on average in the long run. However, it is more variable than the old estimate, and thus is prone to larger absolute errors.

This is an instance of a bias-variance tradeoff that is not uncommon among competing estimates. Which estimate you decide to use will depend on the kinds of errors that matter the most to you. In the case of enemy warplanes, underestimating the total number might have grim consequences, in which case you might choose to use the more variable method that overestimates about half the time. On the other hand, if overestimation leads to needlessly high costs of guarding against planes that don't exist, you might be satisfied with the method that underestimates by a modest amount.
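The tradeoff can be put into numbers with a NumPy-only sketch (separate from the chapter's `Table`-based code): simulate both statistics and compare their long-run bias and their variability, using the same N, sample size, and repetition count as in this section.

```python
import numpy as np

# Simulate both estimates of N and compare bias and spread.
rng = np.random.default_rng(0)
N, sample_size, repetitions = 300, 30, 750

samples = rng.integers(1, N + 1, size=(repetitions, sample_size))
max_est = samples.max(axis=1)        # old method: largest serial number
avg_est = 2 * samples.mean(axis=1)   # new method: twice the average

# bias (long-run average minus N) and SD of each estimate
print('max  ', round(max_est.mean() - N, 1), round(max_est.std(), 1))
print('2*avg', round(avg_est.mean() - N, 1), round(avg_est.std(), 1))
```

The max shows a clearly negative bias but a small SD; twice the average is nearly unbiased but has an SD several times larger.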


Technical note: In fact, twice the average is not unbiased. On average, it overestimates by exactly 1. For example, if N is 3, the average of draws from 1, 2, and 3 will be 2, and 2 times 2 is 4, which is one more than N. Twice the average minus 1 is an unbiased estimator of N.
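The technical note can be checked by simulation: with a large number of repetitions, the long-run average of "twice the sample average" should settle near N + 1, not N. This sketch uses the same illustrative N and sample size as above.

```python
import numpy as np

# Long-run average of 2 * (sample mean) for draws from 1..N with replacement.
rng = np.random.default_rng(0)
N, sample_size, repetitions = 300, 30, 100_000

samples = rng.integers(1, N + 1, size=(repetitions, sample_size))
twice_avg = 2 * samples.mean(axis=1)

print(round(twice_avg.mean(), 1))       # settles near N + 1 = 301
print(round(twice_avg.mean() - 1, 1))   # the unbiased version is near N
```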


Center

We have seen that studying the variability of an estimate involves questions such as:

Where is the empirical histogram centered? About how far away from center could the values be?

In order to discuss estimates and their variability in greater detail, it will help to quantify some basic features of distributions.

The mean of a list of numbers

In this course, we will use the words "average" and "mean" interchangeably.

Definition: The average or mean of a list of numbers is the sum of all the numbers in the list, divided by the number of entries in the list.

Here is the definition applied to a list of 20 numbers.

# The data: a list of numbers
x_list = [4, 4, 4, 5, 5, 5, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]

# The average
sum(x_list) / len(x_list)

3.15

The average of the list is 3.15.

You can think of taking the mean as an "equalizing" or "smoothing" operation. For example, imagine the entries in x_list being the amounts of money (in dollars) in the pockets of 20 different people. To get the mean, you first put all of the money into one big pot, and then divide it evenly among all 20 people. They had started out with different amounts of money in their pockets, but now each person has $3.15, the average amount.


Computation: We could instead have found the mean by applying the function np.mean to the list x_list. Differences in rounding lead to answers that differ very slightly.

np.mean(x_list)

3.1499999999999999

Or we could have placed the list in a table, and used the Table method mean:

x_table = Table().with_column('value', x_list)

x_table['value'].mean()

3.1499999999999999

The mean and the histogram

Another way to calculate the mean is to first arrange the list in a distribution table. The columns contain the distinct values in the list, the count of each value, and the corresponding proportion. The sum of the column count is 20, the number of entries in the list.

values = Table(['value', 'count']).with_rows([
    [2, 6],
    [3, 8],
    [4, 3],
    [5, 3]
])

counts = values.column('count')

x_dist = values.with_column('proportion', counts / sum(counts))

# the distribution table
x_dist


value  count  proportion
2      6      0.3
3      8      0.4
4      3      0.15
5      3      0.15

The sum of the numbers in the list can be calculated by multiplying the first two columns and adding up – six 2's, eight 3's, and so on – and then dividing by 20:

sum(x_dist.column('value') * x_dist.column('count')) / sum(x_dist.column('count'))

3.1499999999999999

This is the same as multiplying the first and third columns – the values and their proportions – and adding up:

sum(x_dist.column('value') * x_dist.column('proportion'))

3.1500000000000004

We now have another way to think about the mean: The mean of a list is the weighted average of the distinct values, where the weights are the proportions in which those values appear.
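The weighted-average view can also be checked with NumPy, whose np.average function accepts a weights argument. This uses the values and proportions from the distribution table above.

```python
import numpy as np

# The mean as a weighted average of the distinct values.
values = np.array([2, 3, 4, 5])
proportions = np.array([0.3, 0.4, 0.15, 0.15])

# Multiply values by their proportions and add up ...
print(np.sum(values * proportions))

# ... which is what np.average does when given weights.
print(np.average(values, weights=proportions))
```

Both expressions return 3.15, the same mean as before.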

Physically, this implies that the average is the balance point of the histogram. Here is the histogram of the distribution. Imagine that it is made out of cardboard and that you are trying to balance the cardboard figure at a point on the horizontal axis. The average of 3.15 is where the figure will balance.

To understand why that is, you will have to study some physics. Balance points, or centers of gravity, can be calculated in exactly the way we calculated the mean from the distribution table.

x_dist.select(['value', 'proportion']).hist(counts='value', bins=np.arange(0.5, 10.6, 1))


Notice that the mean of a list depends only on the distinct values and their proportions; you do not need to know how many entries there are in the list.

In other words, the average is a property of the histogram. If two lists have the same histogram, they will also have the same average.

The mean and the median

If a student's score on a test is below average, does that imply that the student is in the bottom half of the class on that test?

Happily for the student, the answer is, "Not necessarily." The reason has to do with the relation between the average, which is the balance point of the histogram, and the median, which is the "half-way point" of the data.

Let's compare the balance point of the histogram to the point that has 50% of the area of the histogram on either side of it.

The relationship is easy to see in a simple example. Here is the list 1, 2, 2, 3, represented as a distribution table called sym for "symmetric".

sym = Table().with_columns([
    'value', [1, 2, 3],
    'dist', [0.25, 0.5, 0.25]
])

sym


value  dist
1      0.25
2      0.5
3      0.25

The histogram (or a calculation) shows that the average and median are both 2.

sym.hist(counts='value', bins=np.arange(0.5, 10.6, 1))

In general, for symmetric distributions, the mean and the median are equal.

What if the distribution is not symmetric? We will explore this by making a change in the histogram above: we will take the bar at the value 3 and slide it over to the value 10.

# Average versus median:
# Balance point versus 50% point

slide = Table().with_columns([
    'value', [1, 2, 3, 10],
    'dist1', [0.25, 0.5, 0.25, 0],
    'dist2', [0.25, 0.5, 0, 0.25]
])

slide


value  dist1  dist2
1      0.25   0.25
2      0.5    0.5
3      0.25   0
10     0      0.25

slide.hist(counts='value', bins=np.arange(0.5, 10.6, 1))

The blue histogram represents the original distribution; the gold histogram starts out the same as the blue at the left end, but its rightmost bar has slid over to the value 10. The brown part is where the two histograms overlap.

The median and mean of the blue distribution are both equal to 2. The median of the gold distribution is also equal to 2; the area on either side of 2 is still 50%, though the right half is distributed differently from the left.

But the mean of the gold distribution is not 2: the gold histogram would not balance at 2. The balance point has shifted to the right, to 3.75.

sum(slide['value'] * slide['dist2'])

3.75

In the gold distribution, 3 out of 4 entries (75%) are below average. The student with a below average score can therefore take heart. He or she might be in the majority of the class.


The mean, the median, and skewed distributions: In general, if the histogram has a tail on one side (the formal term is "skewed"), then the mean is pulled away from the median in the direction of the tail.
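A tiny made-up list shows the pull of a tail: one very large value drags the mean well above the median, while the median stays at the half-way point.

```python
import numpy as np

# A right-skewed list: the single large value is the "tail".
skewed = np.array([1, 2, 2, 3, 3, 3, 4, 4, 60])

print(np.median(skewed))   # 3.0: the half-way point ignores the tail
print(np.mean(skewed))     # about 9.1: the balance point is dragged right
```

Eight of the nine entries are below this mean, just as most entries of the gold distribution above were below its mean.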

Example: Flight Delays. The Bureau of Transportation Statistics provides data on departures and arrivals of flights in the United States. The table ua contains the departure delay times, in minutes, of about 37,500 United Airlines flights over the past few years. A negative delay means that the flight left earlier than scheduled.

ontime = Table.read_table('airline_ontime.csv')

ontime.show(3)

Carrier  Departure Delay  Arrival Delay
AA       -1               -1
AA       -1               -1
AA       -1               -1

... (457010 rows omitted)

ua = ontime.where('Carrier', 'UA').select(1)

ua.show(3)

Departure Delay
-1
-1
-1

... (37360 rows omitted)

ua_bins = np.arange(-15, 121, 2)

ua.hist(bins=ua_bins, unit='minute')


The histogram of the delay times is right-skewed: it has a long right hand tail.

The mean gets pulled away from the median in the direction of the tail. So we expect the mean delay to be larger than the median, and that is indeed the case:

np.mean(ua['Departure Delay'])

13.885555228434548

np.median(ua['Departure Delay'])

2.0


Spread

Variability

The Rough Size of Deviations from Average

As we have seen, inference based on test statistics must take into account the way in which the statistics vary across samples. It is therefore important to be able to quantify the variability in any list of numbers. One way is to create a measure of the difference between the values in the list and the average of the list.

We will start by defining such a measure in the context of a simple array of just four numbers.

numbers = np.array([1, 2, 2, 10])

numbers

array([ 1,  2,  2, 10])

The goal is to get a measure of roughly how far off the numbers in the list are from the average. To do this, we need the average, and all the deviations from the average. A "deviation from average" is just a value minus the average.

# Step 1. The average.
average = np.mean(numbers)
average

3.75


# Step 2. The deviations from average.
deviations = numbers - average

numbers_and_deviations = Table().with_columns([
    'value', numbers,
    'deviation from average', deviations
])

numbers_and_deviations

value  deviation from average
1      -2.75
2      -1.75
2      -1.75
10     6.25

Some of the deviations are negative; those correspond to values that are below average. Positive deviations correspond to above-average values.

To calculate roughly how big the deviations are, it is natural to compute the mean of the deviations. But something interesting happens when all the deviations are added together:

sum(deviations)

0.0

The positive deviations exactly cancel out the negative ones. This is true of all lists of numbers, no matter what the histogram of the list looks like: the sum of the deviations from average is zero.

Since the sum of the deviations is 0, the mean of the deviations will be 0 as well:

np.mean(deviations)

0.0


Because of this, the mean of the deviations is not a useful measure of the size of the deviations. What we really want to know is roughly how big the deviations are, regardless of whether they are positive or negative. So we need a way to eliminate the signs of the deviations.

There are two time-honored ways of losing signs: the absolute value, and the square. It turns out that taking the square constructs a measure with extremely powerful properties, some of which we will study in this course.

So let us eliminate the signs by squaring all the deviations. Then we will take the mean of the squares:

# Step 3. The squared deviations from average
squared_deviations = deviations ** 2

numbers_and_deviations.with_column('squared deviations from average', squared_deviations)

value  deviation from average  squared deviations from average
1      -2.75                   7.5625
2      -1.75                   3.0625
2      -1.75                   3.0625
10     6.25                    39.0625

# Variance = the mean squared deviation from average
variance = np.mean(squared_deviations)
variance

13.1875

Variance: The mean squared deviation is called the variance of a sequence of values. While it does give us an idea of spread, it is not on the same scale as the original variable and its units are the square of the original. This makes interpretation very difficult. So we return to the original scale by taking the positive square root of the variance:


# Standard Deviation: root mean squared deviation from average
# Steps of calculation: 5 4 3 2 1
sd = variance ** 0.5
sd

3.6314597615834874

Standard Deviation

The quantity that we have just computed is called the standard deviation of the list, and is abbreviated as SD. It measures roughly how far the numbers on the list are from their average.

Definition. The SD of a list is defined as the root mean square of deviations from average. That's a mouthful. But read it from right to left and you have the sequence of steps in the calculation.

Computation. The NumPy function np.std computes the SD of values in an array:

np.std(numbers)

3.6314597615834874

Working with the SD

Let us now examine the SD in the context of a more interesting dataset. The table nba contains data on the players in the National Basketball Association (NBA) in 2013. For each player, the table records the position at which the player usually played, his height in inches, weight in pounds, and age in years.

nba = Table.read_table("nba2013.csv")

nba


Name             Position  Height  Weight  Age in 2013
DeQuan Jones     Guard     80      221     23
Darius Miller    Guard     80      235     23
Trevor Ariza     Guard     80      210     28
James Jones      Guard     80      215     32
Wesley Johnson   Guard     79      215     26
Klay Thompson    Guard     79      205     23
Thabo Sefolosha  Guard     79      215     29
Chase Budinger   Guard     79      218     25
Kevin Martin     Guard     79      185     30
Evan Fournier    Guard     79      206     20

... (495 rows omitted)

Here is a histogram of the players' heights.

nba.select('Height').hist(bins=np.arange(68, 88, 1))

It is no surprise that NBA players are tall! Their average height is just over 79 inches (6'7"), about 10 inches taller than the average height of men in the United States.

mean_height = np.mean(nba.column('Height'))

mean_height


79.065346534653472

About how far off are the players' heights from the average? This is measured by the SD of the heights, which is about 3.45 inches.

sd_height = np.std(nba.column('Height'))

sd_height

3.4505971830275546

The towering center Hasheem Thabeet of the Oklahoma City Thunder was the tallest player at a height of 87 inches.

nba.sort('Height', descending=True).show(3)

Name             Position  Height  Weight  Age in 2013
Hasheem Thabeet  Center    87      263     26
Roy Hibbert      Center    86      278     26
Tyson Chandler   Center    85      235     30

... (502 rows omitted)

Thabeet was about 8 inches above the average height.

87 - mean_height

7.9346534653465284

That's a deviation from average, and it is about 2.3 times the standard deviation:

(87 - mean_height) / sd_height

2.2995015194397923

In other words, the height of the tallest player was about 2.3 SDs above average.


At 69 inches tall, Isaiah Thomas was one of the two shortest players in the NBA. His height was about 2.9 SDs below average.

nba.sort('Height').show(3)

Name            Position  Height  Weight  Age in 2013
Isaiah Thomas   Guard     69      185     24
Nate Robinson   Guard     69      180     29
John Lucas III  Guard     71      157     30

... (502 rows omitted)

(69 - mean_height) / sd_height

-2.9169868288775844

What we have observed is that the tallest and shortest players were both just a few SDs away from the average height. This is an example of why the SD is a useful measure of spread. No matter what the shape of the histogram, the average and the SD together tell you a lot about the data.

First main reason for measuring spread by the SD

Informal statement. For all lists, the bulk of the entries are within the range "average ± a few SDs".

Try to resist the desire to know exactly what fuzzy words like "bulk" and "few" mean. We will make them precise later in this section. For now, let us examine the statement in the context of some examples.

We have already seen that all of the heights of the NBA players were in the range "average ± 3 SDs".

What about the ages? Here is a histogram of the distribution. Yes, there was a 15-year-old player in the NBA in 2013.

nba.select('Age in 2013').hist(bins=np.arange(15, 45, 1))


Juwan Howard was the oldest player, at 40. His age was about 3.2 SDs above average.

nba.sort('Age in 2013', descending=True).show(3)

Name          Position  Height  Weight  Age in 2013
Juwan Howard  Forward   81      250     40
Marcus Camby  Center    83      235     39
Derek Fisher  Guard     73      210     39

... (502 rows omitted)

ages = nba['Age in 2013']

(40 - np.mean(ages)) / np.std(ages)

3.1958482778922357

The youngest was 15-year-old Jarvis Varnado, who won the NBA Championship that year with the Miami Heat. His age was about 2.6 SDs below average.

nba.sort('Age in 2013').show(3)


Name                   Position  Height  Weight  Age in 2013
Jarvis Varnado         Forward   81      230     15
Giannis Antetokounmpo  Forward   81      205     18
Sergey Karasev         Guard     79      197     19

... (502 rows omitted)

(15 - np.mean(ages)) / np.std(ages)

-2.5895811038670811

What we have observed for the heights and ages is true in great generality. For all lists, the bulk of the entries are no more than 2 or 3 SDs away from the average.

These rough statements about deviations from average are made precise by the following result.

Chebychev's inequality

For all lists, and all positive numbers z, the proportion of entries that are in the range "average ± z SDs" is at least 1 - 1/z².

What makes this result powerful is that it is true for all lists – all distributions, no matter how irregular.

Specifically, Chebychev's inequality says that for every list:

the proportion in the range "average ± 2 SDs" is at least 1 - 1/4 = 0.75

the proportion in the range "average ± 3 SDs" is at least 1 - 1/9 ≈ 0.89

the proportion in the range "average ± 4.5 SDs" is at least 1 - 1/4.5² ≈ 0.95

Chebychev's Inequality gives a lower bound, not an exact answer or an approximation. For example, the percent of entries in the range "average ± 2 SDs" might be quite a bit larger than 75%. But it cannot be smaller.
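The "for all lists" claim can be stress-tested on a deliberately irregular, made-up list: a skewed bulk plus a cluster of values far away. Whatever the shape, at least 1 - 1/z² of the entries fall within z SDs of the average.

```python
import numpy as np

# An irregular list: a skewed bulk plus a far-away cluster.
rng = np.random.default_rng(0)
data = np.concatenate([rng.exponential(5, 1000), rng.normal(50, 1, 200)])

mean, sd = np.mean(data), np.std(data)

for z in [2, 3, 4.5]:
    within = np.mean(np.abs(data - mean) < z * sd)
    bound = 1 - 1 / z**2
    assert within >= bound           # Chebychev's guarantee always holds
    print(z, round(within, 3), '>=', round(bound, 3))
```

For z = 2 the actual proportion here is well above the 0.75 floor but nowhere near 1, showing that the bound is a floor rather than an approximation.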

Standard units

In the calculations above, z measures standard units, the number of standard deviations above average.


Some values of standard units are negative, corresponding to original values that are below average. Other values of standard units are positive. But no matter what the distribution of the list looks like, Chebychev's Inequality says that standard units will typically be in the (-5, 5) range.

To convert a value to standard units, first find how far it is from average, and then compare that deviation with the standard deviation.

As we will see, standard units are frequently used in data analysis. So it is useful to define a function that converts an array of numbers to standard units.

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)

As we saw in an earlier section, the table ua contains a column Departure Delay consisting of the departure delay times, in minutes, of over 37,000 United Airlines flights. We can create a new column called z by applying the function standard_units to the column of delay times. This allows us to see all the delay times in minutes as well as their corresponding values in standard units.

ua.append_column('z', standard_units(ua.column('Departure Delay')))

ua


Departure Delay  z
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942

... (37353 rows omitted)

Something rather alarming happens when we sort the delay times from highest to lowest. The standard units are extremely high!

ua.sort(0, descending=True)

Departure Delay  z
886              22.9631
739              19.0925
678              17.4864
595              15.301
566              14.5374
558              14.3267
543              13.9318
541              13.8791
537              13.7738
513              13.1419

... (37353 rows omitted)

What this shows is that it is possible for data to be many SDs above average (and for flights to be delayed by almost 15 hours). The highest value is more than 22 in standard units.


However, the proportion of these extreme values is tiny, and the bounds in Chebychev's inequality still hold true. For example, let us calculate the percent of delay times that are in the range "average ± 2 SDs". This is the same as the percent of times for which the standard units are in the range (-2, 2). That is just about 96%, as computed below, consistent with Chebychev's bound of "at least 75%".

within_2_sd = np.logical_and(ua.column('z') > -2, ua.column('z') < 2)

ua.where(within_2_sd).num_rows / ua.num_rows

0.960522441988063

The histogram of delay times is shown below, with the horizontal axis in standard units. By the table above, the right hand tail continues all the way out to 22 standard units (886 minutes). The area of the histogram outside the range -2 to 2 is about 4%, put together in tiny little bits that are mostly invisible in a histogram.

ua.hist('z', bins=np.arange(-5, 23, 1), unit='standard unit')


The Normal Distribution

We know that the mean is the balance point of the histogram. Unlike the mean, the SD is usually not easy to identify by looking at the histogram.

However, there is one shape of distribution for which the SD is almost as clearly identifiable as the mean. This section examines that shape, as it appears frequently in probability histograms and also in many histograms of data.

The SD and the bell curve

The table births contains data on over 1,000 babies born at a hospital in the Bay Area. The columns contain several variables: the baby's birth weight in ounces; the number of gestational days; the mother's age in years, height in inches, and pregnancy weight in pounds; and a record of whether she smoked or not.

births = Table.read_table('baby.csv')

births

Birth Weight  Gestational Days  Maternal Age  Maternal Height  Maternal Pregnancy Weight  Maternal Smoker
120           284               27            62               100                        False
113           282               33            64               135                        False
128           279               28            64               115                        True
108           282               23            67               125                        True
136           286               25            62               93                         False
138           244               33            62               178                        False
132           245               23            65               140                        False
120           289               25            62               125                        False
143           299               30            66               136                        True
140           351               27            68               120                        False

... (1164 rows omitted)


The mothers' heights have a mean of 64 inches and an SD of 2.5 inches. Unlike the heights of the basketball players, the mothers' heights are distributed fairly symmetrically about the mean in a bell-shaped curve.

heights = births.column('Maternal Height')

mean_height = np.round(np.mean(heights), 1)

mean_height

64.0

sd_height = np.round(np.std(heights), 1)

sd_height

2.5

births.hist('Maternal Height', bins=np.arange(55.5, 72.5, 1), unit='inch')

positions = np.arange(-3, 3.1, 1) * sd_height + mean_height

_ = plots.xticks(positions)

The last two lines of code in the cell above change the labeling of the horizontal axis. Now, the labels correspond to "average ± z SDs" for z = 0, ±1, ±2, and ±3. Because of the shape of the distribution, the "center" has an unambiguous meaning and is clearly visible at 64.


To see how the SD is related to the curve, start at the top of the curve and look towards the right. Notice that there is a place where the curve changes from looking like an "upside-down cup" to a "right-way-up cup"; formally, the curve has a point of inflection. That point is one SD above average. It is the point 64 + 2.5 = 66.5, which is "average plus 1 SD" = 66.5 inches.

Symmetrically on the left-hand side of the mean, the point of inflection is at 64 - 2.5 = 61.5, that is, "average minus 1 SD" = 61.5 inches.

How to spot the SD on a bell-shaped curve: For bell-shaped distributions, the SD is the distance between the mean and the points of inflection on either side.
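The inflection-point rule can be checked numerically for an ideal bell curve with the mothers' mean (64 inches) and SD (2.5 inches): a point of inflection is where the second derivative changes sign, and for this curve that happens exactly one SD on either side of the mean.

```python
import numpy as np

# Bell curve with the mothers' mean and SD.
m, s = 64, 2.5
x = np.arange(55, 73, 0.001)
bell = np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

# Approximate the second derivative and find where its sign flips.
second_deriv = np.gradient(np.gradient(bell, x), x)
flips = x[np.where(np.diff(np.sign(second_deriv)) != 0)[0]]

print(np.round(flips, 1))   # close to 61.5 and 66.5, one SD from the mean
```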

Bell-shaped probability histograms

The reason for studying bell-shaped curves is not just that they appear as some distributions of data, but that they appear as approximations to probability histograms of sample averages.

We have seen this phenomenon already, in the example about estimating the total number of aircraft. Recall that we simulated random samples of size 30 drawn with replacement from the serial numbers 1, 2, 3, ..., 300, and computed two different quantities: the largest of the 30 numbers observed, and twice the average of the 30 numbers. Here are the generated histograms.

N = 300

sample_size = 30

repetitions = 5000

serialno = Table().with_column('serial number', np.arange(1, N+1))

new_est = Table(['i', 'max', '2*avg'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, max(sample.column(0)), 2 * np.average(sample.column(0))])

every_ten = np.arange(1, N+100, 10)

new_est.select(['max', '2*avg']).hist(bins=every_ten)


We observed that the distribution of "twice the sample average" is roughly symmetric; now let's also observe that it is roughly bell-shaped.

If the distribution of "twice the average" is bell-shaped, so is the distribution of the average. Here is the distribution of the average of a sample of size 30 drawn from the numbers 1, 2, 3, ..., 300. The number of replications of the sampling procedure is large, so it is safe to assume that the empirical histogram resembles the probability histogram of the sample average.

sample_means = Table(['Sample mean'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    sample_means.append([np.average(sample.column(0))])

every_five = np.arange(100.5, 200.6, 5)

sample_means.hist(bins=every_five)


We can draw this histogram with a red bell-shaped curve superposed. This curve is generated from the mean and standard deviation of the sample means. Don't worry about the code that draws the curve; just focus on the diagram itself.

# Import a module of standard statistical functions
from scipy import stats

sample_means.hist(bins=every_five)

means = sample_means['Sample mean']
m = np.mean(means)
s = np.std(means)
x = np.arange(100.5, 200.6, 1)
plots.plot(x, stats.norm.pdf(x, m, s), color='r', lw=1)

positions = np.arange(-3, 3.1, 1) * 15 + 150
_ = plots.xticks(positions, [int(y) for y in positions])


Since all the numbers drawn are between 1 and 300, we expect the sample mean to be around 150, which is where the histogram is centered.

Can you guess what the SD is? Run your eye along the curve till you feel you are close to the point of inflection to the right of center. If you guessed that the point is at about 165, you're correct – it's between 165 and 166, and symmetrically between 134 and 135 to the left of center.

The probability histogram of a random sample average

What is notable about this is not just that we can spot the SD, but that the probability distribution of the statistic is bell-shaped in the first place. There is a remarkable piece of mathematics called the Central Limit Theorem that shows that the distribution of the average of a large random sample will be roughly normal, no matter what the distribution of the population from which the sample is drawn.

In our example about serial numbers, each of the 300 serial numbers is equally likely to be drawn each time, so the distribution of the population is uniform and the probability histogram looks like a brick.

serialno.hist(bins=np.arange(0.5, 300.5, 1))

Let's test out the Central Limit Theorem by drawing samples from quite a different distribution. Here is the histogram of over 37,000 flight delay times in the table ua. It looks nothing like the uniform distribution above.

plotrange = np.arange(-15, 121, 2)
ua.hist(bins=plotrange, unit='minute')


We will now look at the probability distribution of the average of 625 flight times drawn at random from this population. The draws will be made with replacement, as they were in the example about serial numbers.

sample_size = 625
sample_means = Table(['Sample mean'])
for i in np.arange(repetitions):
    sample = ua.sample(sample_size, with_replacement=True)
    sample_means.append([np.average(sample.column(0))])

sample_means.hist(bins=np.arange(9, 19.1, 0.5), unit='minute')

# Draw a red bell-shaped curve from the mean and SD of the sample means
means = sample_means['Sample mean']
m = np.mean(means)
s = np.std(means)
x = np.arange(9, 19.1, 0.1)
_ = plots.plot(x, stats.norm.pdf(x, np.mean(ua[0]), np.std(ua[0]) / 25), color='r', lw=1)


The probabilities for the sample average are roughly bell-shaped, consistent with the Central Limit Theorem.

The average delay time of all the flights in the population is about 14 minutes, which is where the histogram is centered. To spot the SD, it will help to note that the bars are half a minute wide. By eye, the points of inflection look to be about three bars away from the center. So we would guess that the SD is about 1.5 minutes, and in fact that is not far from the exact value.

What is the exact value? We will answer that question later in the term. Remarkably, it can be calculated easily based on the sample size and the SD of the population, without doing any simulations at all.

For now, here is the main point to note.

Second main reason for using the SD to measure spread

For a large random sample, the probability distribution of the sample average will be roughly bell-shaped, with a mean and SD that can be easily identified, no matter what the distribution of the population from which the sample is drawn.
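As a preview of that later calculation, the rule in question is the square root law: the SD of the sample average is the population SD divided by the square root of the sample size. A simulation sketch, using an invented right-skewed "delay" population (the real ua table isn't available here):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical right-skewed population, standing in for the flight delay data
population = rng.exponential(scale=14.0, size=37000)

n = 625
sample_means = np.array([
    rng.choice(population, n, replace=True).mean()
    for _ in range(2000)
])

predicted_sd = np.std(population) / np.sqrt(n)   # square root law
observed_sd = np.std(sample_means)
```

The observed SD of the 2000 simulated sample means comes out very close to the predicted value of population SD / 25.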

The standard normal curve

The bell-shaped curves above look essentially all the same apart from the labels that we have placed on the horizontal axes. Indeed, there is really just one basic curve from which all of these curves can be drawn just by relabeling the axes appropriately.

To draw that basic curve, we will use the units into which we can convert every list: standard units. The resulting curve is therefore called the standard normal curve.
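The standard_units function is used throughout this text but not defined in this excerpt; a minimal version consistent with how it is used (subtract the mean, divide by the SD) would be:

```python
import numpy as np

def standard_units(values):
    """Convert an array of numbers to standard units:
    how many SDs each value is above (+) or below (-) the mean."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)
```

Any list converted this way has mean 0 and SD 1, which is why the horizontal axis of the standard normal curve is measured in these units.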


The standard normal curve has an impressive equation. But for now, it is best to think of it as a smoothed version of a histogram of a variable that has been measured in standard units.

# The standard normal probability density function (pdf)
plot_normal_cdf()
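For the record, the "impressive equation" is the standard normal probability density:

```latex
\phi(z) \;=\; \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}, \qquad -\infty < z < \infty
```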

As always when you examine a new histogram, start by looking at the horizontal axis. On the horizontal axis of the standard normal curve, the values are standard units. Here are some properties of the curve. Some are apparent by observation, and others require a considerable amount of mathematics to establish.

The total area under the curve is 1. So you can think of it as a histogram drawn to the density scale.

The curve is symmetric about 0. So if a variable has this distribution, its mean and median are both 0.

The points of inflection of the curve are at -1 and +1.

If a variable has this distribution, its SD is 1. The normal curve is one of the very few distributions that has an SD so clearly identifiable on the histogram.
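The claim that the points of inflection sit at -1 and +1 is one of the facts that "require mathematics to establish"; it follows from differentiating the density twice:

```latex
\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}, \qquad
\phi'(z) = -z\,\phi(z), \qquad
\phi''(z) = (z^{2}-1)\,\phi(z)
```

Since \(\phi(z) > 0\) everywhere, \(\phi''(z)\) changes sign exactly at \(z = \pm 1\), which is where the curve switches between curving downward and curving upward.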

Since we are thinking of the curve as a smoothed histogram, we will want to represent proportions of the total amount of data by areas under the curve. Areas under smooth curves are often found by calculus, using a method called integration. It is a fact of mathematics, however, that the standard normal curve cannot be integrated in any of the usual ways of calculus. Therefore, areas under the curve have to be approximated. That is why almost all statistics textbooks carry tables of areas under the normal curve. It is also why all statistical systems, including the stats module of SciPy, include methods that provide excellent approximations to those areas.
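Those approximations need not be mysterious. A crude version of what stats.norm.cdf computes is just a Riemann sum over the density; the function name and step size below are my own choices, not part of any library:

```python
import math

def normal_area_below(z, lo=-10.0, step=1e-4):
    """Approximate the area under the standard normal curve to the left of z,
    using a midpoint Riemann sum over [lo, z]."""
    total = 0.0
    x = lo
    while x < z:
        mid = x + step / 2
        total += math.exp(-mid * mid / 2) / math.sqrt(2 * math.pi) * step
        x += step
    return total
```

normal_area_below(1) agrees with the value of stats.norm.cdf(1) shown below (about 0.8413) to several decimal places.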

Areas under the normal curve

The standard normal cumulative distribution function (cdf)

The fundamental function for finding areas under the normal curve is stats.norm.cdf. It takes a numerical argument and returns all the area under the curve to the left of that number. Formally, it is called the "cumulative distribution function" of the standard normal curve. That rather unwieldy mouthful is abbreviated as cdf.

Let us use this function to find the area to the left of 1 under the standard normal curve.

# Area under the standard normal curve, below 1
plot_normal_cdf(1)

stats.norm.cdf(1)

0.84134474606854293

That's about 84%. We can now use the symmetry of the curve and the fact that the total area under the curve is 1 to find other areas.


For example, the area to the right of 1 is about 100% - 84% = 16%.

# Area under the standard normal curve, above 1
plot_normal_cdf(lbound=1)

1 - stats.norm.cdf(1)

0.15865525393145707

The area between -1 and 1 can be computed in several different ways.

# Area under the standard normal curve, between -1 and 1
plot_normal_cdf(1, lbound=-1)


For example, we can calculate the area as "100% - two equal tails", which works out to roughly 100% - 2 x 16% = 68%.

Or we could note that the area between -1 and 1 is equal to all the area to the left of 1, minus all the area to the left of -1.

stats.norm.cdf(1) - stats.norm.cdf(-1)

0.68268949213708585

By a similar calculation, we see that the area between -2 and 2 is about 95%.

# Area under the standard normal curve, between -2 and 2
plot_normal_cdf(2, lbound=-2)


stats.norm.cdf(2) - stats.norm.cdf(-2)

0.95449973610364158

In other words, if a histogram is roughly bell shaped, the proportion of data in the range "average ± 2 SDs" is about 95%.

That is quite a bit more than Chebychev's lower bound of 75%. Chebychev's bound is weaker because it has to work for all distributions. If we know that a distribution is normal, we have good approximations to the proportions, not just bounds.

The table below compares what we know about all distributions and about normal distributions. Notice that Chebychev's bound is not very illuminating when z = 1.

Percent in Range | All Distributions | Normal Distribution
average ± 1 SD | at least 0% | about 68%
average ± 2 SDs | at least 75% | about 95%
average ± 3 SDs | at least 88.888...% | about 99.73%
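Both columns of the comparison can be reproduced directly: the normal proportions from the error function, and Chebychev's column from its formula. The function names below are my own, not from any library:

```python
import math

def normal_within(z):
    """Proportion of a normal distribution within z SDs of the mean."""
    return math.erf(z / math.sqrt(2))

def chebyshev_bound(z):
    """Chebychev's lower bound on that proportion, valid for every distribution."""
    return max(0.0, 1 - 1 / z**2)

for z in (1, 2, 3):
    print(z, chebyshev_bound(z), normal_within(z))
```

At z = 1 the bound is 0, which is why it is "not very illuminating" there, while the normal proportion is already about 68%.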


Exploration: Privacy

Section author: David Wagner


License plates

We're going to look at some data collected by the Oakland Police Department. They have automated license plate readers on their police cars, and they've built up a database of license plates that they've seen -- and where and when they saw each one.

Data collection

First, we'll gather the data. It turns out the data is publicly available on the Oakland public records site. I downloaded it and combined it into a single CSV file myself before lecture.

lprs = Table.read_table('./all-lprs.csv.gz', compression='gzip', sep=',')

lprs.labels

('red_VRM', 'red_Timestamp', 'Location')

Let's start by renaming some columns, and then take a look at it.

lprs.relabel('red_VRM', 'Plate')
lprs.relabel('red_Timestamp', 'Timestamp')
lprs


Plate | Timestamp | Location
1275226 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
27529C | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
1158423 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
1273718 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
1077682 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
1214195 | 01/19/2011 02:06:00 AM | (37.798281000000003, -122.27575299999999)
1062420 | 01/19/2011 02:06:00 AM | (37.79833, -122.27574300000001)
1319726 | 01/19/2011 02:05:00 AM | (37.798475000000003, -122.27571500000001)
1214196 | 01/19/2011 02:05:00 AM | (37.798499999999997, -122.27571)
75227 | 01/19/2011 02:05:00 AM | (37.798596000000003, -122.27569)
... (2742091 rows omitted)

Phew, that's a lot of data: we can see about 2.7 million license plate reads here.

Let's start by seeing what can be learned about someone, using this data -- assuming you know their license plate.

Stalking Jean Quan

As a warmup, we'll take a look at ex-Mayor Jean Quan's car, and where it has been seen. Her license plate number is 6FCH845. (How did I learn that? Turns out she was in the news for getting $1000 of parking tickets, and the news article included a picture of her car, with the license plate visible. You'd be amazed by what's out there on the Internet...)

lprs.where('Plate', '6FCH845')


Plate | Timestamp | Location
6FCH845 | 11/01/2012 09:04:00 AM | (37.79871, -122.276221)
6FCH845 | 10/24/2012 11:15:00 AM | (37.799695, -122.274868)
6FCH845 | 10/24/2012 11:01:00 AM | (37.799693, -122.274806)
6FCH845 | 10/24/2012 10:20:00 AM | (37.799735, -122.274893)
6FCH845 | 05/08/2014 07:30:00 PM | (37.797558, -122.26935)
6FCH845 | 12/31/2013 10:09:00 AM | (37.807556, -122.278485)

OK, so her car shows up 6 times in this dataset. However, it's hard to make sense of those coordinates. I don't know about you, but I can't read GPS so well.

So, let's work out a way to show where her car has been seen on a map. We'll need to extract the latitude and longitude, as the data isn't quite in the format that the mapping software expects: the mapping software expects the latitude to be in one column and the longitude in another. Let's write some Python code to do that, by splitting the Location string into two pieces: the stuff before the comma (the latitude) and the stuff after (the longitude).

def get_latitude(s):
    before, after = s.split(',')          # Break it into two parts
    lat_string = before.replace('(', '')  # Get rid of the annoying '('
    return float(lat_string)              # Convert the string to a number

def get_longitude(s):
    before, after = s.split(',')                   # Break it into two parts
    long_string = after.replace(')', '').strip()   # Get rid of the ')' and spaces
    return float(long_string)                      # Convert the string to a number

Let's test it to make sure it works correctly.

get_latitude('(37.797558, -122.26935)')

37.797558

get_longitude('(37.797558, -122.26935)')

-122.26935

Good, now we're ready to add these as extra columns to the table.

lprs = lprs.with_columns([
    'Latitude', lprs.apply(get_latitude, 'Location'),
    'Longitude', lprs.apply(get_longitude, 'Location')
])
lprs

Plate | Timestamp | Location | Latitude | Longitude
1275226 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
27529C | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
1158423 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
1273718 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
1077682 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
1214195 | 01/19/2011 02:06:00 AM | (37.798281000000003, -122.27575299999999) | 37.7983 | -122.276
1062420 | 01/19/2011 02:06:00 AM | (37.79833, -122.27574300000001) | 37.7983 | -122.276
1319726 | 01/19/2011 02:05:00 AM | (37.798475000000003, -122.27571500000001) | 37.7985 | -122.276
1214196 | 01/19/2011 02:05:00 AM | (37.798499999999997, -122.27571) | 37.7985 | -122.276
75227 | 01/19/2011 02:05:00 AM | (37.798596000000003, -122.27569) | 37.7986 | -122.276
... (2742091 rows omitted)

And at last, we can draw a map with a marker everywhere that her car has been seen.

jean_quan = lprs.where('Plate', '6FCH845').select(['Latitude', 'Longitude'])
Marker.map_table(jean_quan)


OK, so it's been seen near the Oakland police department. This should make you suspect we might be getting a bit of a biased sample. Why might the Oakland PD be the most common place where her car is seen? Can you come up with a plausible explanation for this?

Poking around

Let's try another. And let's see if we can make the map a little more fancy. It'd be nice to distinguish between license plate reads that are seen during the daytime (on a weekday), vs the evening (on a weekday), vs on a weekend. So we'll color-code the markers. To do this, we'll write some Python code to analyze the Timestamp and choose an appropriate color.


import datetime

def get_color(ts):
    t = datetime.datetime.strptime(ts, '%m/%d/%Y %I:%M:%S %p')
    if t.weekday() >= 6:
        return 'green'  # Weekend
    elif t.hour >= 6 and t.hour <= 17:
        return 'blue'   # Weekday daytime
    else:
        return 'red'    # Weekday evening

lprs.append_column('Color', lprs.apply(get_color, 'Timestamp'))
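A quick way to sanity-check the color-coding is to call get_color on hand-made timestamps. The function is restated here so the snippet is self-contained; note, as a quirk of the rule above, that weekday() >= 6 matches only Sundays, so Saturdays get weekday colors:

```python
import datetime

def get_color(ts):
    t = datetime.datetime.strptime(ts, '%m/%d/%Y %I:%M:%S %p')
    if t.weekday() >= 6:
        return 'green'  # Sunday
    elif t.hour >= 6 and t.hour <= 17:
        return 'blue'   # weekday daytime
    else:
        return 'red'    # weekday evening / night

print(get_color('01/19/2011 02:06:00 AM'))  # a Wednesday at 2 AM
```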

Now we can check out another license plate, this time with our spiffy color-coding. This one happens to be the car that the city issues to the Fire Chief.

t = lprs.where('Plate', '1328354').select(['Latitude', 'Longitude', 'Timestamp', 'Color'])
Marker.map_table(t)


Hmm. We can see a blue cluster in downtown Oakland, where the Fire Chief's car was seen on weekdays during business hours. I bet we've found her office. In fact, if you happen to know downtown Oakland, those are mostly clustered right near City Hall. Also, her car was seen twice in northern Oakland on weekday evenings. One can only speculate what that indicates. Maybe dinner with a friend? Or running errands? Off to the scene of a fire? Who knows. And then the car has been seen once more, late at night on a weekend, in a residential area in the hills. Her home address, maybe?

Let's look at another. This time, we'll make a function to display the map.

def map_plate(plate):
    sightings = lprs.where('Plate', plate)
    t = sightings.select(['Latitude', 'Longitude', 'Timestamp', 'Color'])
    return Marker.map_table(t)

map_plate('5AJG153')


What can we tell from this? Looks to me like this person lives on International Blvd and 9th, roughly. On weekdays they've been seen in a variety of locations in west Oakland. It's fun to imagine what this might indicate -- delivery person? taxi driver? someone running errands all over the place in west Oakland?

We can look at another:

map_plate('6UZA652')


What can we learn from this map? First, it's pretty easy to guess where this person lives: 16th and International, or pretty near there. And then we can see them spending some nights and a weekend near Laney College. Did they have an apartment there briefly? A relationship with someone who lived there?

Is anyone else getting a little bit creeped out about this? I think I've had enough of looking at individual people's data.

Inference

As we can see, this kind of data can potentially reveal a fair bit about people. Someone with access to the data can draw inferences. Take a moment to think about what someone might be able to infer from this kind of data.

As we've seen here, it's not too hard to make a pretty good guess at roughly where someone lives, from this kind of information: their car is probably parked near their home most nights. Also, it will often be possible to guess where someone works: if they commute into work by car, then on weekdays during business hours, their car is probably parked near their office, so we'll see a clear cluster that indicates where they work.


But it doesn't stop there. If we have enough data, it might also be possible to get a sense of what they like to do during their downtime (do they spend time at the park?). And in some cases the data might reveal that someone is in a relationship and spending nights at someone else's house. That's arguably pretty sensitive stuff.

This gets at one of the challenges with privacy. Data that's collected for one purpose (fighting crime, or something like that) can potentially reveal a lot more. It can allow the owner of the data to draw inferences -- sometimes about things that people would prefer to keep private. And that means that, in a world of "big data", if we're not careful, privacy can be collateral damage.

Mitigation

If we want to protect people's privacy, what can be done about this? That's a lengthy subject. But at risk of over-simplifying, there are a few simple strategies that data owners can take:

1. Minimize the data they have. Collect only what they need, and delete it after it's not needed.

2. Control who has access to the sensitive data. Perhaps only a handful of trusted insiders need access; if so, then one can lock down the data so only they have access to it. One can also log all access, to deter misuse.

3. Anonymize the data, so it can't be linked back to the individual who it is about. Unfortunately, this is often harder than it sounds.

4. Engage with stakeholders. Provide transparency, to try to avoid people being taken by surprise. Give individuals a way to see what data has been collected about them. Give people a way to opt out and have their data be deleted, if they wish. Engage in a discussion about values, and tell people what steps you are taking to protect them from unwanted consequences.

This only scratches the surface of the subject. My main goal in this lecture was to make you aware of privacy concerns, so that if you are ever a steward of a large dataset, you can think about how to protect people's data and use it responsibly.


Chapter 4

Prediction

Correlation

The relation between two variables

In the previous sections, we developed several tools that help us describe the distribution of a single variable. Data science also helps us understand how multiple variables are related to each other. This allows us to predict the value of one variable given the values of others, and to get a sense of the amount of error in the prediction.

A good way to start exploring the relation between two variables is by visualization. A graph called a scatter diagram can be used to plot the values of one variable against values of another. Let us start by looking at some scatter diagrams and then move on to quantifying some of the features that we see.

The table hybrid contains data on hybrid passenger cars sold in the United States from 1997 to 2013. The data were obtained from the online data archive of Prof. Larry Winner of the University of Florida. The columns include the model of the car, year of manufacture, the MSRP (manufacturer's suggested retail price) in 2013 dollars, the acceleration rate in km per hour per second, fuel economy in miles per gallon, and the model's class.

hybrid

vehicle | year | msrp | acceleration | mpg | class
Prius (1st Gen) | 1997 | 24509.7 | 7.46 | 41.26 | Compact
Tino | 2000 | 35355 | 8.2 | 54.1 | Compact
Prius (2nd Gen) | 2000 | 26832.2 | 7.97 | 45.23 | Compact
Insight | 2000 | 18936.4 | 9.52 | 53 | Two Seater
Civic (1st Gen) | 2001 | 25833.4 | 7.04 | 47.04 | Compact
Insight | 2001 | 19036.7 | 9.52 | 53 | Two Seater
Insight | 2002 | 19137 | 9.71 | 53 | Two Seater
Alphard | 2003 | 38084.8 | 8.33 | 40.46 | Minivan
Insight | 2003 | 19137 | 9.52 | 53 | Two Seater
Civic | 2003 | 14071.9 | 8.62 | 41 | Compact


... (143 rows omitted)

The Table method scatter can be used to plot acceleration on the horizontal axis (the first argument) and msrp on the vertical (the second argument). There are 153 points in the scatter, one for each car in the table.

hybrid.scatter('acceleration', 'msrp')

The scatter of points is sloping upwards, indicating that cars with greater acceleration tended to cost more, on average; conversely, the cars that cost more tended to have greater acceleration on average.

This is an example of positive association: above-average values of one variable tend to be associated with above-average values of the other.

The scatter diagram of MSRP (vertical axis) versus mileage (horizontal axis) shows a negative association. The scatter has a clear downward trend. Hybrid cars with higher mileage tended to cost less, on average. This seems surprising till you consider that cars that accelerate fast tend to be less fuel efficient and have lower mileage. As the previous scatter showed, those were also the cars that tended to cost more.

hybrid.scatter('mpg', 'msrp')


Along with the negative association, the scatter diagram of price versus efficiency shows a non-linear relation between the two variables. The points appear to be clustered around a curve.

If we restrict the data just to the SUV class, however, the association between price and efficiency is still negative but the relation appears to be more linear. The relation between the price and acceleration of SUVs also shows a linear trend, but with a positive slope.

suv = hybrid.where('class', 'SUV')
suv.scatter('mpg', 'msrp')
suv.scatter('acceleration', 'msrp')


You will have noticed that we can derive useful information from the general orientation and shape of a scatter diagram even without paying attention to the units in which the variables were measured.

Indeed, we could plot all the variables in standard units and the plots would look the same. This gives us a way to compare the degree of linearity in two scatter diagrams.

Here are the two scatter diagrams for SUVs, with all the variables measured in standard units.

Table().with_columns([
    'mpg (standard units)', standard_units(suv.column('mpg')),
    'msrp (standard units)', standard_units(suv.column('msrp'))
]).scatter(0, 1)
plots.xlim([-3, 3])
plots.ylim([-3, 3])
None


Table().with_columns([
    'acceleration (standard units)', standard_units(suv.column('acceleration')),
    'msrp (standard units)', standard_units(suv.column('msrp'))
]).scatter(0, 1)
plots.xlim([-3, 3])
plots.ylim([-3, 3])
None

The associations that we see in these figures are the same as those we saw before. Also, because the two scatter diagrams are drawn on exactly the same scale, we can see that the linear relation in the second diagram is a little more fuzzy than in the first.

We will now define a measure that uses standard units to quantify the kinds of association that we have seen.


The correlation coefficient

The correlation coefficient measures the strength of the linear relationship between two variables. Graphically, it measures how clustered the scatter diagram is around a straight line.

The term correlation coefficient isn't easy to say, so it is usually shortened to correlation and denoted by r.

Here are some mathematical facts about r that we will observe by simulation.

The correlation coefficient r is a number between -1 and 1. r measures the extent to which the scatter plot clusters around a straight line.

r = 1 if the scatter diagram is a perfect straight line sloping upwards, and r = -1 if the scatter diagram is a perfect straight line sloping downwards.

The function r_scatter takes a value of r as its argument and simulates a scatter plot with a correlation very close to r. Because of randomness in the simulation, the correlation is not expected to be exactly equal to r.

Call r_scatter a few times, with different values of r as the argument, and see how the football-shaped cloud changes. Positive r corresponds to positive association: above-average values of one variable are associated with above-average values of the other, and the scatter plot slopes upwards.

When r = 1, the scatter plot is perfectly linear and slopes upward. When r = -1, the scatter plot is perfectly linear and slopes downward. When r = 0, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be uncorrelated.

r_scatter(0.8)


r_scatter(0.25)

r_scatter(0)

r_scatter(-0.7)


Calculating r

The formula for r is not apparent from our observations so far, and it has a mathematical basis that is outside the scope of this class. However, the calculation is straightforward and helps us understand several of the properties of r.

Formula for r:

r is the average of the products of the two variables, when both variables are measured in standard units.

Here are the steps in the calculation. We will apply the steps to a simple table of values of x and y.

x = np.arange(1, 7, 1)
y = [2, 3, 1, 5, 2, 7]
t = Table().with_columns([
    'x', x,
    'y', y
])
t


x | y
1 | 2
2 | 3
3 | 1
4 | 5
5 | 2
6 | 7

Based on the scatter diagram, we expect that r will be positive but not equal to 1.

t.scatter(0, 1, s=30, color='red')

Step 1. Convert each variable to standard units.

t_su = t.with_columns([
    'x (standard units)', standard_units(x),
    'y (standard units)', standard_units(y)
])
t_su


x | y | x (standard units) | y (standard units)
1 | 2 | -1.46385 | -0.648886
2 | 3 | -0.87831 | -0.162221
3 | 1 | -0.29277 | -1.13555
4 | 5 | 0.29277 | 0.811107
5 | 2 | 0.87831 | -0.648886
6 | 7 | 1.46385 | 1.78444

Step 2. Multiply each pair of standard units.

t_product = t_su.with_column('product of standard units', t_su.column(2) * t_su.column(3))
t_product

x | y | x (standard units) | y (standard units) | product of standard units
1 | 2 | -1.46385 | -0.648886 | 0.949871
2 | 3 | -0.87831 | -0.162221 | 0.142481
3 | 1 | -0.29277 | -1.13555 | 0.332455
4 | 5 | 0.29277 | 0.811107 | 0.237468
5 | 2 | 0.87831 | -0.648886 | -0.569923
6 | 7 | 1.46385 | 1.78444 | 2.61215

Step 3. r is the average of the products computed in Step 2.

# r is the average of the products of standard units
r = np.mean(t_product.column(4))
r

0.61741639718977093

As expected, r is positive but not equal to 1.

Properties of r

The calculation shows that:


r is a pure number. It has no units. This is because r is based on standard units.

r is unaffected by changing the units on either axis. This too is because r is based on standard units.

r is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called x and which y. Geometrically, switching axes reflects the scatter plot about the line y = x, but does not change the amount of clustering nor the sign of the association.

t.scatter('y', 'x', s=30, color='red')
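The first two properties can be checked mechanically. This is a self-contained NumPy sketch of the same r computation, applied to the x and y values from the table above; r_of is a name I made up for the sketch:

```python
import numpy as np

def standard_units(a):
    a = np.asarray(a, dtype=float)
    return (a - a.mean()) / a.std()

def r_of(x, y):
    """r: the mean of the products of the two variables in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

x = np.arange(1, 7)
y = np.array([2, 3, 1, 5, 2, 7])
r = r_of(x, y)

# switching the axes leaves r unchanged
assert abs(r - r_of(y, x)) < 1e-12
# changing the units of x (say, multiplying by 2 and shifting) leaves r unchanged
assert abs(r - r_of(2 * x + 10, y)) < 1e-12
```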

We can define a function correlation to compute r, based on the formula that we used above. The arguments are a table and two labels of columns in the table. The function returns the mean of the products of those column values in standard units, which is r.

def correlation(t, x, y):
    return np.mean(standard_units(t.column(x)) * standard_units(t.column(y)))

Let's call the function on the x and y columns of t. The function returns the same answer for the correlation between x and y as we got by direct application of the formula for r.

correlation(t, 'x', 'y')

0.61741639718977093
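The value can also be cross-checked against NumPy's built-in correlation routine, which computes the same quantity for a pair of arrays:

```python
import numpy as np

x = np.arange(1, 7)
y = np.array([2, 3, 1, 5, 2, 7])

def standard_units(a):
    a = np.asarray(a, dtype=float)
    return (a - a.mean()) / a.std()

r = np.mean(standard_units(x) * standard_units(y))

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
r_numpy = np.corrcoef(x, y)[0, 1]
```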


Calling correlation on columns of the table suv gives us the correlation between price and mileage as well as the correlation between price and acceleration.

correlation(suv, 'mpg', 'msrp')

-0.6667143635709919

correlation(suv, 'acceleration', 'msrp')

0.48699799279959155

These values confirm what we had observed:

There is a negative association between price and efficiency, whereas the association between price and acceleration is positive. The linear relation between price and acceleration is a little weaker (correlation about 0.5) than between price and mileage (correlation about -0.67).

Care in Using Correlation

Correlation is a simple and powerful concept, but it is sometimes misused. Before using r, it is important to be aware of what correlation does and does not measure.

Correlation only measures association. Correlation does not imply causation. Though the correlation between the weight and the math ability of children in a school district may be positive, that does not mean that doing math makes children heavier or that putting on weight improves the children's math skills. Age is a confounding variable: older children are both heavier and better at math than younger children, on average.

Correlation measures linear association. Variables that have strong non-linear association might have very low correlation. Here is an example of variables that have a perfect quadratic relation y = x² but have correlation equal to 0.

nonlinear = Table().with_columns([
    'x', np.arange(-4, 4.1, 0.5),
    'y', np.arange(-4, 4.1, 0.5)**2
])
nonlinear.scatter('x', 'y', s=30, color='r')


correlation(nonlinear, 'x', 'y')

0.0

Outliers can have a big effect on correlation. Here is an example where a scatter plot for which r is equal to 1 is turned into a plot for which r is equal to 0, by the addition of just one outlying point.

line = Table().with_columns([
    'x', [1, 2, 3, 4],
    'y', [1, 2, 3, 4]
])
line.scatter('x', 'y', s=30, color='r')


correlation(line, 'x', 'y')

1.0

outlier = Table().with_columns([
    'x', [1, 2, 3, 4, 5],
    'y', [1, 2, 3, 4, 0]
])
outlier.scatter('x', 'y', s=30, color='r')

correlation(outlier, 'x', 'y')

0.0

Correlations based on aggregated data can be misleading. As an example, here are data on the Critical Reading and Math SAT scores in 2014. There is one point for each of the 50 states and one for Washington, D.C. The column Participation Rate contains the percent of high school seniors who took the test. The next three columns show the average score in the state on each portion of the test, and the final column is the average of the total scores on the test.

sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014


State | Participation Rate | Critical Reading | Math | Writing | Combined
Alabama | 6.7 | 547 | 538 | 532 | 1617
Alaska | 54.2 | 507 | 503 | 475 | 1485
Arizona | 36.4 | 522 | 525 | 500 | 1547
Arkansas | 4.2 | 573 | 571 | 554 | 1698
California | 60.3 | 498 | 510 | 496 | 1504
Colorado | 14.3 | 582 | 586 | 567 | 1735
Connecticut | 88.4 | 507 | 510 | 508 | 1525
Delaware | 100 | 456 | 459 | 444 | 1359
District of Columbia | 100 | 440 | 438 | 431 | 1309
Florida | 72.2 | 491 | 485 | 472 | 1448
... (41 rows omitted)

The scatter diagram of Math scores versus Critical Reading scores is very tightly clustered around a straight line; the correlation is close to 0.985.

sat2014.scatter('Critical Reading', 'Math')

correlation(sat2014, 'Critical Reading', 'Math')

0.98475584110674341


It is important to note that this does not reflect the strength of the relation between the Math and Critical Reading scores of students. States don't take tests – students do. The data in the table have been created by lumping all the students in each state into a single point at the average values of the two variables in that state. But not all students in the state will be at that point, as students vary in their performance. If you plot a point for each student instead of just one for each state, there will be a cloud of points around each point in the figure above. The overall picture will be more fuzzy. The correlation between the Math and Critical Reading scores of the students will be lower than the value calculated based on state averages.

Correlations based on aggregates and averages are called ecological correlations and are frequently reported. As we have just seen, they must be interpreted with care.
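The aggregation effect can be seen in a small simulation. All the numbers below are invented, not the SAT data: each "student" score is a shared state-level shift plus individual noise, and averaging within states washes out the individual noise, inflating the correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_students = 50, 100

# Invented scores: a shared state-level shift plus independent student noise
state_shift = rng.normal(0, 50, size=n_states)
reading = 500 + np.repeat(state_shift, n_students) + rng.normal(0, 60, n_states * n_students)
math = 500 + np.repeat(state_shift, n_students) + rng.normal(0, 60, n_states * n_students)

student_r = np.corrcoef(reading, math)[0, 1]

# Correlation of the 50 state averages: the "ecological" correlation
state_r = np.corrcoef(reading.reshape(n_states, n_students).mean(axis=1),
                      math.reshape(n_states, n_students).mean(axis=1))[0, 1]
```

Here state_r comes out far higher than student_r, mirroring the warning above about state averages overstating the student-level relation.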

Serious or tongue-in-cheek?

In 2012, a paper in the respected New England Journal of Medicine examined the relation between chocolate consumption and Nobel Prizes in a group of countries. Scientific American responded seriously; others were more relaxed. The paper included a graph plotting the two variables, with one point per country.


Regression

The regression line

The concepts of correlation and the "best" straight line through a scatter plot were developed in the late 1800's and early 1900's. The pioneers in the field were Sir Francis Galton, who was a cousin of Charles Darwin, and Galton's protégé Karl Pearson. Galton was interested in eugenics, and was a meticulous observer of the physical traits of parents and their offspring. Pearson, who had greater expertise than Galton in mathematics, helped turn those observations into the foundations of mathematical statistics.

The scatter plot below is of a famous data set collected by Pearson and his colleagues in the early 1900's. It consists of the heights, in inches, of 1,078 pairs of fathers and sons. The scatter method of a table takes an optional argument s to set the size of the dots.

heights = Table.read_table('heights.csv')
heights.scatter('father', 'son', s=10)

Notice the familiar football shape with a dense center and a few points on the perimeter. This shape results from two bell-shaped distributions that are correlated. The heights of both fathers and sons have a bell-shaped distribution.


heights.hist(bins=np.arange(55.5, 80, 1), unit='inch')

The average height of sons is about an inch taller than the average of fathers.

np.mean(heights.column('son')) - np.mean(heights.column('father'))

0.99740259740261195

The difference in height between a son and a father varies as well, with the bulk of the distribution between -5 inches and +7 inches.

diffs = Table().with_column('son - father difference',
                            heights.column('son') - heights.column('father'))
diffs.hist(bins=np.arange(-9.5, 10, 1), unit='inch')


The correlation between the heights of the fathers and sons is about 0.5.

r = correlation(heights, 'father', 'son')
r

0.50116268080759108

The regression effect

For fathers around 72 inches in height, we might expect their sons to be tall as well. The histogram below shows the height of all sons of 72-inch fathers.

six_foot_fathers = heights.where(np.round(heights.column('father')) == 72)
six_foot_fathers.hist('son', bins=np.arange(55.5, 80, 1))


Most (68%) of the sons of these 72-inch fathers are less than 72 inches tall, even though sons are an inch taller than fathers on average!

np.count_nonzero(six_foot_fathers.column('son') < 72) / six_foot_fathers.num_rows

0.6842105263157895

In fact, the average height of a son of a 72-inch father is less than 71 inches. The sons of tall fathers are simply not as tall in this sample.

np.mean(six_foot_fathers.column('son'))

70.728070175438603

This fact was noticed by Galton, who had been hoping that exceptionally tall fathers would have sons who were just as exceptionally tall. However, the data were clear, and Galton realized that the tall fathers have sons who are not quite as exceptionally tall, on average. Frustrated, Galton called this phenomenon "regression to mediocrity."

Galton also noticed that exceptionally short fathers had sons who were somewhat taller relative to their generation, on average. In general, individuals who are away from average on one variable are expected to be not quite as far away from average on the other. This is called the regression effect.
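The regression effect can also be seen in simulation. The sketch below uses plain numpy (not the book's Table code) to generate hypothetical father-son pairs, measured in standard units, with correlation 0.5; among exceptionally tall fathers, the sons are on average noticeably less far above average.

```python
import numpy as np

# Hypothetical data, not Pearson's: father-son height pairs in standard
# units with correlation 0.5.
rng = np.random.default_rng(0)
n = 100_000
fathers = rng.standard_normal(n)
sons = 0.5 * fathers + np.sqrt(1 - 0.5 ** 2) * rng.standard_normal(n)

tall = fathers > 2                    # exceptionally tall fathers
avg_father = fathers[tall].mean()     # well above 2 standard units
avg_son = sons[tall].mean()           # about half as far above average
print(avg_son < avg_father)           # True: the regression effect
```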


The figure below shows the scatter plot of the data when both variables are measured in standard units. The red line shows equal standard units and has a slope of 1, but it does not match the angle of the cloud of points. The blue line, which does follow the angle of the cloud, is called the regression line, named for the "regression to mediocrity" it predicts.

heights_standard = Table().with_columns([
    'father (standard units)', standard_units(heights['father']),
    'son (standard units)', standard_units(heights['son'])
])

heights_standard.scatter(0, fit_line=True, s=10)

_ = plots.plot([-3, 3], [-3, 3], color='r', lw=3)

The scatter method of a Table draws this regression line for us when called with fit_line=True. This line passes through the result of multiplying father heights (in standard units) by r.

father_standard = heights_standard.column(0)

heights_standard.with_column('father (standard units) * r', father_standard * r)


Another interpretation of this line is that it passes through average values for slices of the population of sons. To see this relationship, we can round the father heights each to the nearest unit, then average the heights of all sons associated with these rounded values. The green line is the regression line for these data, and it passes close to all of the yellow points, which are mean heights of sons (in standard units).

rounded = heights_standard.with_column('father (standard units)', np.round(heights_standard.column(0)))

rounded.join('father (standard units)', rounded.group(0, np.average)).scatter(0, s=80)

_ = plots.plot([-3, 3], [-3*r, 3*r], color='g', lw=2)


Karl Pearson used the observation of the regression effect in the data above, as well as in other data provided by Galton, to develop the formal calculation of the correlation coefficient. That is why r is sometimes called Pearson's correlation.

The regression line in original units¶

As we saw in the last section for football shaped scatter plots, when the variables x and y are measured in standard units, the best straight line for estimating y based on x has slope r and passes through the origin. Thus the equation of the regression line can be written as:

estimate of y, in standard units = r × (x, in standard units)

That is,

(estimate of y − mean of y) / (SD of y) = r × (x − mean of x) / (SD of x)

The equation can be converted into the original units of the data, either by rearranging this equation algebraically, or by labeling some important features of the line both in standard units and in the original units.


Calculation of the slope and intercept¶

The regression line is also commonly expressed as a slope and an intercept, where an estimate for y is computed from an x value using the equation

estimate of y = slope × x + intercept

The calculations of the slope and intercept of the regression line can be derived from the equation above.


def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y)) / np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))

[slope(heights, 'father', 'son'), intercept(heights, 'father', 'son')]

[0.51400591254559247, 33.892800540661682]

It is worth noting that the intercept of approximately 33.89 inches is not intended as an estimate of the height of a son whose father is 0 inches tall. There is no such son and no such father. The intercept is merely a geometric or algebraic quantity that helps define the line. In general, it is not a good idea to extrapolate, that is, to make estimates outside the range of the available data. It is certainly not a good idea to extrapolate as far away as 0 is from the heights of the fathers in the study.

It is also worth noting that the slope is not r! Instead, it is r multiplied by the ratio of the standard deviations.

correlation(heights, 'father', 'son')

0.50116268080759108
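This relationship can be checked numerically. The sketch below uses plain numpy on synthetic data (hypothetical heights, not the Pearson table) to confirm that the least-squares slope equals r multiplied by the ratio of standard deviations, and is not r itself:

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
x = rng.normal(68, 2, size=1000)                  # hypothetical father heights
y = 35 + 0.5 * x + rng.normal(0, 2, size=1000)    # hypothetical son heights

r = np.corrcoef(x, y)[0, 1]
formula_slope = r * np.std(y) / np.std(x)         # r * SD(y) / SD(x)
fit_slope, fit_intercept = np.polyfit(x, y, 1)    # least-squares line

print(np.isclose(formula_slope, fit_slope))       # the formula matches the fit
print(np.isclose(r, fit_slope))                   # but the slope is not r
```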

Fitted values¶

We can also estimate the height of every son in the data using the slope and intercept. The estimated values of y are called the fitted values. They all lie on a straight line. To calculate them, take a father's height, multiply it by the slope of the regression line, and add the intercept. In other words, calculate the height of the regression line at the given value of x.


def fit(table, x, y):
    """Return the height of the regression line at each x value."""
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

fitted = heights.with_column('son (fitted)', fit(heights, 'father', 'son'))

fitted

father  son  son (fitted)

65 59.8 67.3032

63.3 63.2 66.4294

65 63.3 67.3032

65.8 62.8 67.7144

61.1 64.3 65.2986

63 64.2 66.2752

65.4 64.1 67.5088

64.7 64 67.149

66.1 64.6 67.8686

67 64 68.3312

... (1068 rows omitted)

In original units, we can see that these fitted values fall on a line.

fitted.scatter(0, s=10)


Moreover, this line (in green below) passes near the average heights of sons for slices of the population. In this case, we round the heights of fathers to the nearest inch to construct the slices.

rounded = heights.with_column('father', np.round(heights.column('father')))

rounded.join('father', rounded.group(0, np.average)).scatter(0, s=80)

_ = plots.plot(fitted.column('father'), fitted.column('son (fitted)'), color='g', lw=2)


Prediction¶

In the last section, we developed the concepts of correlation and regression as ways to describe data. We will now see how these concepts can become powerful tools for prediction, when used appropriately. In particular, given paired observations of two quantities and some additional observations of only the first quantity, we can infer typical values for the second quantity.

Assumptions of randomness: a "regression model"¶

Prediction is possible if we believe that a scatter plot reflects the underlying relation between the two variables being plotted, but does not specify the relation completely. For example, a scatter plot of the heights of fathers and sons shows us the precise relation between the two variables in one particular sample of men; but we might wonder whether that relation holds true, or almost true, among all fathers and sons in the population from which the sample was drawn, or indeed among fathers and sons in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as models. Sets of assumptions about randomness in roughly linear scatter plots are called regression models.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the signal that we would like to identify. However, we are not able to see the line clearly, because there is additional variability in the data beyond the relation. What we see are points that are scattered around the line. For each of the points, the linear signal that relates the two variables has been combined with some other source of variability that has nothing to do with the relation at all. This random noise obfuscates the linear signal, and it is our inferential goal to identify the signal despite the noise.

In order to make these statements more precise, we will use the np.random.normal function, which generates samples from a normal distribution with mean 0 and standard deviation 1. These samples correspond to a bell-shaped distribution expressed in standard units.

np.random.normal()


-0.6423537870160524

samples = Table('x')

for i in np.arange(10000):
    samples.append([np.random.normal()])

samples.hist(0, bins=np.arange(-4, 4, 0.1))

The regression model specifies that the points in a scatter plot, measured in standard units, are generated at random as follows.

The x values are sampled from the standard normal distribution.

For each x, its corresponding y is sampled using the signal_and_noise function below, which takes x and the correlation coefficient r.

def signal_and_noise(x, r):
    return r * x + np.random.normal() * (1 - r**2)**0.5
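As a sanity check on this form of the model, pairs generated as r·x plus noise scaled by √(1 − r²) have correlation close to r and standard deviation close to 1. A plain-numpy sketch (outside the book's Table code):

```python
import numpy as np

# Generate many (x, y) pairs from the regression model with r = 0.5.
rng = np.random.default_rng(7)
r = 0.5
x = rng.standard_normal(100_000)
y = r * x + rng.standard_normal(100_000) * (1 - r ** 2) ** 0.5

print(round(np.corrcoef(x, y)[0, 1], 2))   # close to r
print(round(np.std(y), 2))                 # close to 1: y is in standard units
```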

For example, if we choose r to be one half, then the following scatter diagram might result.


def regression_model(r, sample_size):
    pairs = Table(['x', 'y'])
    for i in np.arange(sample_size):
        x = np.random.normal()
        y = signal_and_noise(x, r)
        pairs.append([x, y])
    return pairs

regression_model(0.5, 1000).scatter('x', 'y')

Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below draws both the true line used to generate the data (green) and the regression line (blue) found by analyzing the pairs that were generated.


def compare(true_r, sample_size):
    pairs = regression_model(true_r, sample_size)
    estimated_r = correlation(pairs, 'x', 'y')
    pairs.scatter('x', 'y', fit_line=True)
    plt.plot([-3, 3], [-3*true_r, 3*true_r], color='g', lw=2)
    print("The true r is", true_r, "and the estimated r is", estimated_r)

compare(0.5, 1000)

The true r is 0.5 and the estimated r is 0.459672161067

With 1000 samples, we see that the true line and the regression line are quite similar. With fewer samples, the regression estimate of the true line that generated the data may be quite different.

compare(0.5, 50)

The true r is 0.5 and the estimated r is 0.627329170668


With very few samples, the regression line may have a clear positive or negative slope, even when the true relationship has a correlation of 0.

compare(0, 10)

The true r is 0 and the estimated r is 0.501582566452


With real data, we will never observe the true line. In fact, regression is often applied when we are uncertain whether the true relation between two quantities is linear at all. What the simulation shows is that if the regression model looks plausible, and we have a large sample, then the regression line is a good approximation to the true line.

Predictions¶

Here is an example where the regression model can be used to make predictions.

The data are a subset of the information gathered in a randomized controlled trial about treatments for Hodgkin's disease. Hodgkin's disease is a cancer that typically affects young people. The disease is curable but the treatment is very harsh. The purpose of the trial was to come up with a dosage that would cure the cancer but minimize the adverse effects on the patients.

The table hodgkins contains data on the effect that the treatment had on the lungs of 22 patients. The columns are:

height in cm
a measure of radiation to the mantle (neck, chest, underarms)
a measure of chemotherapy
a score of the health of the lungs at baseline, that is, at the start of the treatment; higher scores correspond to more healthy lungs
the same score of the health of the lungs, 15 months after treatment


hodgkins = Table.read_table('hodgkins.csv')

hodgkins.show()

height rad chemo base month15

164 679 180 160.57 87.77

168 311 180 98.24 67.62

173 388 239 129.04 133.33

157 370 168 85.41 81.28

160 468 151 67.94 79.26

170 341 96 150.51 80.97

163 453 134 129.88 69.24

175 529 264 87.45 56.48

185 392 240 149.84 106.99

178 479 216 92.24 73.43

179 376 160 117.43 101.61

181 539 196 129.75 90.78

173 217 204 97.59 76.38

166 456 192 81.29 67.66

170 252 150 98.29 55.51

165 622 162 118.98 90.92

173 305 213 103.17 79.74

174 566 198 94.97 93.08

173 322 119 85 41.96

173 270 160 115.02 81.12

183 259 241 125.02 97.18

188 238 252 137.43 113.2

It is evident that the patients' lungs were less healthy 15 months after the treatment than at baseline. At 36 months, they did recover most of their lung function, but those data are not part of this table.

The scatter plot below shows that taller patients had higher scores at 15 months.


hodgkins.scatter('height', 'month15', fit_line=True)

The scatter plot looks roughly linear. Let us assume that the regression model holds. Under that assumption, the regression line can be used to predict the 15-month score of a new patient, based on the patient's height. The prediction will be good provided the assumptions of the regression model are justified for these data, and provided the new patient is similar to those in the study.

The slope and intercept functions give us the slope and intercept of the regression line. To predict the 15-month score of a new patient whose height is h cm, we use the following calculation:

predicted 15-month score of a patient who is h cm tall = slope × h + intercept

The predicted 15-month score of a patient who is 173 cm tall is 83.66, and that of a patient who is 163 cm tall is 73.69.

# slope and intercept

a = slope(hodgkins, 'height', 'month15')

b = intercept(hodgkins, 'height', 'month15')


# New patient, 173 cm tall
# Predicted 15-month score:
a * 173 + b

83.65699801460444

# New patient, 163 cm tall
# Predicted 15-month score:
a * 163 + b

73.694360467072741

The logic of regression prediction can be wrapped in a function predict that takes a table, the two columns of interest, and some new value for x. It predicts the average value for y.

def predict(t, x, y, new_x_value):
    a = slope(t, x, y)
    b = intercept(t, x, y)
    return a * new_x_value + b

predict(hodgkins, 'height', 'month15', 163)

73.694360467072741

Variability of the Prediction¶

As data scientists working under the regression model, we must always remain aware that a sample might have been different. Had the sample been different, the regression line would have been different too, and so would our prediction.


To understand how much a prediction might vary, we can select different samples from the data we have. Below, we select only 15 of the 22 rows of the hodgkins table at random without replacement, then make a prediction.

predict(hodgkins.sample(15), 'height', 'month15', 163)

71.653856125189805

The prediction is different, but how much might we expect it to differ? From a single sample, it is not possible to answer this question. The simulation below draws 1000 samples of 15 rows each, then draws a histogram of the predicted values using each sample.

def sample_predict(new_x_value):
    predictions = Table(['Predictions'])
    for i in np.arange(1000):
        predicted = predict(hodgkins.sample(15), 'height', 'month15', new_x_value)
        predictions.append([predicted])
    predictions.hist(0, bins=np.arange(40, 96, 2))

sample_predict(163)

Using only 15 rows instead of all 22, the prediction could have come out differently from 73.69. This histogram shows how different the prediction might be. Almost all predictions fall between 64 and 82, but a few are outside of this range.
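The same resampling idea can be sketched with plain numpy and made-up data (hypothetical heights and scores, not the hodgkins table): refit the line on random subsamples of 15 points and watch the prediction at a fixed height vary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(170, 7, size=22)             # hypothetical heights in cm
y = -90 + x + rng.normal(0, 12, size=22)    # hypothetical 15-month scores

def predict_from_subsample(new_x):
    idx = rng.choice(len(x), 15, replace=False)   # 15 of the 22 rows
    a, b = np.polyfit(x[idx], y[idx], 1)          # refit the line
    return a * new_x + b

preds = np.array([predict_from_subsample(163) for _ in range(1000)])
print(preds.min() < preds.max())    # the prediction varies across subsamples
```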


For an x value that is closer to the mean of all heights, the predictions from 15 rows are more tightly clustered.

sample_predict(173)

In this histogram of the predictions for a height of 173 cm, we see that almost all predictions fall between 76 and 90. The narrow range 80 to 86 accounts for nearly 70% of the estimates.

Predictions of values outside the typical observations of height vary wildly. A data scientist is rarely justified in extrapolating a linear trend beyond the range of values observed.

sample_predict(150)


Higher-Order Functions¶

The functions we have defined so far compute their outputs directly from their inputs. Their return values can be computed from their arguments. However, some functions require additional data to determine their return values.

Let's start with a simple example. Consider the following three functions that all scale their input by a constant factor.

def double(x):
    return 2 * x

def triple(x):
    return 3 * x

def halve(x):
    return 0.5 * x

[double(4), triple(4), halve(4), double(5), triple(5), halve(5)]

[8, 12, 2.0, 10, 15, 2.5]

Rather than defining each of these functions with a separate def statement, it is possible to generate each of them dynamically by passing in a constant to a generic scale_by function.

def scale_by(a):
    def scale(x):
        return a * x
    return scale

double = scale_by(2)

triple = scale_by(3)

halve = scale_by(0.5)

[double(4), triple(4), halve(4), double(5), triple(5), halve(5)]

[8, 12, 2.0, 10, 15, 2.5]

The body of the scale_by function actually defines a new function, called scale, and then returns it. The return value of scale_by is a function that takes a number x and multiplies it by a. Because scale_by is a function that returns a function, it is called a higher-order function.

The purpose of higher-order functions is to create functions that combine data with behavior. Above, the double function is created by combining the data value 2 with the behavior of scaling by a constant.

A def statement within another def statement behaves in the same way as any other def statement except that:

1. The name of the inner function is only accessible within the outer function's body. For example, it is not possible to use the name scale once scale_by has returned.

2. The body of the inner function can refer to the arguments and names within the outer function. So, for example, the body of the inner scale function can refer to a, which is an argument of the outer function scale_by.

The final line of the outer scale_by function is return scale. This line returns the function that was just defined. The purpose of returning a function is to give it a name. The line

double = scale_by(2)

gives the name double to the particular instance of the scale function that has a assigned to the number 2. Notice that it is possible to keep multiple instances of the scale function around at once, all with different names such as double and triple, and Python keeps track of which value for a to use each time one of these functions is called.

Example: Standard Units¶

We have seen how to convert a list of numbers to standard units by subtracting the mean and dividing by the standard deviation.

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)


This function can convert any list of numbers to standard units.

observations = [3, 4, 2, 4, 3, 5, 1, 6]

np.round(standard_units(observations), 3)

array([-0.333, 0.333, -1., 0.333, -0.333, 1., -1.667, 1.667])

How would we answer the question, "what is the number 8 (in original units) converted to standard units?"

Adding 8 to the list of observations and re-computing standard_units does not answer this question, because adding an observation changes the scale!

observations_with_eight = [3, 4, 2, 4, 3, 5, 1, 6, 8]

standard_units(observations_with_eight)

array([-0.5, 0., -1., 0., -0.5, 0.5, -1.5, 1., 2.])

If instead we want to maintain the original scale, we need a function that remembers the original mean and standard deviation. This scenario, where both data and behavior are required to arrive at the answer, is a natural fit for a higher-order function.

def converter_to_standard_units(any_numbers):
    """Return a function that converts to the standard units of any_numbers."""
    mean = np.mean(any_numbers)
    sd = np.std(any_numbers)
    def convert(a_number_in_original_units):
        return (a_number_in_original_units - mean) / sd
    return convert

Now, we can use the original list of numbers to create the conversion function. The mean and standard deviation are stored within the function that is returned, called to_su below. What's more, these numbers are only computed once, no matter how many times we call the function.


observations = [3, 4, 2, 4, 3, 5, 1, 6]

to_su = converter_to_standard_units(observations)

[to_su(2), to_su(3), to_su(8)]

[-1.0, -0.33333333333333331, 3.0]

A complementary function that returns a converter from standard units to original units involves multiplying by the standard deviation, then adding the mean.

def converter_from_standard_units(any_numbers):
    """Return a function that converts from the standard units of any_numbers."""
    mean = np.mean(any_numbers)
    sd = np.std(any_numbers)
    def convert(in_standard_units):
        return in_standard_units * sd + mean
    return convert

from_su = converter_from_standard_units(observations)

from_su(3)

8.0

It should be the case that any number converted to standard units and back is unchanged, as long as the same list is used to create both functions.

from_su(to_su(11))

11.0

Example: Prediction¶

Previously, in order to compute the fitted values for a regression, we first computed the slope and intercept of the regression line. As an alternate implementation, for each x, we can convert x to standard units from the original units of x values, then multiply by r, then convert the result to the original units of y values. We can write down this process as a higher-order function.

def regression_estimator(table, x, y):
    """Return an estimator for y as a function of x."""
    to_su = converter_to_standard_units(table.column(x))
    from_su = converter_from_standard_units(table.column(y))
    r = correlation(table, x, y)
    def estimate(any_x):
        return from_su(r * to_su(any_x))
    return estimate

The following dataset of mothers and babies was collected from the Kaiser Hospital in Oakland, California.

baby = Table.read_table('baby.csv')

baby

Birth Weight  Gestational Days  Maternal Age  Maternal Height  Maternal Pregnancy Weight  Maternal Smoker

120 284 27 62 100 False

113 282 33 64 135 False

128 279 28 64 115 True

108 282 23 67 125 True

136 286 25 62 93 False

138 244 33 62 178 False

132 245 23 65 140 False

120 289 25 62 125 False

143 299 30 66 136 True

140 351 27 68 120 False

... (1164 rows omitted)

Gestational days are the number of days that the mother is pregnant. Most pregnancies last between 250 and 310 days.


baby.hist(1, bins=np.arange(200, 330, 10))

For this collection of typical births, gestational days and birth weight appear to be linearly related.

typical = baby.where(np.logical_and(baby.column(1) >= 250, baby.column(1) <= 310))

typical.scatter(1, 0, fit_line=True)


The function average_weight returns the average weight of babies born after a certain number of gestational days, using the regression line to predict this value. This function is defined within regression_estimator and returned. The returned value is bound to the name average_weight using an assignment statement.

average_weight = regression_estimator(typical, 1, 0)

The average_weight function contains all the information needed to make birth weight predictions, but when it is called, it takes only a single argument: the gestational days. We avoid the need to pass the table and column labels into the average_weight function each time that we call it because those values were passed to regression_estimator, which defined the average_weight function in terms of those values.

average_weight(290)

126.33785752202166

This function is quite flexible. For instance, we can compute fitted values for all gestational days by applying it to the column of gestational days.

fitted = typical.apply(average_weight, 1)

typical.select([1, 0]).with_column('Birth Weight (fitted)', fitted)


Higher-order functions appear often in the context of prediction because estimating a value often involves both data to inform the prediction and a procedure that makes the prediction itself.


Errors¶

Error in the regression estimate¶

Though the average residual is 0, each individual residual is not. Some residuals might be quite far from 0. To get a sense of the amount of error in the regression estimate, we will start with a graphical description of the sense in which the regression line is the "best".

Our example is a dataset that has one point for every chapter of the novel "Little Women." The goal is to estimate the number of characters (that is, letters, punctuation marks, and so on) based on the number of periods. Recall that we attempted to do this in the very first lecture of this course.

little_women = Table.read_table('little_women.csv')

little_women.show(3)

Characters Periods

21759 189

22148 188

20558 231

... (44 rows omitted)

# One point for each chapter
# Horizontal axis: number of periods
# Vertical axis: number of characters (as in a, b, ", ?, etc; not people in the book)

little_women.scatter('Periods', 'Characters')

ComputationalandInferentialThinking

232Errors

correlation(little_women, 'Periods', 'Characters')

0.92295768958548163

The scatter plot is remarkably close to linear, and the correlation is more than 0.92.

The figure below shows the scatter plot and regression line, with four of the errors marked in red.

# Residuals: Deviations from the regression line

lw_slope = slope(little_women, 'Periods', 'Characters')

lw_intercept = intercept(little_women, 'Periods', 'Characters')

print('Slope:', np.round(lw_slope))

print('Intercept:', np.round(lw_intercept))

lw_errors(lw_slope, lw_intercept)

Slope: 87.0

Intercept: 4745.0


Had we used a different line to create our estimates, the errors would have been different. The picture below shows how big the errors would be if we were to use a particularly silly line for estimation.

# Errors: Deviations from a different line

lw_errors(-100, 50000)

Below is a line that we have used before without saying that we were using a line to create estimates. It is the horizontal line at the value "average of y." Suppose you were asked to estimate y and were not told the value of x; then you would use the average of y as your estimate, regardless of the chapter. In other words, you would use the flat line below.

Each error that you would make would then be a deviation from average. The rough size of these deviations is the SD of y.

In summary, if we use the flat line at the average of y to make our estimates, the estimates will be off by about the SD of y.

# Errors: Deviations from the flat line at the average of y

characters_average = np.mean(little_women.column('Characters'))

lw_errors(0, characters_average)

Least Squares¶

If you use any arbitrary line as your line of estimates, then some of your errors are likely to be positive and others negative. To avoid cancellation when measuring the rough size of the errors, we take the mean of the squared errors rather than the mean of the errors themselves. This is exactly analogous to our reason for looking at squared deviations from average, when we were learning how to calculate the SD.

The mean squared error of estimation using a straight line is a measure of roughly how big the squared errors are; taking the square root yields the root mean squared error, which is in the same units as y.
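The computation can be sketched generically with plain numpy and a few made-up points (not the Little Women counts): square the errors, average them, and take the square root.

```python
import numpy as np

# Made-up data where y is roughly 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def rmse(slope, intercept):
    errors = y - (slope * x + intercept)   # error at each point
    return np.mean(errors ** 2) ** 0.5     # root mean squared error

# A line with a sensible slope beats the flat line at the average of y.
print(rmse(2, 0) < rmse(0, np.mean(y)))
```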


Here is a remarkable fact of mathematics: the regression line minimizes the mean squared error of estimation (and hence also the root mean squared error) among all straight lines. That is why the regression line is sometimes called the "least squares line."

Computing the "best" line.

To get estimates of y based on x, you can use any line you want. Every line has a mean squared error of estimation. "Better" lines have smaller errors. The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.

We can define a function to compute the root mean squared error of any line through the Little Women scatter diagram.

def lw_rmse(slope, intercept):
    lw_errors(slope, intercept)
    x = little_women.column('Periods')
    y = little_women.column('Characters')
    fitted = slope * x + intercept
    mse = np.average((y - fitted) ** 2)
    print("Root mean squared error:", mse ** 0.5)

lw_rmse(0, characters_average)

Root mean squared error: 7019.17593405


The error is indeed much smaller if we choose a slope and intercept near those of the regression line.

lw_rmse(90, 4000)

Root mean squared error: 2715.53910638

But the minimum is achieved using the regression line itself.

lw_rmse(lw_slope, lw_intercept)

Root mean squared error: 2701.69078531

Numerical Optimization¶

We can also define mean_squared_error for an arbitrary dataset. We'll use a higher-order function so that we can try many different lines on the same dataset, simply by passing in their slope and intercept.

def mean_squared_error(table, x, y):
    def for_line(slope, intercept):
        fitted = slope * table.column(x) + intercept
        return np.average((table.column(y) - fitted) ** 2)
    return for_line

The Old Faithful geyser in Yellowstone National Park erupts regularly, but the waiting time between eruptions (in seconds) and the duration of the eruptions in seconds do vary.

faithful = Table.read_table('faithful.csv')

faithful.scatter(1, 0)


It appears that there are two types of eruptions, short and long. The eruptions above 3 seconds appear correlated with waiting time.

long = faithful.where(faithful.column('eruptions') > 3)

long.scatter(1, 0, fit_line=True)

The mse_long function defined below takes a slope and an intercept and returns the mean squared error of a linear predictor of eruption size from the waiting time.


mse_long = mean_squared_error(long, 1, 0)

mse_long(0, 4) ** 0.5

0.50268461001194098

If we experiment with different values, we can find a low-error slope and intercept through trial and error.

mse_long(0.01, 3.5) ** 0.5

0.39143872353883735

mse_long(0.02, 2.7) ** 0.5

0.38168519564089542

The minimize function can be used to find the arguments of a function for which the function returns its minimum value. Python uses a similar trial-and-error approach, following the changes that lead to incrementally lower output values.

a, b = minimize(mse_long)

The root mean squared error of the minimal slope and intercept is smaller than any of the values we considered before.

mse_long(a, b) ** 0.5

0.38014612988504093

In fact, these values a and b are the same as the values returned by the slope and intercept functions we developed based on the correlation coefficient. We see small deviations due to the inexact nature of minimize, but the values are essentially the same.


print("slope:", slope(long, 1, 0))

print("a:", a)

print("intercept:", intercept(long, 1, 0))

print("b:", b)

slope: 0.0255508940715

a: 0.0255496

intercept: 2.24752334164

b: 2.247629

Therefore, we have found not only that the regression line minimizes mean squared error, but also that minimizing mean squared error gives us the regression line. The regression line is the only line that minimizes mean squared error.

Residuals¶

The amount of error in each of these regression estimates is the difference between the observed value of y and its estimate. These errors are called residuals. Some residuals are positive. These correspond to points that are above the regression line – points for which the regression line under-estimates y. Negative residuals correspond to the line over-estimating values of y.

waiting = long.column('waiting')

eruptions = long.column('eruptions')

fitted = a * waiting + b

res = long.with_columns([
    'eruptions (fitted)', fitted,
    'residuals', eruptions - fitted
])

res


eruptions  waiting  eruptions (fitted)  residuals

3.6 79 4.26605 -0.666047

3.333 74 4.1383 -0.805299

4.533 85 4.41934 0.113655

4.7 88 4.49599 0.204006

3.6 85 4.41934 -0.819345

4.35 85 4.41934 -0.069345

3.917 84 4.3938 -0.476795

4.2 78 4.2405 -0.0404978

4.7 83 4.36825 0.331754

4.8 84 4.3938 0.406205

... (165 rows omitted)

As with deviations from average, the positive and negative residuals exactly cancel each other out. So the average (and sum) of the residuals is 0. Because we found a and b by numerically minimizing the root mean squared error rather than computing them exactly, the sum of residuals is slightly different from zero in this case.

sum(res.column('residuals'))

-0.00037579999996140145
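The cancellation can be confirmed with an exact least-squares fit. The sketch below uses plain numpy and made-up data (not the geyser table); with np.polyfit, the residuals sum to zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(size=200)

a, b = np.polyfit(x, y, 1)          # exact least-squares slope and intercept
residuals = y - (a * x + b)
print(abs(residuals.sum()) < 1e-8)  # the residuals cancel almost exactly
```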

A residual plot is a scatter plot of the residuals versus the fitted values. The residual plot of a good regression looks like the one below: a formless cloud with no pattern, centered around the horizontal axis. It shows that there is no discernible non-linear pattern in the original scatter plot.

res.scatter(2, 3, fit_line=True)


By contrast, suppose we had attempted to fit a regression line to the entire set of eruptions of Old Faithful in the original dataset.

faithful.scatter(0, 1, fit_line=True)

This line does not pass through the average value of vertical strips. We may be able to see this fact from the scatter alone, but it is dramatically clear by visualizing the residual scatter.


def residual_plot(table, x, y):
    fitted = fit(table, x, y)
    residuals = table.column(y) - fitted
    res_table = Table().with_columns([
        'fitted', fitted,
        'residuals', residuals])
    res_table.scatter(0, 1, fit_line=True)

residual_plot(faithful, 1, 0)

The diagonal stripes show a non-linear pattern in the data. In this case, we would not expect the regression line to provide low-error estimates.

Example: SAT Scores¶

Residual plots can be useful for spotting non-linearity in the data, or other features that weaken the regression analysis. For example, consider the SAT data of the previous section, and suppose you try to estimate the Combined score based on Participation Rate.

sat = Table.read_table('sat2014.csv')

sat


State  Participation Rate  Critical Reading  Math  Writing  Combined

North Dakota  2.3  612  620  584  1816
Illinois  4.6  599  616  587  1802
Iowa  3.1  605  611  578  1794
South Dakota  2.9  604  609  579  1792
Minnesota  5.9  598  610  578  1786
Michigan  3.8  593  610  581  1784
Wisconsin  3.9  596  608  578  1782
Missouri  4.2  595  597  579  1771
Wyoming  3.3  590  599  573  1762
Kansas  5.3  591  596  566  1753

... (41 rows omitted)

sat.scatter('Participation Rate', 'Combined', s=8)

The relation between the variables is clearly non-linear, but you might be tempted to fit a straight line anyway, especially if you never looked at a scatter diagram of the data.


sat.scatter('Participation Rate', 'Combined', s=8, fit_line=True)

The points in the scatter plot start out above the regression line, then are consistently below the line, then above, then below. This pattern of non-linearity is more clearly visible in the residual plot.

residual_plot(sat, 'Participation Rate', 'Combined')

_ = plt.title('Residual plot of a bad regression')


This residual plot is not a formless cloud; it shows a non-linear pattern, and is a signal that linear regression should not have been used for these data.

The Size of Residuals

We will conclude by observing several relationships between quantities we have already identified. We'll illustrate the relationships using the long eruptions of Old Faithful, but they indeed hold for any collection of paired observations.

Fact 1: The root mean squared error of a line that has zero slope and an intercept at the average of y is the standard deviation of y.

mse_long(0, np.average(eruptions)) ** 0.5

0.40967604587437784

np.std(eruptions)

0.40967604587437784

By contrast, the standard deviation of fitted values is smaller.

eruptions_fitted = fit(long, 'waiting', 'eruptions')

np.std(eruptions_fitted)

0.15271994814407569

Fact 2: The ratio of the standard deviation of the fitted values to the standard deviation of the y values is |r|.

r = correlation(long, 'waiting', 'eruptions')

print('r:', r)

print('ratio:', np.std(eruptions_fitted) / np.std(eruptions))

r: 0.372782225571

ratio: 0.372782225571


Notice the absolute value of r in the formula above. For the heights of fathers and sons, the correlation is positive and so there is no difference between using r and using its absolute value. However, the result is true for variables that have negative correlation as well, provided we are careful to use the absolute value |r| instead of r.

The regression line does a better job of estimating eruption lengths than a zero-slope line. Thus, the rough size of the errors made using the regression line must be smaller than that using a flat line. In other words, the SD of the residuals must be smaller than the overall SD of y.

eruptions_residuals = eruptions - eruptions_fitted

np.std(eruptions_residuals)

0.38014612980028634

Fact 3: The SD of the residuals is sqrt(1 - r^2) times the SD of y.

np.std(eruptions) * (1 - r**2) ** 0.5

0.38014612980028639

Fact 4: The variance of the fitted values plus the variance of the residuals gives the variance of y.

np.std(eruptions_fitted)**2 + np.std(eruptions_residuals)**2

0.16783446256326531

np.std(eruptions)**2

0.16783446256326534
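The four facts above can be checked numerically on any collection of paired observations. Here is a minimal plain-Python sketch on a small made-up dataset (the lists x and y are hypothetical, and the helper functions simply mirror the standard definitions; they are not part of the text's toolkit):

```python
import math

# Helpers mirroring the standard (population) definitions.
def mean(a):
    return sum(a) / len(a)

def sd(a):
    m = mean(a)
    return math.sqrt(sum((v - m) ** 2 for v in a) / len(a))

def correlation(x, y):
    mx, my = mean(x), mean(y)
    return sum((u - mx) * (v - my) for u, v in zip(x, y)) / (len(x) * sd(x) * sd(y))

def fitted_values(x, y):
    # Least-squares regression line: slope = r * SD(y) / SD(x).
    r = correlation(x, y)
    slope = r * sd(y) / sd(x)
    intercept = mean(y) - slope * mean(x)
    return [slope * v + intercept for v in x]

# Made-up paired data, standing in for the Old Faithful columns.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

fitted = fitted_values(x, y)
residuals = [v - f for v, f in zip(y, fitted)]
r = correlation(x, y)

# Fact 1: RMSE of the flat line at the average of y equals SD(y).
rmse_flat = math.sqrt(mean([(v - mean(y)) ** 2 for v in y]))
print(abs(rmse_flat - sd(y)) < 1e-12)                                  # True
# Fact 2: SD(fitted) / SD(y) equals |r|.
print(abs(sd(fitted) / sd(y) - abs(r)) < 1e-12)                        # True
# Fact 3: SD(residuals) equals sqrt(1 - r^2) * SD(y).
print(abs(sd(residuals) - math.sqrt(1 - r ** 2) * sd(y)) < 1e-12)      # True
# Fact 4: Var(fitted) + Var(residuals) equals Var(y).
print(abs(sd(fitted) ** 2 + sd(residuals) ** 2 - sd(y) ** 2) < 1e-12)  # True
```

Any other paired dataset would do; the identities hold exactly (up to rounding) whenever the fitted values come from the least-squares line.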


Multiple Regression

Now that we have explored ways to use multiple features to predict a categorical variable, it is natural to study ways of using multiple predictor variables to predict a quantitative variable. A commonly used method to do this is called multiple regression.

We will start with an example to review some fundamental aspects of simple regression, that is, regression based on one predictor variable.

Example: Simple Regression

Suppose that our goal is to use regression to estimate the height of a basset hound based on its weight, using a sample that looks consistent with the regression model. Suppose the observed correlation r is 0.5, and that the summary statistics for the two variables are as in the table below:

          average     SD
height    14 inches   2 inches
weight    50 pounds   5 pounds

To calculate the equation of the regression line, we need the slope and the intercept.

slope = r × (SD of height) / (SD of weight) = 0.5 × 2/5 = 0.2 inches per pound

intercept = average height − slope × average weight = 14 − 0.2 × 50 = 4 inches

The equation of the regression line allows us to calculate the estimated height, in inches, based on a given weight in pounds:

estimated height = 0.2 × weight + 4

The slope of the line measures the increase in the estimated height per unit increase in weight. The slope is positive, and it is important to note that this does not mean that we think basset hounds get taller if they put on weight. The slope reflects the difference in the average heights of two groups of dogs that are 1 pound apart in weight. Specifically,


consider a group of dogs whose weight is w pounds, and the group whose weight is w + 1 pounds. The second group is estimated to be 0.2 inches taller, on average. This is true for all values of w in the sample.

In general, the slope of the regression line can be interpreted as the average increase in y per unit increase in x. Note that if the slope is negative, then for every unit increase in x, the average of y decreases.
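The arithmetic above is short enough to carry out directly. A minimal sketch, using only the summary statistics from the table (no actual basset hound data is involved; the function name estimated_height is just for illustration):

```python
# Slope and intercept from summary statistics alone.
r = 0.5
avg_height, sd_height = 14, 2   # inches
avg_weight, sd_weight = 50, 5   # pounds

slope = r * sd_height / sd_weight            # 0.2 inches per pound
intercept = avg_height - slope * avg_weight  # 4 inches

def estimated_height(weight):
    return slope * weight + intercept

# A dog at the average weight is estimated to have the average height.
print(estimated_height(50))   # 14.0
# Two groups one pound apart differ by the slope, about 0.2 inches.
print(estimated_height(51) - estimated_height(50))
```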

Multiple Predictors

In multiple regression, more than one predictor variable is used to estimate y. For example, a Dartmouth study of undergraduates collected information about their use of an online course forum and their GPA. We might want to predict a student's GPA based on the number of days that they visit the course forum and how many times they answer someone else's question. Then the multiple regression model would involve two slopes, an intercept, and random errors as before:

GPA = slope1 × (days online) + slope2 × (answers) + intercept + error

Our goal would be to find the estimated GPA using the best estimates of the two slopes and the intercept; as before, the "best" estimates are those that minimize the mean squared error of estimation.

To start off, we will investigate the dataset, which is described online. Each row represents a student. In addition to the student's GPA, the row tallies

days online: The number of days on which the student viewed the online forum
views: The number of posts viewed
contributions: The number of contributions, including posts and follow-up discussions
questions: The number of questions posted
notes: The number of notes posted
answers: The number of answers posted

grades = Table.read_table('grades_and_piazza.csv')

grades


GPA     days online   views   contributions   questions   notes   answers
2.863   29            299     5               1           1       0
3.505   57            299     0               0           0       0
3.029   27            101     1               1           0       0
3.679   67            301     1               0           0       0
3.474   43            201     12              1           0       0
3.705   67            308     45              22          0       5
3.806   36            171     20              4           3       4
3.667   82            300     26              11          0       3
3.245   44            127     6               1           1       0
3.293   35            259     16              13          1       0
... (20 rows omitted)

Correlation Matrix

Perhaps we wish to predict GPA based on forum usage. A natural first step is to see which variables are correlated with the GPA. Here is the correlation matrix. The to_df method generates a Pandas dataframe containing the same data as the table, and its corr method generates a matrix of correlations for each pair of columns.

All forum usage variables are correlated with GPA, but the notes correlation has a very small magnitude.

grades.to_df().corr()

                 GPA        days online   views      contributions   questions
GPA               1.000000   0.684905      0.444175   0.427897        0.409212
days online       0.684905   1.000000      0.654557   0.448319        0.435269
views             0.444175   0.654557      1.000000   0.426406        0.361002
contributions     0.427897   0.448319      0.426406   1.000000        0.857981
questions         0.409212   0.435269      0.361002   0.857981        1.000000
notes            -0.160604  -0.230839      0.065627   0.295661       -0.006365
answers           0.440382   0.502810      0.365010   0.702679        0.515661

(remaining columns not shown)


To start off, let us perform the simple regression of GPA on just days online, the count with the strongest correlation with GPA, at an r of 0.68. Here is the scatter diagram and regression line.

grades.scatter('days online', 'GPA', fit_line=True)

We can immediately see that the relation is not entirely linear, and a residual plot highlights this fact. One issue is that the GPA is never above 4.0, so the model assumption that y values are bell shaped does not hold in this case.

residual_plot(grades, 'days online', 'GPA')

Nonetheless, the regression line does capture some of the pattern in the data, and so we will continue to work with it, noting that our predictions may not be entirely justified by the regression model.


The root mean squared error is a bit above 1/4 of a GPA point when predicting GPA from only the days online.

grades_days_mse = mean_squared_error(grades, 'days online', 'GPA')

a, b = minimize(grades_days_mse)

grades_days_mse(a, b) ** 0.5

0.28494512878662448

Two Predictors

Perhaps more information can improve our prediction. The number of answers is also correlated with GPA. When using it alone as a predictor, the root mean squared error is greater because the correlation has lower magnitude than that of days online.

grades_answers_mse = mean_squared_error(grades, 'answers', 'GPA')

a, b = minimize(grades_answers_mse)

grades_answers_mse(a, b) ** 0.5

0.35110518920562062

However, the two variables can be used together. First, we define the mean squared error in terms of a line that has two slopes, one for each variable, as well as an intercept.

def grades_days_answers_mse(a_days, a_answers, b):
    fitted = a_days * grades.column('days online') + \
             a_answers * grades.column('answers') + \
             b
    y = grades.column('GPA')
    return np.average((y - fitted) ** 2)

minimize(grades_days_answers_mse)

array([ 0.0157683,  0.0301254,  2.6031789])


The minimize function returns three values: the slope for days online, the slope for answers, and the intercept. Together, these three values describe a linear prediction function. For example, someone who spent 50 days online and answered 5 questions is predicted to have a GPA of about 3.54.

np.round(0.0157683 * 50 + 0.0301254 * 5 + 2.6031789, 2)

3.54

The root mean squared error of this predictor is a bit smaller than that of either single-variable predictor we had before!

a_days, a_answers, b = minimize(grades_days_answers_mse)

grades_days_answers_mse(a_days, a_answers, b) ** 0.5

0.28161529539241298
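The minimize function in the text does a numerical search, but for mean squared error the minimizing slopes and intercept can also be found exactly from the normal equations, a small linear system. Below is a plain-Python sketch for the two-predictor case on made-up data; the names x1, x2, solve3, and fit_two_predictors are hypothetical illustrations, not part of the text's toolkit.

```python
def solve3(A, v):
    # Gaussian elimination with partial pivoting for a 3x3 system A x = v.
    A = [row[:] for row in A]
    v = v[:]
    n = 3
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (v[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def fit_two_predictors(x1, x2, y):
    # Normal equations: one row per column of the design "matrix"
    # (the two predictors plus a column of ones for the intercept).
    n = len(y)
    cols = [x1, x2, [1.0] * n]
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    v = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve3(A, v)   # [slope for x1, slope for x2, intercept]

# Made-up data with a known linear relationship y = 2*x1 + 3*x2 + 1.
x1 = [1, 2, 3, 4, 5]
x2 = [5, 3, 8, 1, 2]
y = [2 * a + 3 * b + 1 for a, b in zip(x1, x2)]
print(fit_two_predictors(x1, x2, y))   # approximately [2.0, 3.0, 1.0]
```

A numerical minimizer and the normal equations agree on this problem because the mean squared error is a smooth bowl-shaped function of the slopes and intercept, with a single minimum.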

Combining all variables

We can combine all information about the course forum in order to attempt to find an even better predictor.


def fit_grades(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b):
    return a_days * grades.column('days online') + \
           a_views * grades.column('views') + \
           a_contributions * grades.column('contributions') + \
           a_questions * grades.column('questions') + \
           a_notes * grades.column('notes') + \
           a_answers * grades.column('answers') + \
           b

def grades_all_mse(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b):
    fitted = fit_grades(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b)
    y = grades.column('GPA')
    return np.average((y - fitted) ** 2)

minimize(grades_all_mse)

array([  1.43703000e-02,  -8.15000000e-05,   6.50720000e-03,
        -1.72780000e-03,  -4.04014000e-02,   1.59645000e-02,
         2.65375470e+00])

Using all of this information, we have constructed a predictor with an even smaller root mean squared error. The * below passes all of the return values of the minimize function as arguments to grades_all_mse.

grades_all_mse(*minimize(grades_all_mse)) ** 0.5

0.27783237243108722

All of this information did not improve our predictor very much. The root mean squared error decreased only from about 0.285 for the best single-variable predictor to about 0.278 when using all of the available information.

Definition of R^2, consistent with our old r

When we studied simple regression, we had noted that

(SD of fitted values) / (SD of y) = |r|


Let us use our old functions to compute the fitted values and confirm that this is true for our example:

fitted = fit(grades, 'days online', 'GPA')

np.std(fitted) / np.std(grades.column('GPA'))

0.68490480299828194

Because variance is the square of the standard deviation, we can say that

(variance of fitted values) / (variance of y) = r^2

Notice that this way of thinking about r^2 involves only the estimated values and the observed values, not the number of predictor variables. Therefore, it motivates the definition of multiple R^2:

R^2 = (variance of fitted values) / (variance of y)

It is a fact of mathematics that this quantity is always between 0 and 1. With multiple predictor variables, there is no clear interpretation of a sign attached to the square root of R^2. Some of the predictors might be positively associated with y, others negatively. An overall measure of the fit is provided by R^2.

fitted = fit_grades(*minimize(grades_all_mse))

np.var(fitted) / np.var(grades.column('GPA'))

0.49525855565521787

It is not always wise to maximize R^2 by including many variables, but we will address this concern later in the text.

We can also view the residual plot of multiple regression, just as before. The horizontal axis shows fitted values and the vertical axis shows errors. Again, there is some structure to this plot, so perhaps the relation between these variables is not entirely linear. In this case, the structure is minor enough that the regression model is a reasonable choice of predictor.
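For the one-predictor case, the identity R^2 = r^2 can be checked directly in plain Python on made-up data (the helper functions below simply mirror the standard definitions and are not part of the text's toolkit):

```python
import math

def mean(a):
    return sum(a) / len(a)

def variance(a):
    m = mean(a)
    return sum((v - m) ** 2 for v in a) / len(a)

def correlation(x, y):
    mx, my = mean(x), mean(y)
    return sum((u - mx) * (v - my) for u, v in zip(x, y)) / (
        len(x) * math.sqrt(variance(x) * variance(y)))

# Made-up paired data and its least-squares fitted values.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]
r = correlation(x, y)
slope = r * math.sqrt(variance(y) / variance(x))
intercept = mean(y) - slope * mean(x)
fitted = [slope * v + intercept for v in x]

r_squared = variance(fitted) / variance(y)
print(abs(r_squared - r ** 2) < 1e-12)   # True: for one predictor, R^2 is just r^2
```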


Table().with_columns([
    'fitted', fitted,
    'errors', grades.column('GPA') - fitted
]).scatter(0, 1, fit_line=True)


Classification

Section author: David Wagner

This section will discuss machine learning. Machine learning is a class of techniques for automatically finding patterns in data and using them to draw inferences or make predictions. We're going to focus on a particular kind of machine learning, namely, classification.

Classification is about learning how to make predictions from past examples: we're given some examples where we have been told what the correct prediction was, and we want to learn from those examples how to make good predictions in the future. Here are a few applications where classification is used in practice:

For each order Amazon receives, Amazon would like to predict: is this order fraudulent? They have some information about each order (e.g., its total value, whether the order is being shipped to an address this customer has used before, whether the shipping address is the same as the credit card holder's billing address). They have lots of data on past orders, and they know which of those past orders were fraudulent and which weren't. They want to learn patterns that will help them predict, as new orders arrive, whether those new orders are fraudulent.

Online dating sites would like to predict: are these two people compatible? Will they hit it off? They have lots of data on which matches they've suggested to their customers in the past, and they have some idea which ones were successful. As new customers sign up, they'd like to make predictions about who might be a good match for them.

Doctors would like to know: does this patient have cancer? Based on the measurements from some lab test, they'd like to be able to predict whether the particular patient has cancer. They have lots of data on past patients, including their lab measurements and whether they ultimately developed cancer, and from that, they'd like to try to infer what measurements tend to be characteristic of cancer (or non-cancer) so they can diagnose future patients accurately.

Politicians would like to predict: are you going to vote for them? This will help them focus fundraising efforts on people who are likely to support them, and focus get-out-the-vote efforts on voters who will vote for them. Public databases and commercial databases have a lot of information about most people: e.g., whether they own a home or rent; whether they live in a rich neighborhood or poor neighborhood; their interests and hobbies; their shopping habits; and so on. And political campaigns have surveyed


some voters and found out who they plan to vote for, so they have some examples where the correct answer is known. From this data, the campaigns would like to find patterns that will help them make predictions about all other potential voters.

All of these are classification tasks. Notice that in each of these examples, the prediction is a yes/no question -- we call this binary classification, because there are only two possible predictions. In a classification task, we have a bunch of observations. Each observation represents a single individual or a single situation where we'd like to make a prediction. Each observation has multiple attributes, which are known (e.g., the total value of the order; the voter's annual salary; and so on). Also, each observation has a class, which is the answer to the question we care about (e.g., yes or no; fraudulent or not; etc.).

For instance, with the Amazon example, each order corresponds to a single observation. Each observation has several attributes (e.g., the total value of the order, whether the order is being shipped to an address this customer has used before, and so on). The class of the observation is either 0 or 1, where 0 means that the order is not fraudulent and 1 means that the order is fraudulent. Given the attributes of some new order, we are trying to predict its class.

Classification requires data. It involves looking for patterns, and to find patterns, you need data. That's where the data science comes in. In particular, we're going to assume that we have access to training data: a bunch of observations, where we know the class of each observation. The collection of these pre-classified observations is also called a training set. A classification algorithm is going to analyze the training set, and then come up with a classifier: an algorithm for predicting the class of future observations.

Note that classifiers do not need to be perfect to be useful. They can be useful even if their accuracy is less than 100%. For instance, if the online dating site occasionally makes a bad recommendation, that's OK; their customers already expect to have to meet many people before they'll find someone they hit it off with. Of course, you don't want the classifier to make too many errors -- but it doesn't have to get the right answer every single time.

Chronic kidney disease

Let's work through an example. We're going to work with a dataset that was collected to help doctors diagnose chronic kidney disease (CKD). Each row in the dataset represents a single patient who was treated in the past and whose diagnosis is known. For each patient, we have a bunch of measurements from a blood test. We'd like to find which measurements are most useful for diagnosing CKD, and develop a way to classify future patients as "has CKD" or "doesn't have CKD" based on their blood test results.

Let's load the dataset into a table and look at it.


ckd = Table.read_table('ckd.csv').relabeled('Blood Glucose Random', 'Glucose')

ckd

Age   Blood Pressure   Specific Gravity   Albumin   Sugar   Red Blood Cells   Pus Cell   Pus Cell clumps
48    70               1.005              4         0       normal            abnormal   present
53    90               1.02               2         0       abnormal          abnormal   present
63    70               1.01               3         0       abnormal          abnormal   present
68    80               1.01               3         2       normal            abnormal   present
61    80               1.015              2         0       abnormal          abnormal   not present
48    80               1.025              4         0       normal            abnormal   not present
69    70               1.01               3         4       normal            abnormal   not present
73    70               1.005              0         0       normal            normal     not present
73    80               1.02               2         0       abnormal          abnormal   not present
46    60               1.01               1         0       normal            normal     not present
... (148 rows omitted, remaining columns not shown)

We have data on 158 patients. There are an awful lot of attributes here. The column labelled "Class" indicates whether the patient was diagnosed with CKD: 1 means they have CKD, 0 means they do not have CKD.

Let's look at two columns in particular: the hemoglobin level (in the patient's blood), and the blood glucose level (at a random time in the day; without fasting specially for the blood test). We'll draw a scatterplot, to make it easy to visualize this. Red dots are patients with CKD; blue dots are patients without CKD. What test results seem to indicate CKD?

ckd.scatter('Hemoglobin', 'Glucose', c=ckd.column('Class'))


Suppose Alice is a new patient who is not in the dataset. If I tell you Alice's hemoglobin level and blood glucose level, could you predict whether she has CKD? It sure looks like it! You can see a very clear pattern here: points in the lower-right tend to represent people who don't have CKD, and the rest tend to be folks with CKD. To a human, the pattern is obvious. But how can we program a computer to automatically detect patterns such as this one?

Well, there are lots of kinds of patterns one might look for, and lots of algorithms for classification. But I'm going to tell you about one that turns out to be surprisingly effective. It is called nearest neighbor classification. Here's the idea. If we have Alice's hemoglobin and glucose numbers, we can put her somewhere on this scatterplot; the hemoglobin is her x-coordinate, and the glucose is her y-coordinate. Now, to predict whether she has CKD or not, we find the nearest point in the scatterplot and check whether it is red or blue; we predict that Alice should receive the same diagnosis as that patient.

In other words, to classify Alice as CKD or not, we find the patient in the training set who is "nearest" to Alice, and then use that patient's diagnosis as our prediction for Alice. The intuition is that if two points are near each other in the scatterplot, then the corresponding measurements are pretty similar, so we might expect them to receive the same diagnosis (more likely than not). We don't know Alice's diagnosis, but we do know the diagnosis of all the patients in the training set, so we find the patient in the training set who is most similar to Alice, and use that patient's diagnosis to predict Alice's diagnosis.

The scatterplot suggests that this nearest neighbor classifier should be pretty accurate. Points in the lower-right will tend to receive a "no CKD" diagnosis, as their nearest neighbor will be a blue point. The rest of the points will tend to receive a "CKD" diagnosis, as their nearest neighbor will be a red point. So the nearest neighbor strategy seems to capture our intuition pretty well, for this example.


However, the separation between the two classes won't always be quite so clean. For instance, suppose that instead of hemoglobin levels we were to look at white blood cell count. Look at what happens:

ckd.scatter('White Blood Cell Count', 'Glucose', c=ckd.column('Class'))

As you can see, non-CKD individuals are all clustered in the lower-left. Most of the patients with CKD are above or to the right of that cluster... but not all. There are some patients with CKD who are in the lower left of the above figure (as indicated by the handful of red dots scattered among the blue cluster). What this means is that you can't tell for certain whether someone has CKD from just these two blood test measurements.

If we are given Alice's glucose level and white blood cell count, can we predict whether she has CKD? Yes, we can make a prediction, but we shouldn't expect it to be 100% accurate. Intuitively, it seems like there's a natural strategy for predicting: plot where Alice lands in the scatterplot; if she is in the lower-left, predict that she doesn't have CKD, otherwise predict she has CKD. This isn't perfect -- our predictions will sometimes be wrong. (Take a minute and think it through: for which patients will it make a mistake?) As the scatterplot above indicates, sometimes people with CKD have glucose and white blood cell levels that look identical to those of someone without CKD, so any classifier is inevitably going to make the wrong prediction for them.

Can we automate this on a computer? Well, the nearest neighbor classifier would be a reasonable choice here too. Take a minute and think it through: how will its predictions compare to those from the intuitive strategy above? When will they differ? Its predictions will be pretty similar to our intuitive strategy, but occasionally it will make a different prediction. In


particular, if Alice's blood test results happen to put her right near one of the red dots in the lower-left, the intuitive strategy would predict "not CKD", whereas the nearest neighbor classifier will predict "CKD".

There is a simple generalization of the nearest neighbor classifier that fixes this anomaly. It is called the k-nearest neighbor classifier. To predict Alice's diagnosis, rather than looking at just the one neighbor closest to her, we can look at the 3 points that are closest to her, and use the diagnosis for each of those 3 points to predict Alice's diagnosis. In particular, we'll use the majority value among those 3 diagnoses as our prediction for Alice's diagnosis. Of course, there's nothing special about the number 3: we could use 4, or 5, or more. (It's often convenient to pick an odd number, so that we don't have to deal with ties.) In general, we pick a number k, and our predicted diagnosis for Alice is based on the k patients in the training set who are closest to Alice. Intuitively, these are the k patients whose blood test results were most similar to Alice's, so it seems reasonable to use their diagnoses to predict Alice's diagnosis.

The k-nearest neighbor classifier will now behave just like our intuitive strategy above.
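The majority-vote step described above is tiny on its own. A minimal sketch, assuming the classes of the k nearest neighbors have already been collected into a list (Counter comes from the standard library; nothing here is specific to the CKD data):

```python
from collections import Counter

def majority_class(neighbor_classes):
    # Return the most common class among the neighbors' classes.
    return Counter(neighbor_classes).most_common(1)[0][0]

# With k = 3 and neighbor classes 1, 0, 1, the vote is 2-to-1 for class 1.
print(majority_class([1, 0, 1]))         # 1
print(majority_class([0, 0, 1, 0, 1]))   # 0
```

Picking an odd k, as the text suggests, guarantees one class always wins the vote in a two-class problem.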

Decision boundary

Sometimes a helpful way to visualize a classifier is to map the region of space where the classifier would predict 'CKD', and the region of space where it would predict 'not CKD'. We end up with some boundary between the two, where points on one side of the boundary will be classified 'CKD' and points on the other side will be classified 'not CKD'. This boundary is called the decision boundary. Each different classifier will have a different decision boundary; the decision boundary is just a way to visualize what criteria the classifier is using to classify points.

Banknote authentication

Let's do another example. This time we'll look at predicting whether a banknote (e.g., a $20 bill) is counterfeit or legitimate. Researchers have put together a dataset for us, based on photographs of many individual banknotes: some counterfeit, some legitimate. They computed a few numbers from each image, using techniques that we won't worry about for this course. So, for each banknote, we know a few numbers that were computed from a photograph of it as well as its class (whether it is counterfeit or not). Let's load it into a table and take a look.

banknotes = Table.read_table('banknote.csv')

banknotes


WaveletVar   WaveletSkew   WaveletCurt   Entropy    Class
3.6216       8.6661        -2.8073       -0.44699   0
4.5459       8.1674        -2.4586       -1.4621    0
3.866        -2.6383       1.9242        0.10645    0
3.4566       9.5228        -4.0112       -3.5944    0
0.32924      -4.4552       4.5718        -0.9888    0
4.3684       9.6718        -3.9606       -3.1625    0
3.5912       3.0129        0.72888       0.56421    0
2.0922       -6.81         8.4636        -0.60216   0
3.2032       5.7588        -0.75345      -0.61251   0
1.5356       9.1772        -2.2718       -0.73535   0
... (1362 rows omitted)

Let's look at whether the first two numbers tell us anything about whether the banknote is counterfeit or not. Here's a scatterplot:

banknotes.scatter('WaveletVar', 'WaveletCurt', c=banknotes.column('Class'))

Pretty interesting! Those two measurements do seem helpful for predicting whether the banknote is counterfeit or not. However, in this example you can now see that there is some overlap between the blue cluster and the red cluster. This indicates that there will be some


images where it's hard to tell whether the banknote is legitimate based on just these two numbers. Still, you could use a k-nearest neighbor classifier to predict the legitimacy of a banknote.

Take a minute and think it through: suppose we used a k-nearest neighbor classifier, for some small k. What parts of the plot would the classifier get right, and what parts would it make errors on? What would the decision boundary look like?

The patterns that show up in the data can get pretty wild. For instance, here's what we'd get if we used a different pair of measurements from the images:

banknotes.scatter('WaveletSkew', 'Entropy', c=banknotes.column('Class'))

There does seem to be a pattern, but it's a pretty complex one. Nonetheless, the k-nearest neighbors classifier can still be used and will effectively "discover" patterns out of this. This illustrates how powerful machine learning can be: it can effectively take advantage of patterns that we would not have anticipated, or that we would not have thought to "program into" the computer.

Multiple attributes

So far I've been assuming that we have exactly 2 attributes that we can use to help us make our prediction. What if we have more than 2? For instance, what if we have 3 attributes?


Here's the cool part: you can use the same ideas for this case, too. All you have to do is make a 3-dimensional scatterplot, instead of a 2-dimensional plot. You can still use the k-nearest neighbors classifier, but now computing distances in 3 dimensions instead of just 2. It just works. Very cool!

In fact, there's nothing special about 2 or 3. If you have 4 attributes, you can use the k-nearest neighbors classifier in 4 dimensions. 5 attributes? Work in 5-dimensional space. And no need to stop there! This all works for arbitrarily many attributes; you just work in a very high dimensional space. It gets wicked-impossible to visualize, but that's OK. The computer algorithm generalizes very nicely: all you need is the ability to compute the distance, and that's not hard. Mind-blowing stuff!

For instance, let's see what happens if we try to predict whether a banknote is counterfeit or not using 3 of the measurements, instead of just 2. Here's what you get:

ax = plt.figure(figsize=(8, 8)).add_subplot(111, projection='3d')

ax.scatter(banknotes.column('WaveletSkew'),
           banknotes.column('WaveletVar'),
           banknotes.column('WaveletCurt'),
           c=banknotes.column('Class'))

<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x1093bdef0>


Awesome! With just 2 attributes, there was some overlap between the two clusters (which means that the classifier was bound to make some mistakes for points in the overlap). But when we use these 3 attributes, the two clusters have almost no overlap. In other words, a classifier that uses these 3 attributes will be more accurate than one that only uses the 2 attributes.

This is a general phenomenon in classification. Each attribute can potentially give you new information, so more attributes sometimes help you build a better classifier. Of course, the cost is that now we have to gather more information to measure the value of each attribute, but this cost may be well worth it if it significantly improves the accuracy of our classifier.

To sum up: you now know how to use k-nearest neighbor classification to predict the answer to a yes/no question, based on the values of some attributes, assuming you have a training set with examples where the correct prediction is known. The general roadmap is this:

1. identify some attributes that you think might help you predict the answer to the question;
2. gather a training set of examples where you know the values of the attributes as well as the correct prediction;
3. to make predictions in the future, measure the value of the attributes and then use k-nearest neighbor classification to predict the answer to the question.


Breast cancer diagnosis

Now I want to do a more extended example based on diagnosing breast cancer. I was inspired by Brittany Wenger, who won the Google national science fair three years ago as a 17-year-old high school student. Here's Brittany:

Brittany's science fair project was to build a classification algorithm to diagnose breast cancer. She won grand prize for building an algorithm whose accuracy was almost 99%.

Let's see how well we can do, with the ideas we've learned in this course.

So, let me tell you a little bit about the dataset. Basically, if a woman has a lump in her breast, the doctors may want to take a biopsy to see if it is cancerous. There are several different procedures for doing that. Brittany focused on fine needle aspiration (FNA), because it is less invasive than the alternatives. The doctor gets a sample of the mass, puts it under a microscope, takes a picture, and a trained lab tech analyzes the picture to determine whether it is cancer or not. We get a picture like one of the following:

Unfortunately, distinguishing between benign vs malignant can be tricky. So, researchers have studied using machine learning to help with this task. The idea is that we'll ask the lab tech to analyze the image and compute various attributes: things like the typical size of a cell, how much variation there is among the cell sizes, and so on. Then, we'll try to use this information to predict (classify) whether the sample is malignant or not. We have a training set of past samples from women where the correct diagnosis is known, and we'll hope that our machine learning algorithm can use those to learn how to predict the diagnosis for future samples.

We end up with the following dataset. For the "Class" column, 1 means malignant (cancer); 0 means benign (not cancer).

patients = Table.read_table('breast-cancer.csv').drop('ID')

patients


Clump Thickness   Uniformity of Cell Size   Uniformity of Cell Shape   Marginal Adhesion   Single Epithelial Cell Size   Bare Nuclei   Bland Chromatin
5                 1                         1                          1                   2                             1             3
5                 4                         4                          5                   7                             10            3
3                 1                         1                          1                   2                             2             3
6                 8                         8                          1                   3                             4             3
4                 1                         1                          3                   2                             1             3
8                 10                        10                         8                   7                             10            9
1                 1                         1                          1                   2                             10            3
2                 1                         2                          1                   2                             1             3
2                 1                         1                          1                   2                             1             1
4                 2                         1                          1                   2                             1             2
... (673 rows omitted, remaining columns not shown)

So we have 9 different attributes. I don't know how to make a 9-dimensional scatterplot of all of them, so I'm going to pick two and plot them:

patients.scatter('Bland Chromatin', 'Single Epithelial Cell Size', c=patients.column('Class'))


Oops. That plot is utterly misleading, because there are a bunch of points that have identical values for both the x- and y-coordinates. To make it easier to see all the data points, I'm going to add a little bit of random jitter to the x- and y-values. Here's how that looks:

def randomize_column(a):
    return a + np.random.normal(0.0, 0.09, size=len(a))

Table().with_columns([
    'Bland Chromatin (jittered)',
    randomize_column(patients.column('Bland Chromatin')),
    'Single Epithelial Cell Size (jittered)',
    randomize_column(patients.column('Single Epithelial Cell Size'))
]).scatter(0, 1, c=patients.column('Class'))

For instance, you can see there are lots of samples with chromatin = 2 and epithelial cell size = 2; all non-cancerous.

Keep in mind that the jittering is just for visualization purposes, to make it easier to get a feeling for the data. When we want to work with the data, we'll use the original (unjittered) data.

Applying the k-nearest neighbor classifier to breast cancer diagnosis

We've got a dataset. Let's try out the k-nearest neighbor classifier and see how it does. This is going to be great.


We're going to need an implementation of the k-nearest neighbor classifier. In practice you would probably use an existing library, but it's simple enough that I'm going to implement it myself.

The first thing we need is a way to compute the distance between two points. How do we do this? In 2-dimensional space, it's pretty easy. If we have a point at coordinates (x0, y0) and another at (x1, y1), the distance between them is

distance = sqrt((x0 - x1)^2 + (y0 - y1)^2)

(Where did this come from? It comes from the Pythagorean theorem: we have a right triangle with side lengths x0 - x1 and y0 - y1, and we want to find the length of the diagonal.)

In 3-dimensional space, the formula is

distance = sqrt((x0 - x1)^2 + (y0 - y1)^2 + (z0 - z1)^2)

In n-dimensional space, things are a bit harder to visualize, but I think you can see how the formula generalizes: we sum up the squares of the differences between each individual coordinate, and then take the square root of that. Let's implement a function to compute this distance function for us:

def distance(pt1, pt2):
    total = 0
    for i in np.arange(len(pt1)):
        total = total + (pt1.item(i) - pt2.item(i)) ** 2
    return math.sqrt(total)
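The same formula can be sanity-checked on plain tuples. This variant takes any two equal-length sequences of numbers rather than table rows (the name distance_seq is just for illustration), but computes the identical quantity:

```python
import math

def distance_seq(pt1, pt2):
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pt1, pt2)))

print(distance_seq((0, 0), (3, 4)))        # 5.0 -- the classic 3-4-5 right triangle
print(distance_seq((1, 1, 1), (2, 2, 2)))  # sqrt(3), about 1.732
```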

Next, we're going to write some code to implement the classifier. The input is a patient p who we want to diagnose. The classifier works by finding the k nearest neighbors of p from the training set. So, our approach will go like this:

1. Find the closest k neighbors of p, i.e., the k patients from the training set that are most similar to p.
2. Look at the diagnoses of those k neighbors, and take the majority vote to find the most-common diagnosis. Use that as our predicted diagnosis for p.

So that will guide the structure of our Python code.


To implement the first step, we will compute the distance from each patient in the training set to p, sort them by distance, and take the k closest patients in the training set. The code will make a copy of the table, compute the distance from each patient to p, add a new column to the table with those distances, and then sort the table by distance and take the first k rows. That leads to the following Python code:

def closest(training, p, k):
    ...

def majority(topkclasses):
    ...

def classify(training, p, k):
    kclosest = closest(training, p, k)
    topkclasses = kclosest.select('Class')
    return majority(topkclasses)


def computetablewithdists(training, p):
    dists = np.zeros(training.num_rows)
    attributes = training.drop('Class')
    for i in np.arange(training.num_rows):
        dists[i] = distance(attributes.row(i), p)
    return training.with_column('Distance', dists)

def closest(training, p, k):
    withdists = computetablewithdists(training, p)
    sortedbydist = withdists.sort('Distance')
    topk = sortedbydist.take(np.arange(k))
    return topk

def majority(topkclasses):
    if topkclasses.where('Class', 1).num_rows > topkclasses.where('Class', 0).num_rows:
        return 1
    else:
        return 0

def classify(training, p, k):
    closestk = closest(training, p, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)

Let's see how this works, with our data set. We'll take patient 12 and imagine we're going to try to diagnose them:

patients.take(12)

Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin
5 | 3 | 3 | 3 | 2 | 3 | 4

We can pull out just their attributes (excluding the class), like this:

patients.drop('Class').row(12)


Row(Clump Thickness=5, Uniformity of Cell Size=3, Uniformity of Cell Shape=3, Marginal Adhesion=3, Single Epithelial Cell Size=2, Bare Nuclei=3, Bland Chromatin=4, Normal Nucleoli=4, Mitoses=1)

Let's take $k = 5$. We can find the 5 nearest neighbors:

closest(patients, patients.drop('Class').row(12), 5)

Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin
5 | 3 | 3 | 3 | 2 | 3 | 4
5 | 3 | 3 | 4 | 2 | 4 | 3
5 | 1 | 3 | 3 | 2 | 2 | 2
5 | 2 | 2 | 2 | 2 | 2 | 3
5 | 3 | 3 | 1 | 3 | 3 | 3

3 out of the 5 nearest neighbors have class 1, so the majority is 1 (has cancer) -- and that is the output of our classifier for this patient:

classify(patients, patients.drop('Class').row(12), 5)

1

Awesome! We now have a classification algorithm for diagnosing whether a patient has breast cancer or not, based on the measurements from the lab. Are we done? Shall we give this to doctors to use?

Hold on: we're not done yet. There's an obvious question to answer before we start using this in practice:

How accurate is this method at diagnosing breast cancer?

And that raises a more fundamental issue. How can we measure the accuracy of a classification algorithm?

Measuring accuracy of a classifier

We've got a classifier, and we'd like to determine how accurate it will be. How can we measure that?


Try it out. One natural idea is to just try it on patients for a year, keep records on it, and see how accurate it is. However, this has some disadvantages: (a) we're trying something on patients without knowing how accurate it is, which might be unethical; (b) we have to wait a year to find out whether our classifier is any good. If it's not good enough and we get an idea for an improvement, we'll have to wait another year to find out whether our improvement was better.

Get some more data. We could try to get some more data from other patients whose diagnosis is known, and measure how accurate our classifier's predictions are on those additional patients. We can compare what the classifier outputs against what we know to be true.

Use the data we already have. Another natural idea is to re-use the data we already have: we have a training set that we used to train our classifier, so we could just run our classifier on every patient in the data set and compare what it outputs to what we know to be true. This is sometimes known as testing the classifier on your training set.

How should we choose among these options? Are they all equally good?

It turns out that the third option, testing the classifier on our training set, is fundamentally flawed. It might sound attractive, but it gives misleading results: it will over-estimate the accuracy of the classifier (it will make us think the classifier is more accurate than it really is). Intuitively, the problem is that what we really want to know is how well the classifier has done at "generalizing" beyond the specific examples in the training set; but if we test it on patients from the training set, then we haven't learned anything about how well it would generalize to other patients.

This is subtle, so it might be helpful to try an example. Let's try a thought experiment. Let's focus on the 1-nearest neighbor classifier ($k = 1$). Suppose you trained the 1-nearest neighbor classifier on data from all 683 patients in the data set, and then you tested it on those same 683 patients. How many would it get right? Think it through and see if you can work out what will happen. That's right! The classifier will get the right answer for all 683 patients. Suppose we apply the classifier to a patient from the training set, say Alice. The classifier will look for the nearest neighbor (the most similar patient from the training set), and the nearest neighbor will turn out to be Alice herself (the distance from any point to itself is zero). Thus, the classifier will produce the right diagnosis for Alice. The same reasoning applies to every other patient in the training set.

So, if we test the 1-nearest neighbor classifier on the training set, the accuracy will always be 100%: absolutely perfect. This is true no matter whether there are actually any patterns in the data. But the 100% is a total lie. When you apply the classifier to other patients who were not in the training set, the accuracy could be far worse.
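To see this concretely, here is a minimal, self-contained sketch (a toy 1-nearest-neighbor classifier on made-up 2-D points, not the patients table from this chapter) showing that testing on the training set reports 100% accuracy even when the labels are pure random noise:

```python
import math
import random

def nn1_predict(train_pts, train_labels, p):
    # 1-nearest neighbor: return the label of the closest training point.
    dists = [math.dist(p, q) for q in train_pts]
    return train_labels[dists.index(min(dists))]

random.seed(0)
# Random points with completely random labels: there is no pattern to learn.
pts = [(random.random(), random.random()) for _ in range(50)]
labels = [random.randint(0, 1) for _ in range(50)]

# "Test on the training set": every point's nearest neighbor is itself,
# at distance zero, so every prediction is trivially correct.
correct = sum(nn1_predict(pts, labels, p) == y for p, y in zip(pts, labels))
print(correct / len(pts))  # 1.0, no matter what the labels are
```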


In other words, testing on the training set tells you nothing about how accurate the 1-nearest neighbor classifier will be. This illustrates why testing on the training set is so flawed. This flaw is pretty blatant when you use the 1-nearest neighbor classifier, but don't think that with some other classifier you'd be immune to this problem -- the problem is fundamental and applies no matter what classifier you use. Testing on the training set gives you a biased estimate of the classifier's accuracy. For these reasons, you should never test on the training set.

So what should you do, instead? Is there a more principled approach?

It turns out there is. The approach comes down to: get more data. More specifically, the right solution is to use one data set for training, and a different data set for testing, with no overlap between the two data sets. We call these a training set and a test set.

Where do we get these two data sets from? Typically, we'll start out with some data, e.g., the data set on 683 patients, and before we do anything else with it, we'll split it up into a training set and a test set. We might put 50% of the data into the training set and the other 50% into the test set. Basically, we are setting aside some data for later use, so we can use it to measure the accuracy of our classifier. Sometimes people will call the data that you set aside for testing a hold-out set, and they'll call this strategy for estimating accuracy the hold-out method.

Note that this approach requires great discipline. Before you start applying machine learning methods, you have to take some of your data and set it aside for testing. You must avoid using the test set for developing your classifier: you shouldn't use it to help train your classifier or tweak its settings or for brainstorming ways to improve your classifier. Instead, you should use it only once, at the very end, after you've finalized your classifier, when you want an unbiased estimate of its accuracy.

The effectiveness of our classifier, for breast cancer

OK, so let's apply the hold-out method to evaluate the effectiveness of the $k$-nearest neighbor classifier for breast cancer diagnosis. The data set has 683 patients, so we'll randomly permute the data set and put 342 of them in the training set and the remaining 341 in the test set.

patients = patients.sample(683)  # Randomly permute the rows
trainset = patients.take(range(342))
testset = patients.take(range(342, 683))


We'll train the classifier using the 342 patients in the training set, and evaluate how well it performs on the test set. To make our lives easier, we'll write a function to evaluate a classifier on every patient in the test set:

def evaluate_accuracy(training, test, k):
    testattrs = test.drop('Class')
    numcorrect = 0
    for i in range(test.num_rows):
        # Run the classifier on the ith patient in the test set
        c = classify(training, testattrs.rows[i], k)
        # Was the classifier's prediction correct?
        if c == test.column('Class').item(i):
            numcorrect = numcorrect + 1
    return numcorrect / test.num_rows

Now for the grand reveal -- let's see how we did. We'll arbitrarily use $k = 5$.

evaluate_accuracy(trainset, testset, 5)

0.967741935483871

About 96% accuracy. Not bad! Pretty darn good for such a simple technique.

As a footnote, you might have noticed that Brittany Wenger did even better. What techniques did she use? One key innovation is that she incorporated a confidence score into her results: her algorithm had a way to determine when it was not able to make a confident prediction, and for those patients, it didn't even try to predict their diagnosis. Her algorithm was 99% accurate on the patients where it made a prediction -- so that extension seemed to help quite a bit.

Important takeaways

Here are a few lessons we want you to learn from this.

First, machine learning is powerful. If you had to try to write code to make a diagnosis without knowing about machine learning, you might spend a lot of time by trial-and-error trying to come up with some complicated set of rules that seem to work, and the result might not be very accurate. The $k$-nearest neighbors algorithm automates the entire task for you. And machine learning often lets you make predictions far more accurately than anything you'd come up with by trial-and-error.


Second, you can do it. Yes, you. You can use machine learning in your own work to make predictions based on data. You now know enough to start applying these ideas to new data sets and help others make useful predictions. The techniques are very powerful, but you don't have to have a Ph.D. in statistics to use them.

Third, be careful about how to evaluate accuracy. Use a hold-out set.

There's lots more one can say about machine learning: how to choose attributes, how to choose $k$ or other parameters, what other classification methods are available, how to solve more complex prediction tasks, and lots more. In this course, we've barely even scratched the surface. If you enjoyed this material, you might enjoy continuing your studies in statistics and computer science; courses like Stats 132 and 154 and CS 188 and 189 go into a lot more depth.


Features

Section author: David Wagner


Building a model

So far, we have talked about prediction, where the purpose of learning is to be able to predict the class of new instances. I'm now going to switch to model building, where the goal is to learn a model of how the class depends upon the attributes.

One place where model building is useful is for science: e.g., which genes influence whether you become diabetic? This is interesting and useful in its own right (apart from any applications to predicting whether a particular individual will become diabetic), because it can potentially help us understand the workings of our body.

Another place where model building is useful is for control: e.g., what should I change about my advertisement to get more people to click on it? How should I change the profile picture I use on an online dating site, to get more people to "swipe right"? Which attributes make the biggest difference to whether people click/swipe? Our goal is to determine which attributes to change, to have the biggest possible effect on something we care about.

We already know how to build a classifier, given a training set. Let's see how to use that as a building block to help us solve these problems.

How do we figure out which attributes have the biggest influence on the output? Take a moment and see what you can come up with.

Feature selection

Background: attributes are also called features, in the machine learning literature.

Our goal is to find a subset of features that are most relevant to the output. The way we'll formalize this is to identify a subset of features that, when we train a classifier using just those features, gives the highest possible accuracy at prediction.

Intuitively, if we get 90% accuracy using all the features and 88% accuracy using just three of the features (for example), then it stands to reason that those three features are probably the most relevant, and they capture most of the information that affects or determines the output.

With this insight, our problem becomes:

Find the subset of features that gives the best possible accuracy (when we use only those features for prediction).

This is a feature selection problem. There are many possible approaches to feature selection. One simple one is to try all possible subsets of the features, and evaluate the accuracy of each. However, this can be very slow, because there are so many ways to choose a subset of the features.
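To get a feel for just how slow exhaustive search is: a data set with $n$ features has $2^n - 1$ non-empty subsets, and even fixing the subset size leaves $\binom{n}{k}$ candidates to evaluate. A quick sketch (the value 14 is the number of candidate features in the tree-cover example below):

```python
from math import comb

n = 14  # number of candidate features, as in the tree-cover data set below
print(2**n - 1)    # all non-empty subsets: 16383
print(comb(n, 4))  # subsets of exactly 4 features: 1001
```

Each of those subsets would require training and evaluating a classifier, so the cost adds up quickly as $n$ grows.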

Therefore, we'll consider a more efficient procedure that often works reasonably well in practice. It is known as greedy feature selection. Here's how it works.

1. Suppose there are $n$ features. Try each on its own, to see how much accuracy we can get using a classifier trained with just that one feature. Keep the best feature.

2. Now we have one feature. Try each of the remaining $n - 1$ features, to see which is the best one to add to it (i.e., we are now training a classifier with just 2 features: the best feature picked in step 1, plus one more). Keep the one that best improves accuracy. Now we have 2 features.

3. Repeat. At each stage, we try all possibilities for how to add one more feature to the feature subset we've already picked, and we keep the one that best improves accuracy.

Let's implement it and try it on some examples!

Code for k-NN

First, some code from last time, to implement $k$-nearest neighbors.

def distance(pt1, pt2):
    tot = 0
    for i in range(len(pt1)):
        tot = tot + (pt1[i] - pt2[i])**2
    return math.sqrt(tot)


def computetablewithdists(training, p):
    dists = np.zeros(training.num_rows)
    attributes = training.drop('Class').rows
    for i in range(training.num_rows):
        dists[i] = distance(attributes[i], p)
    withdists = training.copy()
    withdists.append_column('Distance', dists)
    return withdists

def closest(training, p, k):
    withdists = computetablewithdists(training, p)
    sortedbydist = withdists.sort('Distance')
    topk = sortedbydist.take(range(k))
    return topk

def majority(topkclasses):
    if topkclasses.where('Class', 1).num_rows > topkclasses.where('Class', 0).num_rows:
        return 1
    else:
        return 0

def classify(training, p, k):
    closestk = closest(training, p, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)

def evaluate_accuracy(training, valid, k):
    validattrs = valid.drop('Class')
    numcorrect = 0
    for i in range(valid.num_rows):
        # Run the classifier on the ith patient in the test set
        c = classify(training, validattrs.rows[i], k)
        # Was the classifier's prediction correct?
        if c == valid['Class'][i]:
            numcorrect = numcorrect + 1
    return numcorrect / valid.num_rows


Code for feature selection

Now we'll implement the feature selection algorithm. First, a subroutine to evaluate the accuracy when using a particular subset of features:

def evaluate_features(training, valid, features, k):
    tr = training.select(['Class'] + features)
    va = valid.select(['Class'] + features)
    return evaluate_accuracy(tr, va, k)

Next, we'll implement a subroutine that, given a current subset of features, tries all possible ways to add one more feature to the subset, and evaluates the accuracy of each candidate. This returns a table that summarizes the accuracy of each option it examined.

def try_one_more_feature(training, valid, baseattrs, k):
    results = Table(['Attribute', 'Accuracy'])
    for attr in training.drop(['Class'] + baseattrs).labels:
        acc = evaluate_features(training, valid, [attr] + baseattrs, k)
        results.append((attr, acc))
    return results.sort('Accuracy', descending=True)

Finally, we'll implement the greedy feature selection algorithm, using the above subroutines. For our own purposes of understanding what's going on, I'm going to have it print out, at each iteration, all features it considered and the accuracy it got with each.


def select_features(training, valid, k, maxfeatures=3):
    results = Table(['NumAttrs', 'Attributes', 'Accuracy'])
    curattrs = []
    iters = min(maxfeatures, len(training.labels) - 1)
    while len(curattrs) < iters:
        print('== Computing best feature to add to ' + str(curattrs))
        # Try all ways of adding just one more feature to curattrs
        r = try_one_more_feature(training, valid, curattrs, k)
        r.show()
        print()
        # Take the single best feature and add it to curattrs
        attr = r['Attribute'][0]
        acc = r['Accuracy'][0]
        curattrs.append(attr)
        results.append((len(curattrs), ','.join(curattrs), acc))
    return results

Example: Tree Cover

Now let's try it out on an example. I'm working with a data set gathered by the US Forestry Service. They visited thousands of wilderness locations and recorded various characteristics of the soil and land. They also recorded what kind of tree was growing predominantly on that land. Focusing only on areas where the tree cover was either Spruce or Lodgepole Pine, let's see if we can figure out which characteristics have the greatest effect on whether the predominant tree cover is Spruce or Lodgepole Pine.

There are 500,000 records in this data set -- more than I can analyze with the software we're using. So, I'll pick a random sample of just a fraction of these records, to let us do some experiments that will complete in a reasonable amount of time.

all_trees = Table.read_table('treecover2.csv.gz', sep=',')
training = all_trees.take(range(0, 1000))
validation = all_trees.take(range(1000, 1500))
test = all_trees.take(range(1500, 2000))

training.show(2)


Elevation | Aspect | Slope | HorizDistToWater | VertDistToWater | HorizDistToRoad
2804 | 139 | 9 | 268 | 65 | 3180
2785 | 155 | 18 | 242 | 118 | 3090
... (998 rows omitted)

Let's start by figuring out how accurate a classifier will be, if trained using this data. I'm going to arbitrarily use $k = 15$ for the $k$-nearest neighbor classifier.

evaluate_accuracy(training, validation, 15)

0.572

Now we'll apply feature selection. I wonder which characteristics have the biggest influence on whether Spruce vs Lodgepole Pine grows? We'll look for the best three features.

best_features = select_features(training, validation, 15)

== Computing best feature to add to []


Attribute Accuracy

Elevation 0.762

HorizDistToRoad 0.562

HorizDistToFire 0.534

Hillshade9am 0.518

Hillshade3pm 0.516

Slope 0.516

Area4 0.514

Area3 0.514

Area2 0.514

Area1 0.514

VertDistToWater 0.512

HillshadeNoon 0.51

Aspect 0.51

HorizDistToWater 0.502

== Computing best feature to add to ['Elevation']

Attribute Accuracy

HorizDistToWater 0.782

Hillshade9am 0.776

Slope 0.77

Hillshade3pm 0.768

Area4 0.762

Area3 0.762

Area2 0.762

Area1 0.762

VertDistToWater 0.762

Aspect 0.75

HillshadeNoon 0.748

HorizDistToRoad 0.73

HorizDistToFire 0.664


== Computing best feature to add to ['Elevation', 'HorizDistToWater']

Attribute Accuracy

Hillshade3pm 0.792

HillshadeNoon 0.792

Slope 0.784

Aspect 0.784

Area4 0.782

Area3 0.782

Area2 0.782

Area1 0.782

Hillshade9am 0.78

VertDistToWater 0.776

HorizDistToRoad 0.708

HorizDistToFire 0.64

best_features

NumAttrs Attributes Accuracy

1 Elevation 0.762

2 Elevation,HorizDistToWater 0.782

3 Elevation,HorizDistToWater,Hillshade3pm 0.792

As we can see, Elevation looks like far and away the most discriminative feature. This suggests that this characteristic might play a large role in the biology of which tree grows best, and thus might tell us something about the science.

What about the horizontal distance to water and the amount of shade at 3pm? It looks like they might also be predictive, and thus might play a role in the biology as well -- assuming the apparent improvement in accuracy is real, and we're not just fooling ourselves.

Hold-out sets: Training, Validation, Testing


Suppose we built a predictor using just the best two features, Elevation and HorizDistToWater. How accurate would we expect it to be, on the entire population of locations? 78.2% accurate? More? Less? Why?

Well, at this point, it's hard to tell. It's the same issue we mentioned earlier about fooling ourselves. We've tried multiple different approaches, and taken the best; if we then evaluate it on the same data set we used to select which is best, we will get a biased number -- it might not be an accurate estimate of the true accuracy.

Why? Our data set is noisy. We've looked for correlations, and kept the association that had the highest correlation in our training set. But was that a real relationship, or was it just noise? If you pick a single attribute and measure its accuracy on a sample, we'd expect this to be a reasonable approximation to its accuracy on the entire population, but with some random error. If we look at 100 possible combinations and choose the best, it's possible we found one whose accuracy on the entire population is indeed large -- or it's possible we were just selecting the one whose error term happened to be the largest of the 100.
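A small simulation illustrates this selection effect. Suppose (hypothetically) 100 candidate feature subsets all have the same true accuracy of 0.70, and each measured accuracy is the true value plus random noise. The winner's measured score is then systematically too high:

```python
import random

random.seed(42)
true_accuracy = 0.70

# 100 candidates, all equally good; each measurement adds noise of +/- 0.05.
measured = [true_accuracy + random.uniform(-0.05, 0.05) for _ in range(100)]

# Picking the best of 100 noisy measurements mostly picks the largest noise.
best = max(measured)
print(round(best, 3))  # close to 0.75, well above the true accuracy of 0.70
```

The chosen candidate isn't actually better than the others; its measured advantage is almost entirely noise, which is exactly why the selection step biases the accuracy estimate.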

For these reasons, we can't expect the accuracy numbers we've computed above to necessarily be a good, unbiased measure of the accuracy on the entire population.

The way to get an unbiased estimate of accuracy is the same as last lecture: get some more data; or set some aside in the beginning so we have more when we need it. In this case, I set aside two extra chunks of data, a validation data set and a test data set. I used the validation set to select a few best features. Now we're going to measure the performance of this on the test set, just to see what happens.

evaluate_features(training, validation, ['Elevation'], 15)

0.762

evaluate_features(training, test, ['Elevation'], 15)

0.712

evaluate_features(training, validation, ['Elevation', 'HorizDistToWater'], 15)

0.782


evaluate_features(training, test, ['Elevation', 'HorizDistToWater'], 15)

0.742

Why do you think we see this difference?

To get a better feeling for this, we can look at each of our features on its own, and compare its accuracy on the validation set vs its accuracy on another sample, to see how much random error there is in these accuracy figures. If there was no error, then both accuracy numbers would always be the same, but as we'll see, in practice there is some error, because computing the accuracy on a sample is only an approximation to the accuracy on the entire population.

accuracies = Table(['ValidationAccuracy', 'Accuracy on another sample'])
another_sample = all_trees.take(range(2000, 2500))
for feature in training.drop('Class').labels:
    x = evaluate_features(training, validation, [feature], 15)
    y = evaluate_features(training, another_sample, [feature], 15)
    accuracies.append((x, y))

accuracies.sort('ValidationAccuracy')

ValidationAccuracy | Accuracy on another sample

0.502 0.346

0.51 0.384

0.51 0.35

0.512 0.358

0.514 0.342

0.514 0.342

0.514 0.342

0.514 0.342

0.516 0.366

0.516 0.352

...(4rowsomitted)


accuracies.scatter('ValidationAccuracy')


Thought Questions

Suppose that the top two attributes had been Elevation and HorizDistToRoad. Interpret this for me. What might this mean for the biology of trees? One possible explanation is that the distance to the nearest road affects what kind of tree grows; can you give any other possible explanations?

Once we know the top two attributes are Elevation and HorizDistToWater, suppose we next wanted to know how they affect what kind of tree grows: e.g., does high elevation tend to favor spruce, or does it favor lodgepole pine? How would you go about answering these kinds of questions?

The scientists also gathered some more data that I left out, for simplicity: for each location, they also gathered what kind of soil it has, out of 40 different types. The original data set had a column for soil type, with numbers from 1-40 indicating which of the 40 types of soil was present. Suppose I wanted to include this among the other characteristics. What would go wrong, and how could I fix it up?


For this example we picked $k$ arbitrarily. Suppose we wanted to pick the best value of $k$ -- the one that gives the best accuracy. How could we go about doing that? What are the pitfalls, and how could they be addressed?

Suppose I wanted to use feature selection to help me adjust my online dating profile picture to get the most responses. There are some characteristics I can't change (such as how handsome I am), and some I can (such as whether I smile or not). How would I adjust the feature selection algorithm above to account for this?


Chapter 5

Inference


Confidence Intervals

Whenever we analyze a random sample of a population, our goal is to draw robust conclusions about the whole population, despite the fact that we can measure only part of it. We must guard against being misled by the inevitable random variations in a sample. It is possible that a sample has very different characteristics than the population itself, but such samples are unlikely. It is much more typical that a random sample, especially a large simple random sample, is representative of the population. More importantly, the chance that a sample resembles the population is something that we know in advance, based on how we selected our random sample. Statistical inference is the practice of using this information to make precise statements about a population based on a random sample.

Statistics and Parameters

The Bay Area Bike Share service published a data set describing every bicycle rental from September 2014 to August 2015 in their system. This set of all trips is a population, and fortunately we have the data for each and every trip rather than just a sample. We'll focus only on the free trips, which are trips that last less than 1800 seconds (half an hour).

free_trips = Table.read_table("trip.csv").where('Duration', are.below(1800))

free_trips


Start Station | End Station | Duration
Harry Bridges Plaza (Ferry Building) | San Francisco Caltrain (Townsend at 4th) | 765
San Antonio Shopping Center | Mountain View City Hall | 1036
Post at Kearny | 2nd at South Park | 307
San Jose City Hall | San Salvador at 1st | 409
Embarcadero at Folsom | Embarcadero at Sansome | 789
Yerba Buena Center of the Arts (3rd @ Howard) | San Francisco Caltrain (Townsend at 4th) | 293
Embarcadero at Folsom | Embarcadero at Sansome | 896
Embarcadero at Sansome | Steuart at Market | 255
Beale at Market | Temporary Transbay Terminal (Howard at Beale) | 126
Post at Kearny | South Van Ness at Market | 932
... (338333 rows omitted)

A quantity measured for a whole population is called a parameter. For example, the average duration of free trips from September 2014 to August 2015 can be computed exactly from these data: 550.00.

np.average(free_trips.column('Duration'))

550.00143345658103

Although we have information for the full population, we will investigate what we can learn from a simple random sample. Below, we sample 100 trips without replacement from free_trips not just once, but 40 different times. All 4000 trips are stored in a table called samples.

n = 100
samples = Table(['Sample#', 'Duration'])
for i in np.arange(40):
    for trip in free_trips.sample(n).rows:
        samples.append([i, trip.item('Duration')])

samples


Sample# Duration

0 557

0 491

0 877

0 279

0 339

0 542

0 715

0 255

0 1543

0 194

...(3990rowsomitted)

A quantity measured for a sample is called a statistic. For example, the average duration of a free trip in a sample is a statistic, and we find that there is some variation across samples.

for i in np.arange(3):
    sample_average = np.average(samples.where('Sample#', i).column('Duration'))
    print("Average duration in sample", i, "is", sample_average)

Average duration in sample 0 is 547.29
Average duration in sample 1 is 500.6
Average duration in sample 2 is 660.16

Often, the goal of analysis is to estimate parameters from statistics. Proper statistical inference involves more than just picking an estimate: we should also say something about how confident we should be in that estimate. A single guess is rarely sufficient to make a robust conclusion about a population. We also need to quantify in some way how precise we believe that guess to be.

Already we can see from the example above that the average duration of a sample is around 550, the parameter. Our goal is to quantify this notion of around, and one way to do so is to state a range of values.

Error in Estimates


We can learn a great deal about the behavior of a statistic by computing it for many samples. In the scatter diagram below, the order in which the statistic was computed appears on the horizontal axis, and the value of the average duration for a sample of n trips appears on the vertical axis.

samples.group(0, np.average).scatter(0, 1, s=50)

The sample number doesn't tell us anything beyond when the sample was selected, and we can see from this unstructured cloud of points that one sample does not affect the next in any consistent way. The sample averages are indeed all around 550. Some are above, and some are below.

samples.group(0, np.average).hist(1, bins=np.arange(480, 641, 10))


There are 40 averages for 40 samples. If we find an interval that contains the middle 38 out of 40, it will contain 95% of the sample averages. One such interval is bounded by the second smallest and the second largest sample averages among the 40 samples.

averages = np.sort(samples.group(0, np.average).column(1))
lower = averages.item(1)
upper = averages.item(38)
print('Empirically, 95% of sample averages fell within the interval', lower, 'to', upper)

Empirically, 95% of sample averages fell within the interval 502.38 to 620.04

This statement gives us our first notion of a margin of error. If this statement were to hold up not just for our 40 samples, but for all samples, then a reasonable strategy for estimating the true population average would be to draw a 100-trip sample, compute its sample average, and guess that the true average was within this margin of error of that quantity. We would be right at least 95% of the time.

The problem with this approach is that finding this margin of error required drawing many different samples, and we would like to be able to say something similar after having collected only one 100-trip sample.

Variability Within Samples

By contrast, the individual durations in the samples themselves are not closely clustered around 550. Instead, they span the whole range from 0 to 1800. The durations for the first three 100-item samples are visualized below. All have somewhat similar shapes; they were drawn from the same population.

for i in np.arange(3):
    samples.where('Sample#', i).hist(1, bins=np.arange(0, 1801, 200))


We have now observed two different ways in which the variation of a random sample can be observed: the statistics computed from those samples vary in magnitude, and the sampled quantities themselves vary in distribution. That is, there is variability across samples, which we see by comparing sample averages, and there is variability within each sample, which we see in the duration histograms above.

Percentiles

The 80th percentile of a collection of values is the smallest value in the collection that is at least as large as 80% of the values in the collection. Likewise, the 40th percentile is at least as large as 40% of the values.

For example, let's consider the sizes of the five largest continents: Africa, Antarctica, Asia, North America, and South America, rounded to the nearest million square miles.

sizes = [12, 17, 6, 9, 7]

The 20th percentile is the smallest value that is at least as large as 20%, or one fifth, of the elements.

percentile(20, sizes)

6

The 60th percentile is at least as large as 60%, or three fifths, of the elements.


percentile(60, sizes)

9

To find the element corresponding to a percentile $p$, sort the elements in increasing order and take the $k$th element, where $k = \frac{p}{100} \cdot n$ and $n$ is the size of the collection. In general, any percentile from 0 to 100 can be computed for a collection of values, and it is always an element of the collection. If $k$ is not an integer, round it up to find the percentile element. For example, the 90th percentile of a five-element set is the fifth element, because $\frac{90}{100} \cdot 5$ is 4.5, which rounds up to 5.

percentile(90, sizes)

17
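The rule above can be sketched in a few lines of plain Python (a hypothetical `percentile_of` helper that mimics the behavior described here, not the library's own implementation):

```python
import math

def percentile_of(p, values):
    # Sort the collection and take the k-th element (1-indexed),
    # where k = (p/100) * n, rounded up when it is not an integer.
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[k - 1]

sizes = [12, 17, 6, 9, 7]
print(percentile_of(20, sizes))  # 6
print(percentile_of(60, sizes))  # 9
print(percentile_of(90, sizes))  # 17
```

Rounding up with `math.ceil` is what guarantees the result is always an actual element of the collection.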

When the percentile function is called with a single argument, it returns a function that computes percentiles of collections.

tenth = percentile(10)

tenth(sizes)

6

tenth(np.arange(50))

4

tenth(np.arange(5000))

499


Confidence Intervals

Inference begins with a population: a collection that you can describe but often cannot measure directly. One thing we can often do with a population is to draw a simple random sample. By measuring a statistic of the sample, we may hope to estimate a parameter of the population. In this process, the population is not random, nor are its parameters. The sample is random, and so is its statistic. The whole process of estimation therefore gives a random result. It starts from a population (fixed), draws a sample (random), computes a statistic (random), and then estimates the parameter from the statistic (random). By calling the process random, we mean that the final estimate could have been different because the sample could have been different.

A confidence interval is one notion of a margin of error around an estimate. The margin of error acknowledges that the estimation process could have been misleading because of the variation in the sample. It gives a range that might contain the original parameter, along with a percentage of confidence. For example, a 95% confidence interval is one in which we would expect the interval to contain the true parameter for 95% of random samples. For any particular sample, any particular estimate, and any particular interval, the parameter is either contained or not; we never know for sure. The purpose of the confidence interval is to quantify how often our estimation process gives us a range containing the truth.

There are many different ways to compute a confidence interval. The challenge in computing a confidence interval from a single sample is that we don't know the population, and so we don't know how its samples will vary. One way to make progress despite this limited information is to assume that the variability within the sample is similar to the variability within the population.

Resampling from a Sample

Even though we happen to have the whole population of free_trips, let's imagine for a moment that we have only a single 100-trip sample from this population. That sample has 100 durations. Our goal now is to learn what we can about the population using just this sample.

sample = free_trips.sample(n)

sample


Start Station | End Station | Duration
Grant Avenue at Columbus Avenue | San Francisco Caltrain 2 (330 Townsend) | 877
Embarcadero at Folsom | San Francisco Caltrain (Townsend at 4th) | 597
Beale at Market | Beale at Market | 67
2nd at Folsom | Washington at Kearny | 714
Clay at Battery | Embarcadero at Sansome | 510
South Van Ness at Market | Powell Street BART | 494
San Jose Diridon Caltrain Station | MLK Library | 530
Mechanics Plaza (Market at Battery) | Clay at Battery | 286
South Van Ness at Market | Civic Center BART (7th at Market) | 172
Broadway St at Battery St | 5th at Howard | 551
... (90 rows omitted)

average = np.average(sample.column('Duration'))

average

518.55999999999995

A technique called bootstrap resampling uses the variability of the sample as a proxy for the variability of the population. Instead of drawing many samples from the population, we will draw many samples from the sample. Since the original sample and the re-sampled sample have the same size, we must draw with replacement in order to see any variability at all.

resamples = Table(['Resample #', 'Duration'])
for i in np.arange(40):
    resample = sample.sample(n, with_replacement=True)
    for trip in resample.rows:
        resamples.append([i, trip.item('Duration')])

resamples


Resample #  Duration
0           329
0           252
0           252
0           685
0           434
0           396
0           434
0           585
0           394
0           202
... (3990 rows omitted)

Using the resamples, we can investigate the variability of the corresponding sample averages.

resamples.group(0, np.average).scatter(0, 1, s=50)

These sample averages are not necessarily clustered around the true average 550. Instead, they will be clustered around the average of the sample from which they were re-sampled.


resamples.group(0, np.average).hist(1, bins=np.arange(480, 641, 10))

Although the center of this distribution is not the same as the true population average, its variability is similar.

Bootstrap Confidence Interval

A technique called bootstrap resampling estimates a confidence interval using resampling. To find a confidence interval, we must estimate how much the sample statistic deviates from the population parameter. The bootstrap assumes that a resampled statistic will deviate from the sample statistic in approximately the same way that the sample statistic deviates from the parameter.

To carry out the technique, we first compute the deviations from the sample statistic (in this case, the sample average) for many resampled samples. The number of resamples is arbitrary, and resampling doesn't require collecting any new data; 1000 or more is typical.

deviations = Table(['Resample #', 'Deviation'])
for i in np.arange(1000):
    resample = sample.sample(n, with_replacement=True)
    dev = average - np.average(resample.column(2))
    deviations.append([i, dev])

deviations


Resample #  Deviation
0           30.95
1           41.16
2           3.37
3           2.54
4           -22.87
5           27.63
6           49.93
7           11.91
8           26.32
9           -5.97
... (990 rows omitted)

A deviation of zero indicates that the resampled statistic is the same as the sample statistic. That's evidence that the sample statistic is perhaps the same as the population parameter. A positive deviation is evidence that the sample statistic may be greater than the parameter. As we can see, for averages, the distribution of deviations is roughly symmetric.

deviations.hist(1, bins=np.arange(-100, 101, 10))

A 95% confidence interval is computed based on the middle 95% of this distribution of deviations. The maximum and minimum deviations used are the 97.5th percentile and 2.5th percentile respectively, because the span between these contains 95% of the distribution.


lower = average - percentile(97.5, deviations.column(1))

lower

461.93000000000001

upper = average - percentile(2.5, deviations.column(1))

upper

576.29999999999995

Now, we have not only an estimate, but an interval around it that expresses our uncertainty about the value of the parameter.

print('A 95% confidence interval around', average, 'is', lower, 'to', upper)

A 95% confidence interval around 518.56 is 461.93 to 576.3

And behold, the interval happens to contain the true parameter 550. Normally, after generating a confidence interval, there is no way to check that it is correct, because to do so requires access to the entire population. In this case, we started with the whole population in a single table, and so we are able to check.

The following function encapsulates the entire process of producing a confidence interval for the population average using a single sample. Each time it is run, even for the same sample, the interval will be slightly different because of the random variation of resampling. However, the upper and lower bounds generated by 2000 samples for this problem are typically quite similar for multiple calls.


def average_ci(sample, label):
    deviations = Table(['Resample #', 'Deviation'])
    n = sample.num_rows
    average = np.average(sample.column(label))
    for i in np.arange(1000):
        resample = sample.sample(n, with_replacement=True)
        dev = np.average(resample.column(label)) - average
        deviations.append([i, dev])
    return (average - percentile(97.5, deviations.column(1)),
            average - percentile(2.5, deviations.column(1)))

average_ci(sample, 'Duration')

(457.43999999999994,572.63999999999987)

average_ci(sample, 'Duration')

(458.7399999999999,575.50999999999988)

However, if we draw an entirely new sample, then the confidence interval will change.

average_ci(free_trips.sample(n), 'Duration')

(530.18000000000006,661.21000000000004)

The interval is about the same width, and that's typical of multiple samples of the same size drawn from the same population. However, if a much larger sample is drawn, then the confidence interval that results is typically going to be much smaller. We're not losing confidence in this case; we're just using more evidence to perform inference, and so we have less uncertainty about the full population.

average_ci(free_trips.sample(16 * n), 'Duration')

(527.18937499999993,555.11437499999988)


Evaluating Confidence Intervals

When given only a sample, it is not possible to check whether the parameter was estimated correctly. However, in the case of the Bay Area Bike Share, we have not only the sample, but many samples and the whole population. Therefore, we can check how often the confidence intervals we compute in this way do in fact contain the true parameter.

Before measuring the accuracy of our technique, we can visualize the result of resampling each sample. The following scatter diagram has 6 vertical lines for 6 samples, and each line consists of 100 points that represent the statistics from resamples for that sample.

samples = Table(['Sample #', 'Resample average', 'Sample Average'])
for i in np.arange(6):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    for j in np.arange(100):
        resample = sample.sample(n, with_replacement=True)
        resample_average = np.average(resample.column('Duration'))
        samples.append([i, resample_average, average])

samples.scatter(0, s=80)

We can see from the plot above that for almost every sample, there are some resampled averages above 550 and some below. When we use these resampled averages to compute confidence intervals, many of those intervals contain 550 as a result.


Below, we draw 100 more samples and show the upper and lower bounds of the confidence intervals computed from each one.

intervals = Table(['Sample #', 'Lower', 'Estimate', 'Upper'])
for i in np.arange(100):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    lower, upper = average_ci(sample, 'Duration')
    intervals.append([i, lower, average, upper])

We can compute precisely which intervals contain the truth, and we should find that about 95 of them do.

correct = np.logical_and(intervals.column('Lower') <= 550, intervals.column('Upper') >= 550)

np.count_nonzero(correct)

97

Each confidence interval is generated from only a single sample of 100 trips. As a result, the width of the interval varies quite a bit as well. Compared to the interval width we computed originally (which required multiple samples), we see that some of these intervals are much more narrow and some are much wider.

Table().with_column('Width', intervals.column('Upper') - intervals.column('Lower')).hist(0)


Given a single sample, it is impossible to know whether you have correctly estimated the parameter or not. However, you can infer a precise notion of confidence by following a process based on random sampling.
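The whole deviations-based procedure can also be sketched with plain NumPy arrays, mirroring the average_ci function above. The exponential durations below are a made-up stand-in for a 100-trip sample, since the free_trips table is not available outside the chapter.

```python
import numpy as np

def bootstrap_average_ci(sample, resamples=1000, level=95):
    """Percentile-of-deviations bootstrap interval for a population average,
    following the approach in this section."""
    sample = np.asarray(sample, dtype=float)
    average = np.mean(sample)
    # Deviations of resampled averages from the sample average.
    devs = np.array([np.mean(np.random.choice(sample, size=len(sample))) - average
                     for _ in range(resamples)])
    tail = (100 - level) / 2
    # Shift the sample average by the extreme deviations to bracket the parameter.
    return (average - np.percentile(devs, 100 - tail),
            average - np.percentile(devs, tail))

# Hypothetical stand-in for a sample of 100 trip durations (an assumption for
# illustration only).
durations = np.random.RandomState(0).exponential(550, size=100)
lower, upper = bootstrap_average_ci(durations)
```

As in the chapter, running the function again on the same sample gives a slightly different interval because the resampling is random.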


Distance Between Distributions

By now we have seen many examples of situations in which one distribution appears to be quite close to another, such as a population and its sample. The goal of this section is to quantify the difference between two distributions. This will allow our analyses to be based on more than the assessments that we are able to make by eye.

For this, we need a measure of the distance between two distributions. We will develop such a measure in the context of a study conducted in 2010 by the American Civil Liberties Union (ACLU) of Northern California.

The focus of the study was the racial composition of jury panels in Alameda County. A jury panel is a group of people chosen to be prospective jurors; the final trial jury is selected from among them. Jury panels can consist of a few dozen people or several thousand, depending on the trial. By law, a jury panel is supposed to be representative of the community in which the trial is taking place. Section 197 of California's Code of Civil Procedure says, "All persons selected for jury service shall be selected at random, from a source or sources inclusive of a representative cross section of the population of the area served by the court."

The final jury is selected from the panel by deliberate inclusion or exclusion: the law allows potential jurors to be excused for medical reasons; lawyers on both sides may strike a certain number of potential jurors from the list in what are called "peremptory challenges"; the trial judge might make a selection based on questionnaires filled out by the panel; and so on. But the initial panel is supposed to resemble a random sample of the population of eligible jurors.

The ACLU of Northern California compiled data on the racial composition of the jury panels in 11 felony trials in Alameda County in the years 2009 and 2010. In those panels, the total number of people who reported for jury service was 1453. The ACLU gathered demographic data on all of these prospective jurors, and compared those data with the composition of all eligible jurors in the county.

The data are tabulated below in a table called jury. For each race, the first value is the percentage of all eligible juror candidates of that race. The second value is the percentage of people of that race among those who appeared for the process of selection into the jury.


jury = Table(["Race", "Eligible", "Panel"]).with_rows([
    ["Asian",  0.15, 0.26],
    ["Black",  0.18, 0.08],
    ["Latino", 0.12, 0.08],
    ["White",  0.54, 0.54],
    ["Other",  0.01, 0.04],
])
jury.set_format([1, 2], PercentFormatter(0))

Race    Eligible  Panel
Asian   15%       26%
Black   18%       8%
Latino  12%       8%
White   54%       54%
Other   1%        4%

Some races are overrepresented and some are underrepresented on the jury panels in the study. A bar graph is helpful for visualizing the differences.

jury.barh('Race')

Total Variation Distance


To measure the difference between the two distributions, we will compute a quantity called the total variation distance between them. To compute the total variation distance, take the difference between the two proportions in each category, add up the absolute values of all the differences, and then divide the sum by 2.

Here are the differences and the absolute differences:

jury.append_column("Difference", jury.column("Panel") - jury.column("Eligible"))

jury.append_column("Abs. Difference", np.abs(jury.column("Difference")))

jury.set_format([3, 4], PercentFormatter(0))

Race    Eligible  Panel  Difference  Abs. Difference
Asian   15%       26%    11%         11%
Black   18%       8%     -10%        10%
Latino  12%       8%     -4%         4%
White   54%       54%    0%          0%
Other   1%        4%     3%          3%

And here is the sum of the signed differences, divided by 2; it is essentially zero, because the positive and negative entries cancel:

sum(jury.column(3)) / 2

1.3877787807814457e-17

The total variation distance between the distribution of eligible jurors and the distribution of the panels is 0.14. Before we examine the numerical value, let us examine the reasons behind the steps in the calculation.

It is quite natural to compute the difference between the proportions in each category. Now take a look at the column Difference and notice that the sum of its entries is 0: the positive entries add up to 0.14, exactly canceling the total of the negative entries, which is -0.14. The proportions in each of the two columns Panel and Eligible add up to 1, and so the give-and-take between their entries must add up to 0.

To avoid the cancellation, we drop the negative signs and then add all the entries. But this gives us two times the total of the positive entries (equivalently, two times the total of the negative entries, with the sign removed). So we divide the sum by 2.


We would have obtained the same result by just adding the positive differences. But our method of including all the absolute differences eliminates the need to keep track of which differences are positive and which are not.
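As a quick arithmetic check with plain NumPy, using the jury proportions from the table above, both ways of computing the distance give the same 0.14:

```python
import numpy as np

# Jury proportions from the table above.
eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
panel = np.array([0.26, 0.08, 0.08, 0.54, 0.04])

diff = panel - eligible
# Half the sum of the absolute differences...
tvd_abs = np.abs(diff).sum() / 2
# ...equals the sum of just the positive differences.
tvd_pos = diff[diff > 0].sum()
```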

The following function computes the total variation distance between two columns of a table.

def total_variation_distance(column, other):
    return sum(np.abs(column - other)) / 2

def table_tvd(table, label, other):
    return total_variation_distance(table.column(label), table.column(other))

table_tvd(jury, 'Eligible', 'Panel')

0.14000000000000001

Are the panels representative of the population?

We will now turn to the numerical value of the total variation distance between the eligible jurors and the panel. How can we interpret the distance of 0.14? To answer this, let us recall that the panels are supposed to be selected at random. It will therefore be informative to compare the value of 0.14 with the total variation distance between the eligible jurors and a randomly selected panel.

To do this, we will employ our skills at simulation. There were 1453 prospective jurors in the panels in the study. So let us take a random sample of size 1453 from the population of eligible jurors.

[Technical note. Random samples of prospective jurors would be selected without replacement. However, when the size of a sample is small relative to the size of the population, sampling without replacement resembles sampling with replacement; the proportions in the population don't change much between draws. The population of eligible jurors in Alameda County is over a million, and compared to that, a sample size of about 1500 is quite small. We will therefore sample with replacement.]

The np.random.multinomial function draws a random sample uniformly with replacement from a population whose distribution is categorical. Specifically, np.random.multinomial takes as its input a sample size and an array consisting of the probabilities of choosing different categories. It returns the count of the number of times each category was chosen.

sample_size = 1453

random_panel = np.random.multinomial(sample_size, jury["Eligible"])

random_panel

array([208,265,166,805,9])

sum(random_panel)

1453

We can compare this distribution with the distribution of eligible jurors, by converting the counts to proportions. To do this, we will divide the counts by the sample size. It is clear from the results that the two distributions are quite similar.

sampled = jury.select(['Race', 'Eligible', 'Panel']).with_column(
    'Random Panel', random_panel / sample_size)
sampled.set_format(3, PercentFormatter(0))

Race    Eligible  Panel  Random Panel
Asian   15%       26%    14%
Black   18%       8%     18%
Latino  12%       8%     11%
White   54%       54%    55%
Other   1%        4%     1%

As always, it helps to visualize. The population and sample are quite similar.

sampled.barh('Race', [1, 3])


When we also compare to the panel, the difference is stark.

sampled.barh('Race', [1, 3, 2])

The total variation distance between the population distribution and the sample distribution is quite small.

table_tvd(sampled, 'Eligible', 'Random Panel')

0.016407432897453545

Comparing this to the distance of 0.14 for the panel quantifies our observation that the random sample is close to the distribution of eligible jurors, and the panel is relatively far from the distribution of eligible jurors.

However, the distance between the random sample and the eligible jurors depends on the sample; sampling again might give a different result.


How do random samples differ from the population?

The total variation distance between the distribution of the random sample and the distribution of the eligible jurors is the statistic that we are using to measure the distance between the two distributions. By repeating the process of sampling, we can measure how much the statistic varies across different random samples. The code below computes the empirical distribution of the statistic based on a large number of replications of the sampling process.

# Compute empirical distribution of TVDs
sample_size = 1453
repetitions = 1000
eligible = jury.column('Eligible')

tvds = Table(["TVD from a random sample"])
for i in np.arange(repetitions):
    sample = np.random.multinomial(sample_size, eligible) / sample_size
    tvd = sum(abs(sample - eligible)) / 2
    tvds.append([tvd])

tvds

TVD from a random sample

0.0207571

0.0286098

0.0201721

0.0308672

0.0185134

0.0102546

0.0119133

0.0346731

0.0271576

0.0216862

... (990 rows omitted)


Each row of the column above contains the total variation distance between a random sample and the population of eligible jurors.

The histogram of this column shows that drawing 1453 jurors at random from the pool of eligible candidates results in a distribution that rarely deviates from the eligible jurors' race distribution by more than 0.05.

tvds.hist(bins=np.arange(0, 0.2, 0.005))

How do the panels compare to random samples?

The panels in the study, however, were not quite so similar to the eligible population. The total variation distance between the panels and the population was 0.14, which is far out in the tail of the histogram above. It does not look like a typical distance between a random sample and the eligible population.

Our analysis supports the ACLU's conclusion that the panels were not representative of the population. The ACLU report discusses several reasons for the discrepancies. For example, some minority groups were underrepresented on the records of voter registration and of the Department of Motor Vehicles, the two main sources from which jurors are selected; at the time of the study, the county did not have an effective process for following up on prospective jurors who had been called but had failed to appear; and so on. Whatever the reasons, it seems clear that the composition of the jury panels was different from what we would have expected in a random sample.

A Classical Interlude: the Chi-Squared Statistic


"Do the data look like a random sample?" is a question that has been asked in many contexts for many years. In classical data analysis, the statistic most commonly used in answering this question is called the χ² ("chi-squared") statistic, and is calculated as follows:

Step 1. For each category, define the "expected count" as follows:

expected count = sample size × proportion of the category in the population distribution

This is the count that you would expect in that category, in a randomly chosen sample.

Step 2. For each category, compute

(observed count - expected count)² / expected count

Step 3. Add up all the numbers computed in Step 2.

A little algebra shows that this is equivalent to:

Alternative Method, Step 1. For each category, compute

(sample proportion - population proportion)² / population proportion

Alternative Method, Step 2. Add up all the numbers you computed in the step above, and multiply the sum by the sample size.

It makes sense that the statistic should be based on the difference between proportions in each category, just as the total variation distance is. The remaining steps of the method are not very easy to explain without getting deeper into mathematics.
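As a sanity check, the two methods can be carried out side by side in plain NumPy with the jury proportions from this section; the arrays and names here are just for illustration:

```python
import numpy as np

sample_size = 1453
eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])  # population proportions
panel = np.array([0.26, 0.08, 0.08, 0.54, 0.04])     # observed proportions

# Original method: compare observed and expected counts.
observed = panel * sample_size
expected = eligible * sample_size
chisq_counts = np.sum((observed - expected) ** 2 / expected)

# Alternative method: compare proportions, then scale by the sample size.
chisq_props = np.sum((panel - eligible) ** 2 / eligible) * sample_size
```

Both computations give the same number, roughly 348, matching the value for the panels below.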

The reason for choosing this statistic over any other was that statisticians were able to come up with a formula for its approximate probability distribution when the sample size is large. The distribution has an impressive name: it is called the χ² distribution with degrees of freedom equal to 4 (one fewer than the number of categories). Data analysts would compute the χ² statistic for their sample, and then compare it with tabulated numerical values of the χ² distributions.

Simulating the χ² statistic.

The χ² statistic is just a statistic like any other. We can use the computer to simulate its behavior even if we have not studied all the underlying mathematics.

For the panels in the ACLU study, the χ² statistic is about 348:


sum(((jury["Panel"] - jury["Eligible"]) ** 2) / jury["Eligible"]) * sample_size

348.07422222222226

To generate the empirical distribution of the χ² statistic, all we need is a small modification of the code we used for simulating total variation distance:

# Compute empirical distribution of chi-squared statistic
classical_from_sample = Table(["'Chi-squared' statistic, from a random sample"])
for i in np.arange(repetitions):
    sample = np.random.multinomial(sample_size, eligible) / sample_size
    chisq = sum(((sample - eligible) ** 2) / eligible) * sample_size
    classical_from_sample.append([chisq])

classical_from_sample

'Chi-squared' statistic, from a random sample

5.29816

1.83597

5.03969

4.3795

10.738

4.74311

4.15047

5.83918

9.6604

0.127756

... (990 rows omitted)

Here is a histogram of the empirical distribution of the χ² statistic. The simulated values of χ² based on random samples are considerably smaller than the value of 348 that we got from the jury panels.


classical_from_sample.hist(bins=np.arange(0, 18, 0.5), normed=True)

In fact, the long-run average value of χ² statistics is equal to the degrees of freedom. And indeed, the average of the simulated values of χ² is close to 4, the degrees of freedom in this case.

np.mean(classical_from_sample.column(0))

3.978631566873136

In this situation, random samples are expected to produce a χ² statistic of about 4. The panels, on the other hand, produced a χ² statistic of 348. This is yet more confirmation of the conclusion that the panels do not resemble a random sample from the population of eligible jurors.


Hypothesis Testing

Our investigation of jury panels is an example of hypothesis testing. We wished to know the answer to a yes-or-no question about the world, and we reasoned about random samples in order to answer the question. In this section, we formalize the process.

Sampling from a Distribution

The rows of a table describe individual elements of a collection of data. In the previous section, we worked with a table that described a distribution over categories. Tables that describe the proportions or counts of different categories in a sample or population arise often in data analysis as a summary of a data collection process. The sample_from_distribution method draws a sample of counts for the categories. Its implementation uses the same np.random.multinomial method from the previous section.

The table below describes the proportion of the eligible potential jurors in Alameda County (estimated from 2000 census data and other sources) for broad categories of race and ethnic background. This table was compiled by Professor Weeks, a demographer at San Diego State University, for the Alameda County trial People v. Stuart Alexander in 2002.

population = Table(["Race", "Eligible"]).with_rows([
    ["Asian",  0.15],
    ["Black",  0.18],
    ["Latino", 0.12],
    ["White",  0.54],
    ["Other",  0.01],
])
population

Race    Eligible
Asian   0.15
Black   0.18
Latino  0.12
White   0.54
Other   0.01


The sample_from_distribution method takes a number of samples and a column index or label. It returns a table with an additional column containing category counts for a sample drawn from that distribution.

sample_size = 1453

population.sample_from_distribution("Eligible", sample_size)

Race    Eligible  Eligible sample
Asian   0.15      233
Black   0.18      245
Latino  0.12      177
White   0.54      787
Other   0.01      11

Each count is selected randomly in such a way that the chance of each category is the corresponding proportion in the original "Eligible" column. Sampling again will give different counts, but again the "White" category is much more common than "Other" because of its much higher proportion.

population.sample_from_distribution("Eligible", sample_size)

Race    Eligible  Eligible sample
Asian   0.15      227
Black   0.18      247
Latino  0.12      155
White   0.54      819
Other   0.01      5

The option proportions=True draws sample counts and then divides each count by the total number of samples. The result is another distribution, one based on a sample.

sample = population.sample_from_distribution("Eligible", sample_size, proportions=True)

sample


Race    Eligible  Eligible sample
Asian   0.15      0.140399
Black   0.18      0.189264
Latino  0.12      0.125258
White   0.54      0.532691
Other   0.01      0.0123882

This sample can also be used to generate a sample. The sample of a sample is called a resampled sample or a bootstrap sample. It is not a random sample of the original distribution, but shares many characteristics with such a sample.

sample.sample_from_distribution('Eligible sample', sample_size)

Race    Eligible  Eligible sample  Eligible sample sample
Asian   0.15      0.140399         204
Black   0.18      0.189264         258
Latino  0.12      0.125258         163
White   0.54      0.532691         810
Other   0.01      0.0123882        18

Empirical Distributions

An empirical distribution of a statistic is a table generated by computing the statistic for many samples. The previous section used empirical distributions to reason about whether or not a sample had a statistic that was typical of a random sample. The following function computes the empirical distribution of a statistic, which is expressed as a function that takes a sample.

def empirical_distribution(table, label, sample_size, k, f):
    stats = Table(['Sample #', 'Statistic'])
    for i in np.arange(k):
        sample = table.sample_from_distribution(label, sample_size)
        statistic = f(sample)
        stats.append([i, statistic])
    return stats


This function can be used to compute an empirical distribution of the total variation distance from the population distribution.

def tvd_from_eligible(sample):
    counts = sample.column('Eligible sample')
    distribution = counts / sum(counts)
    return total_variation_distance(population.column('Eligible'), distribution)

empirical_distribution(population, 'Eligible', sample_size, 1000, tvd_from_eligible)

Sample #  Statistic
0         0.0248796
1         0.0195664
2         0.00401927
3         0.0165864
4         0.0207502
5         0.013042
6         0.0297041
7         0.0153544
8         0.00830695
9         0.014384
... (990 rows omitted)

The empirical distribution can be visualized using a histogram.

empirical_distribution(population, 'Eligible', sample_size, 1000, tvd_from_eligible).hist(1)


U.S. Supreme Court, 1965: Swain vs. Alabama

In the early 1960's, in Talladega County in Alabama, a black man called Robert Swain was convicted of raping a white woman and was sentenced to death. He appealed his sentence, citing among other factors the all-white jury. At the time, only men aged 21 or older were allowed to serve on juries in Talladega County. In the county, 26% of the eligible jurors were black, but there were only 8 black men among the 100 selected for the jury panel in Swain's trial. No black man was selected for the trial jury.

In 1965, the Supreme Court of the United States denied Swain's appeal. In its ruling, the Court wrote "... the overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of [black men]."

Let us use the methods we have developed to examine the disparity between 8 out of 100 and 26 out of 100 black men in a panel of 100 drawn at random from among the eligible jurors.

alabama = Table(['Race', 'Eligible']).with_rows([
    ["Black", 0.26],
    ["Other", 0.74]
])
alabama.set_format(1, PercentFormatter(0))

Race    Eligible
Black   26%
Other   74%


As our test statistic, we will use the number of black men in a random sample of size 100. Here is an empirical distribution of this statistic.

def value_for(category, category_label, value_label):
    def in_table(sample):
        return sample.where(category_label, category).column(value_label).item(0)
    return in_table

num_black = value_for('Black', 'Race', 'Eligible sample')

black_in_sample = empirical_distribution(alabama, 'Eligible', 100, 10000, num_black)

black_in_sample

Sample #  Statistic
0         27
1         31
2         28
3         27
4         24
5         23
6         29
7         28
8         16
9         25
... (9990 rows omitted)

The numbers of black men in the first 10 repetitions were quite a bit larger than 8. The empirical histogram below shows the distribution of the test statistic over all the repetitions.

black_in_sample.hist(1, bins=np.arange(0, 50, 1))


If the 100 men in Swain's jury panel had been chosen at random, it would have been extremely unlikely for the number of black men on the panel to be as small as 8. We must conclude that the percentage disparity was larger than the disparity expected due to chance variation.
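The same conclusion can be reached directly with NumPy, without the Table machinery used above: each panel member is black with chance 26%, so the count in a random panel of 100 follows a binomial distribution. This is a sketch; the seed and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed for reproducibility
# Simulate 100,000 random panels of 100 drawn from a 26%-black population.
counts = rng.binomial(n=100, p=0.26, size=100000)
# Proportion of simulated panels with 8 or fewer black men.
p_low = np.mean(counts <= 8)
```

The proportion is a tiny fraction of a percent, matching the judgment that a count as small as 8 is extremely unlikely under random selection.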

Method and Terminology of Statistical Tests of Hypotheses

We have developed some of the fundamental concepts of statistical tests of hypotheses, in the context of examples about jury selection. Using statistical tests as a way of making decisions is standard in many fields and has a standard terminology. Here is the sequence of the steps in most statistical tests, along with some terminology and examples.

STEP 1: THE HYPOTHESES

All statistical tests attempt to choose between two views of how the data were generated. These two views are called hypotheses.

The null hypothesis. This says that the data were generated at random under clearly specified assumptions that make it possible to compute chances. The word "null" reinforces the idea that if the data look different from what the null hypothesis predicts, the difference is due to nothing but chance.

In both of our examples about jury selection, the null hypothesis is that the panels were selected at random from the population of eligible jurors. Though the racial composition of the panels was different from that of the populations of eligible jurors, there was no reason for the difference other than chance variation.


The alternative hypothesis. This says that some reason other than chance made the data differ from what was predicted by the null hypothesis. Informally, the alternative hypothesis says that the observed difference is "real."

In both of our examples about jury selection, the alternative hypothesis is that the panels were not selected at random. Something other than chance led to the differences between the racial composition of the panels and the racial composition of the populations of eligible jurors.

STEP 2: THE TEST STATISTIC

In order to decide between the two hypotheses, we must choose a statistic upon which we will base our decision. This is called the test statistic.

In the example about jury panels in Alameda County, the test statistic we used was the total variation distance between the racial distributions in the panels and in the population of eligible jurors. In the example about Swain versus the State of Alabama, the test statistic was the number of black men on the jury panel.

Calculating the observed value of the test statistic is often the first computational step in a statistical test. In the case of jury panels in Alameda County, the observed value of the total variation distance between the distributions in the panels and the population was 0.14. In the example about Swain, the number of black men on his jury panel was 8.

STEP 3: THE PROBABILITY DISTRIBUTION OF THE TEST STATISTIC, UNDER THE NULL HYPOTHESIS

This step sets aside the observed value of the test statistic, and instead focuses on what the value might be if the null hypothesis were true. Under the null hypothesis, the sample could have come out differently due to chance. So the test statistic could have come out differently. This step consists of figuring out all possible values of the test statistic and all their probabilities, under the null hypothesis of randomness.

In other words, in this step we calculate the probability distribution of the test statistic pretending that the null hypothesis is true. For many test statistics, this can be a daunting task both mathematically and computationally. Therefore, we approximate the probability distribution of the test statistic by the empirical distribution of the statistic based on a large number of repetitions of the sampling procedure.

This was the empirical distribution of the test statistic – the number of black men on the jury panel – in the example about Swain and the Supreme Court:

black_in_sample.hist(1, bins=np.arange(0, 50, 1))


STEP 4. THE CONCLUSION OF THE TEST

The choice between the null and alternative hypotheses depends on the comparison between the results of Steps 2 and 3: the observed test statistic and its distribution as predicted by the null hypothesis.

If the two are consistent with each other, then the observed test statistic is in line with what the null hypothesis predicts. In other words, the test does not point towards the alternative hypothesis; the null hypothesis is better supported by the data.

But if the two are not consistent with each other, as is the case in both of our examples about jury panels, then the data do not support the null hypothesis. In both our examples we concluded that the jury panels were not selected at random. Something other than chance affected their composition.

How is "consistent" defined? Whether the observed test statistic is consistent with its predicted distribution under the null hypothesis is a matter of judgment. We recommend that you provide your judgment along with the value of the test statistic and a graph of its predicted distribution. That will allow your reader to make his or her own judgment about whether the two are consistent.

If you do not want to make your own judgment, there are conventions that you can follow. These conventions are based on what is called the observed significance level or P-value for short. The P-value is a chance computed using the probability distribution of the test statistic, and can be approximated by using the empirical distribution in Step 3.

Practical note on P-values and conventions. Place the observed test statistic on the horizontal axis of the histogram, and find the proportion in the tail starting at that point. That's the P-value.

If the P-value is small, the data support the alternative hypothesis. The conventions for what is "small":


If the P-value is less than 5%, the result is called "statistically significant."

If the P-value is even smaller – less than 1% – the result is called "highly statistically significant."

More formal definition of P-value. The P-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed or is even further in the direction of the alternative.

The P-value is based on comparing the observed test statistic and what the null hypothesis predicts. A small P-value implies that under the null hypothesis, the value of the test statistic is unlikely to be near the one we observed. When a hypothesis and the data are not in accord, the hypothesis has to go. That is why we reject the null hypothesis if the P-value is small.
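This definition can be sketched as a small helper function (the name and interface here are ours, not from the text): an empirical P-value is just the proportion of simulated statistics at least as extreme as the observed one, in the direction of the alternative.

```python
import numpy as np

def empirical_p_value(simulated, observed, tail='left'):
    """Proportion of simulated statistics at least as extreme as the
    observed value, in the direction of the alternative."""
    simulated = np.asarray(simulated)
    if tail == 'left':
        return np.mean(simulated <= observed)
    return np.mean(simulated >= observed)

# Toy example: 5 simulated statistics under the null, observed value 2.
p = empirical_p_value([1, 2, 3, 4, 5], 2, tail='left')  # 0.4
```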

The P-value for the null hypothesis that Swain's jury panel was selected at random from the population of Talladega is approximated using the empirical distribution as follows:

np.count_nonzero(black_in_sample.column(1) <= 8) / black_in_sample.num_rows

0.0

HISTORICAL NOTE ON THE CONVENTIONS

The determination of statistical significance, as defined above, has become standard in statistical analyses in all fields of application. When a convention is so universally followed, it is interesting to examine how it arose.

The method of statistical testing – choosing between hypotheses based on data in random samples – was developed by Sir Ronald Fisher in the early 20th century. Sir Ronald might have set the convention for statistical significance somewhat unwittingly, in the following statement in his 1925 book Statistical Methods for Research Workers. About the 5% level, he wrote, "It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not."

What was "convenient" for Sir Ronald became a cutoff that has acquired the status of a universal constant. No matter that Sir Ronald himself made the point that the value was his personal choice from among many: in an article in 1926, he wrote, "If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the author prefers to set a low standard of significance at the 5 per cent point ..."


Fisher knew that "low" is a matter of judgment and has no unique definition. We suggest that you follow his excellent example. Provide your data, make your judgment, and explain why you made it.


Permutation

Comparing Two Groups

In the examples above, we investigated whether a sample appears to be chosen randomly from an underlying population. We did this by comparing the distribution of the sample with the distribution of the population. A similar line of reasoning can be used to compare the distributions of two samples. In particular, we can investigate whether or not two samples appear to be drawn from the same underlying distribution.

Example: Married Couples and Unmarried Partners

Our next example is based on a study conducted in 2010 under the auspices of the National Center for Family and Marriage Research.

In the United States, the proportion of couples who live together but are not married has been rising in recent decades. The study involved a national random sample of over 1,000 heterosexual couples who were either married or "cohabiting partners" – living together but unmarried. One of the goals of the study was to compare the attitudes and experiences of the married and unmarried couples.

The table below shows a subset of the data collected in the study. Each row corresponds to one person. The variables that we will examine in this section are:

MaritalStatus: married or unmarried
EmploymentStatus: one of several categories described below
Gender
Age: age in years

columns = ['MaritalStatus', 'EmploymentStatus', 'Gender', 'Age']

couples = Table.read_table('couples.csv').select(columns)

couples


MaritalStatus  EmploymentStatus           Gender  Age
married        working as paid employee   male    51
married        working as paid employee   female  53
married        working as paid employee   male    57
married        working as paid employee   female  57
married        working as paid employee   male    60
married        working as paid employee   female  57
married        working, self-employed     male    62
married        working as paid employee   female  59
married        not working - other        male    53
married        not working - retired      female  61
... (2056 rows omitted)

Let us consider just the males first. There are 742 married couples and 292 unmarried couples, and all couples in this study had one male and one female, making 1,034 males in all.

# Separate tables for married and cohabiting unmarried couples:

married_men = couples.where('Gender', 'male').where('MaritalStatus', 'married')

partnered_men = couples.where('Gender', 'male').where('MaritalStatus', 'partner')

# Let's see how many married and unmarried people there are:

couples.where('Gender', 'male').group('MaritalStatus').barh("MaritalStatus")


Societal norms have changed over the decades, and there has been a gradual acceptance of couples living together without being married. Thus it is natural to expect that unmarried couples will in general consist of younger people than married couples.

The histograms of the ages of the married and unmarried men show that this is indeed the case. We will draw these histograms and compare them. In order to compare two histograms, both should be drawn to the same scale. Let us write a function that does this for us.

def plot_age(table, subject):
    """
    Draws a histogram of the Age column in the given table.

    table should be a Table with a column of people's ages called Age.
    subject should be a string -- the name of the group we're displaying,
    like "married men".
    """
    # Draw a histogram of ages running from 15 years to 70 years
    table.hist('Age', bins=np.arange(15, 71, 5), unit='year')
    # Set the lower and upper bounds of the vertical axis so that
    # the plots we make are all comparable.
    plt.ylim(0, 0.045)
    plt.title("Ages of " + subject)


# Ages of men:

plot_age(married_men, "married men")

plot_age(partnered_men, "cohabiting unmarried men")

The difference is even more marked when we compare the married and unmarried women.

married_women = couples.where('Gender', 'female').where('MaritalStatus', 'married')

partnered_women = couples.where('Gender', 'female').where('MaritalStatus', 'partner')

# Ages of women:

plot_age(married_women, "married women")

plot_age(partnered_women, "cohabiting unmarried women")


The histograms show that the married men in the sample are in general older than the unmarried cohabiting men. Married women are in general older than unmarried women. These observations are consistent with what we had predicted based on changing social norms.

If married couples are in general older, they might differ from unmarried couples in other ways as well. Let us compare the employment status of the married and unmarried men in the sample.

The table below shows the marital status and employment status of each man in the sample.

males = couples.where('Gender', 'male').select(['MaritalStatus', 'EmploymentStatus'])

males.show(5)


MaritalStatus  EmploymentStatus
married        working as paid employee
married        working as paid employee
married        working as paid employee
married        working, self-employed
married        not working - other
... (1028 rows omitted)

Contingency Tables

To investigate the association between employment and marriage, we would like to be able to ask questions like, "How many married men are retired?"

Recall that the method pivot lets us do exactly that. It cross-classifies each man according to the two variables – marital status and employment status. Its output is a contingency table that contains the counts in each pair of categories.

employment = males.pivot('MaritalStatus', 'EmploymentStatus')

employment

EmploymentStatus                                married  partner
not working - disabled                               44       20
not working - looking for work                       28       33
not working - on a temporary layoff from a job       15        8
not working - other                                  16        9
not working - retired                                44        4
working as paid employee                            513      170
working, self-employed                               82       47

The arguments of pivot are the labels of the two columns corresponding to the variables we are studying. Categories of the first argument appear as columns; categories of the second argument are the rows. Each cell of the table contains the number of men in a pair of categories – a particular employment status and a particular marital status.

The table shows that regardless of marital status, the men in the sample are most likely to be working as paid employees. But it is quite hard to compare the entire distributions based on this table, because the numbers of married and unmarried men in the sample are not the same. There are 742 married men but only 291 unmarried ones.

employment.drop(0).sum()

married  partner
    742      291

To adjust for this difference in total numbers, we will convert the counts into proportions, by dividing all the married counts by 742 and all the partner counts by 291.

proportions = Table().with_columns([
    "EmploymentStatus", employment.column("EmploymentStatus"),
    "married", employment.column('married') / sum(employment.column('married')),
    "partner", employment.column('partner') / sum(employment.column('partner'))
])

proportions.show()

EmploymentStatus                                married    partner
not working - disabled                          0.0592992  0.0687285
not working - looking for work                  0.0377358  0.113402
not working - on a temporary layoff from a job  0.0202156  0.0274914
not working - other                             0.0215633  0.0309278
not working - retired                           0.0592992  0.0137457
working as paid employee                        0.691375   0.584192
working, self-employed                          0.110512   0.161512

The married column of this table shows the distribution of employment status of the married men in the sample. For example, among married men, the proportion who are retired is about 0.059. The partner column shows the distribution of the employment status of the unmarried men in the sample. Among unmarried men, the proportion who are retired is about 0.014.

The two distributions look different from each other in other ways too, as can be seen more clearly in the bar graphs below. It appears that a larger proportion of the married men in the sample work as paid employees, whereas a larger proportion of the unmarried men are not working but are looking for work.


proportions.barh('EmploymentStatus')

The distributions of employment status of the men in the two groups – married and unmarried – are clearly different in the sample.

Are the two distributions different in the population?

This raises the question of whether the difference is due to randomness in the sampling, or whether the distributions of employment status are indeed different for married and unmarried cohabiting men in the U.S. Remember that the data that we have are from a sample of just 1,033 couples; we do not know the distribution of employment status of married or unmarried cohabiting men in the entire country.

We can answer the question by performing a statistical test of hypotheses. Let us use the terminology that we developed for this in the previous section.

Null hypothesis. In the United States, the distribution of employment status among married men is the same as among unmarried men who live with their partners.

Another way of saying this is that employment status and marital status are independent, or not associated.

If the null hypothesis were true, then the difference that we have observed in the sample would be just due to chance.

Alternative hypothesis. In the United States, the distributions of the employment status of the two groups of men are different. In other words, employment status and marital status are associated in some way.

As our test statistic, we will use the total variation distance between the two distributions.
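As a reminder of how the total variation distance works, here is a tiny standalone illustration with made-up proportions (not the survey data): the TVD of two distributions is half the sum of the absolute differences of their category proportions.

```python
import numpy as np

# Toy example: two distributions over the same three categories,
# each summing to 1.  (Made-up numbers, not the couples data.)
dist_a = np.array([0.50, 0.30, 0.20])
dist_b = np.array([0.40, 0.30, 0.30])

# Total variation distance: half the sum of absolute differences.
toy_tvd = 0.5 * sum(abs(dist_a - dist_b))
# 0.5 * (0.10 + 0.00 + 0.10) = 0.10
```

The halving makes the statistic interpretable: it is the largest possible difference between the probabilities that the two distributions assign to any collection of categories.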

The observed value of the test statistic is about 0.15:


# TVD between the two distributions in the sample

married = proportions.column('married')

partner = proportions.column('partner')

observed_tvd = 0.5 * sum(abs(married - partner))

observed_tvd

0.15273571011754242

Random Permutations

In order to compare this observed value of the total variation distance with what is predicted by the null hypothesis, we need to know how the total variation distance would vary across all possible random samples if employment status and marital status were not related.

This is quite daunting to derive by mathematics, but let us see if we can get a good approximation by simulation.

With just one sample at hand, and no further knowledge of the distribution of employment status among men in the United States, how can we go about replicating the sampling procedure? The key is to note that if marital status and employment status were not connected in any way, then we could replicate the sampling process by replacing each man's employment status by a randomly picked employment status from among all the men, married and unmarried.

Doing this for all the men is equivalent to randomly rearranging the entire column containing employment status, while leaving the marital status column unchanged. Such a rearrangement is called a random permutation.

Thus, under the null hypothesis, we can replicate the sampling process by assigning to each man an employment status chosen at random without replacement from the entries in the column EmploymentStatus. We can do the replication by simply permuting the entire EmploymentStatus column and leaving everything else unchanged.
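The shuffling idea can be sketched at the array level before we apply it to the table. This is a minimal standalone illustration with four made-up rows (the labels below are hypothetical, not the survey data): one column is permuted while the other stays fixed, which breaks any association between them.

```python
import numpy as np

# Made-up toy columns, not the couples data.
marital = np.array(['married', 'married', 'partner', 'partner'])
employment = np.array(['employee', 'retired', 'employee', 'looking'])

# Randomly rearrange the employment column; the marital column is
# left untouched, so the pairing between the two is now random.
rng = np.random.default_rng(0)
shuffled_employment = rng.permutation(employment)

# The shuffled column contains exactly the same values, just in a
# random order -- a random permutation, i.e. sampling without
# replacement from the column's own entries.
```

Under the null hypothesis of no association, every such pairing is equally likely, which is why the shuffled tables below are fair stand-ins for new samples.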

Let's implement this plan. First, we will shuffle the EmploymentStatus column using the sample method, which just shuffles all the rows when provided with no arguments.

# Randomly permute the employment status of all men

shuffled = males.select('EmploymentStatus').sample()

shuffled


EmploymentStatus
working, self-employed
working, self-employed
working as paid employee
working as paid employee
working, self-employed
working as paid employee
not working - disabled
working as paid employee
working as paid employee
working as paid employee
... (1023 rows omitted)

The first two columns of the table below are taken from the original sample. The third has been created by randomly permuting the original EmploymentStatus column.

# Construct a table in which employment status has been shuffled

males_with_shuffled_empl = Table().with_columns([
    "MaritalStatus", males.column('MaritalStatus'),
    "EmploymentStatus", males.column('EmploymentStatus'),
    "EmploymentStatus (shuffled)", shuffled.column('EmploymentStatus')
])

males_with_shuffled_empl


MaritalStatus  EmploymentStatus                                EmploymentStatus (shuffled)
married        working as paid employee                        working, self-employed
married        working as paid employee                        working, self-employed
married        working as paid employee                        working as paid employee
married        working, self-employed                          working as paid employee
married        not working - other                             working, self-employed
married        not working - on a temporary layoff from a job  working as paid employee
married        not working - disabled                          not working - disabled
married        working as paid employee                        working as paid employee
married        working as paid employee                        working as paid employee
married        not working - retired                           working as paid employee
... (1023 rows omitted)

Once again, the pivot method computes the contingency table, which allows us to calculate the total variation distance between the distributions of the two groups of men after their employment status has been shuffled.

employment_shuffled = males_with_shuffled_empl.pivot('MaritalStatus', 'EmploymentStatus (shuffled)')

employment_shuffled

EmploymentStatus (shuffled)                     married  partner
not working - disabled                               48       16
not working - looking for work                       44       17
not working - on a temporary layoff from a job       16        7
not working - other                                  18        7
not working - retired                                39        9
working as paid employee                            489      194
working, self-employed                               88       41


# TVD between the two distributions in the contingency table above

e_s = employment_shuffled

married = e_s.column('married') / sum(e_s.column('married'))

partner = e_s.column('partner') / sum(e_s.column('partner'))

0.5 * sum(abs(married - partner))

0.032423745611841297

This total variation distance was computed based on the null hypothesis that the distributions of employment status for the two groups of men are the same. You can see that it is noticeably smaller than the observed value of the total variation distance (0.15) between the two groups in our original sample.

A Permutation Test

Could this just be due to chance variation? We will only know if we run many more replications, by randomly permuting the EmploymentStatus column repeatedly. This method of testing is known as a permutation test.


# Put it all together in a for loop to perform a permutation test

repetitions = 500

tvds = Table().with_column("TVD between married and partnered men", [])

for i in np.arange(repetitions):
    # Construct a permuted table
    shuffled = males.select('EmploymentStatus').sample()
    combined = Table().with_columns([
        "MaritalStatus", males.column('MaritalStatus'),
        "EmploymentStatus", shuffled.column('EmploymentStatus')
    ])
    employment_shuffled = combined.pivot('MaritalStatus', 'EmploymentStatus')
    # Compute TVD
    e_s = employment_shuffled
    married = e_s.column('married') / sum(e_s.column('married'))
    partner = e_s.column('partner') / sum(e_s.column('partner'))
    permutation_tvd = 0.5 * sum(abs(married - partner))
    tvds.append([permutation_tvd])

tvds.hist(bins=np.arange(0, 0.2, 0.01))

The figure above is the empirical distribution of the total variation distance between the distributions of the employment status of married and unmarried men, under the null hypothesis.


The observed test statistic of 0.15 is quite far in the tail, and so the chance of observing such an extreme value under the null hypothesis is close to 0.

As before, this chance is called an empirical P-value. The P-value is the chance that our test statistic (TVD) would come out at least as extreme as the observed value (in this case 0.15 or greater) under the null hypothesis.

Conclusion of the test

Our empirical estimate based on repeated sampling gives us all the information we need for drawing conclusions from the data: the observed statistic is very unlikely under the null hypothesis.

The low P-value constitutes evidence in favor of the alternative hypothesis. The data support the hypothesis that in the United States, the distribution of the employment status of married men is not the same as that of unmarried men who live with their partners.

We have just completed our first permutation test. Permutation tests are quite common in practice because they make very few assumptions about the underlying population and are straightforward to perform and interpret.

Note about the approximate P-value

Our simulation gives us an approximate empirical P-value, because it is based on just 500 random permutations instead of all the possible random permutations. We can compute this empirical P-value directly, without drawing the histogram:

empirical_p_value = np.count_nonzero(tvds.column(0) >= observed_tvd) / repetitions

empirical_p_value

0.0

Computing the exact P-value would require us to consider all possible outcomes of shuffling (a very large number) instead of 500 random shuffles. If we had performed all the possible shuffles, there would have been a few with more extreme TVDs. The true P-value is greater than zero, but not by much.
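To see why enumerating every shuffle is hopeless, we can count them. This is a quick standalone sketch: the number of distinct orderings of n rows is n! (n factorial), which grows explosively.

```python
import math

# Even 15 rows -- the number of footballs inspected in a later example
# in this chapter -- already admit over a trillion orderings:
math.factorial(15)  # 1307674368000

# For a table of about 1,000 rows, n! has over 2,500 digits, so random
# sampling of shuffles is the only practical option.
len(str(math.factorial(1000)))
```

This is why permutation tests settle for a few hundred or thousand random shuffles and report an approximate empirical P-value.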

Generalizing Our Hypothesis Test


The example above includes a substantial amount of code in order to investigate the relationship between two characteristics (marital status and employment status) for a particular subset of the surveyed population (males). Suppose we would like to investigate different characteristics or a different population. How can we reuse the code we have written so far in order to explore more relationships?

When you are about to copy your code, you should think, "Maybe I should write some functions."

What functions to write? A good way to make this decision is to think about what you have to compute repeatedly.

In our example, the total variation distance is computed over and over again. So we will begin with a generalized computation of the total variation distance between the distributions of any column of values (such as employment status) when separated into any two conditions (such as marital status) for a collection of data described by any table. Our implementation includes the same statements as we used above, but uses generic names that are specified by the final function call.

# TVD between the distributions of values under any two conditions

def tvd(t, conditions, values):
    """Compute the total variation distance
    between proportions of values under two conditions.

    t (Table) -- a table
    conditions (str) -- a column label in t; should have only two categories
    values (str) -- a column label in t
    """
    e = t.pivot(conditions, values)
    a = e.column(1)
    b = e.column(2)
    return 0.5 * sum(abs(a / sum(a) - b / sum(b)))

tvd(males, 'MaritalStatus', 'EmploymentStatus')

0.15273571011754242


Next, we can write a function that performs a permutation test using this tvd function to compute the same statistic on shuffled variants of any table. It's worth reading through this implementation to understand its details.

def permutation_tvd(original, conditions, values):
    """
    Perform a permutation test of whether
    the distribution of values for two conditions
    is the same in the population,
    using the total variation distance between two distributions
    as the test statistic.

    original is a Table with two columns. The value of the argument
    conditions is the name of one column, and the value of the argument
    values is the name of the other column. The conditions column should
    have only 2 possible values corresponding to 2 categories in the
    data.

    The values column is shuffled many times, and the data are grouped
    according to the conditions column. The total variation distance
    between the proportions of values in the 2 categories is computed.

    Then we draw a histogram of all those TV distances. This shows us
    what the TVD between the values of the two distributions would typically
    look like if the values were independent of the conditions.
    """
    repetitions = 500
    stats = []
    for i in np.arange(repetitions):
        shuffled = original.sample()
        combined = Table().with_columns([
            conditions, original.column(conditions),
            values, shuffled.column(values)
        ])
        stats.append(tvd(combined, conditions, values))
    observation = tvd(original, conditions, values)
    p_value = np.count_nonzero(np.array(stats) >= observation) / repetitions
    print("Observation:", observation)
    print("Empirical P-value:", p_value)
    Table([stats], ['Empirical distribution']).hist()

permutation_tvd(males, 'MaritalStatus', 'EmploymentStatus')

Observation: 0.152735710118

Empirical P-value: 0.0

Now that we have generalized our permutation test, we can apply it to other hypotheses. For example, we can compare the distribution of employment status among women, grouping them by their marital status. In the case of men we found a difference, but what about with women? First, we can visualize the two distributions.

def compare_bar(t, conditions, values):
    """Bar graphs of distributions of values for each of two conditions."""
    e = t.pivot(conditions, values)
    for label in e.drop(0).labels:
        # Convert each column of counts into proportions
        e.append_column(label, e.column(label) / sum(e.column(label)))
    e.barh(values)

compare_bar(couples.where('Gender', 'female'), 'MaritalStatus', 'EmploymentStatus')


A glance at the figure shows that the two distributions are different in the sample. The difference in the category "not working – other" is particularly striking: about 22% of the married women are in this category, compared to only about 8% of the unmarried women. There are several reasons for this difference. For example, the percent of homemakers is greater among married women than among unmarried women, possibly because married women are more likely to be "stay-at-home" mothers of young children. The difference could also be generational: as we saw earlier, the married couples are older than the unmarried partners, and older women are less likely to be in the workforce than younger women.

While we can see that the distributions are different in the sample, we are not really interested in the sample for its own sake. We are examining the sample because it is likely to reflect the population. So, as before, we will use the sample to try to answer a question about something unknown: the distributions of employment status of married and unmarried cohabiting women in the United States. That is the population from which the sample was drawn.

We have to consider the possibility that the observed difference in the sample could simply be the result of chance variation. Remember that our data are only from a random sample of couples. We do not have data for all the couples in the United States.

Null hypothesis: In the U.S., the distribution of employment status is the same for married women as for unmarried women living with their partners. The difference in the sample is due to chance.

Alternative hypothesis: In the U.S., the distributions of employment status among married and unmarried cohabiting women are different.

Another permutation test to compare distributions

We can test these hypotheses just as we did for men, by using the function permutation_tvd that we defined for this purpose.

permutation_tvd(couples.where('Gender', 'female'), 'MaritalStatus', 'EmploymentStatus')


Observation: 0.194755513565

Empirical P-value: 0.0

As for the males, the empirical P-value is 0 based on a large number of repetitions. So the exact P-value is close to 0, which is evidence in favor of the alternative hypothesis. The data support the hypothesis that for women in the United States, employment status is associated with whether they are married or unmarried and living with their partners.

Another example

Are gender and employment status independent in the population? We are now in a position to test this quite swiftly:

Null hypothesis. Among married and unmarried cohabiting individuals in the United States, gender is independent of employment status.

Alternative hypothesis. Among married and unmarried cohabiting people in the United States, gender and employment status are related.

permutation_tvd(couples, 'Gender', 'EmploymentStatus')

Observation: 0.185866408519

Empirical P-value: 0.0


The conclusion of the test is that gender and employment status are not independent in the population. This is no surprise; for example, because of societal norms, older women were less likely to have gone into the workforce than men.

Deflategate: Permutation Tests and Quantitative Variables

On January 18, 2015, the Indianapolis Colts and the New England Patriots played the American Football Conference (AFC) championship game to determine which of those teams would play in the Super Bowl. After the game, there were allegations that the Patriots' footballs had not been inflated as much as the regulations required; they were softer. This could be an advantage, as softer balls might be easier to catch.

For several weeks, the world of American football was consumed by accusations, denials, theories, and suspicions: the press labeled the topic Deflategate, after the Watergate political scandal of the 1970's. The National Football League (NFL) commissioned an independent analysis. In this example, we will perform our own analysis of the data.

Pressure is often measured in pounds per square inch (psi). NFL rules stipulate that game balls must be inflated to have pressures in the range 12.5 psi to 13.5 psi. Each team plays with 12 balls. Teams have the responsibility of maintaining the pressure in their own footballs, but game officials inspect the balls. Before the start of the AFC game, all the Patriots' balls were at about 12.5 psi. Most of the Colts' balls were at about 13.0 psi. However, these pre-game data were not recorded.

During the second quarter, the Colts intercepted a Patriots ball. On the sidelines, they measured the pressure of the ball and determined that it was below the 12.5 psi threshold. Promptly, they informed officials.


At half-time, all the game balls were collected for inspection. Two officials, Clete Blakeman and Dyrol Prioleau, measured the pressure in each of the balls. Here are the data; pressure is measured in psi. The Patriots ball that had been intercepted by the Colts was not inspected at half-time. Nor were most of the Colts' balls – the officials simply ran out of time and had to relinquish the balls for the start of second half play.

football = Table.read_table('football.csv')

football.show()

Team  Ball         Blakeman  Prioleau
0     Patriots 1      11.5      11.8
0     Patriots 2      10.85     11.2
0     Patriots 3      11.15     11.5
0     Patriots 4      10.7      11
0     Patriots 5      11.1      11.45
0     Patriots 6      11.6      11.95
0     Patriots 7      11.85     12.3
0     Patriots 8      11.1      11.55
0     Patriots 9      10.95     11.35
0     Patriots 10     10.5      10.9
0     Patriots 11     10.9      11.35
1     Colts 1         12.7      12.35
1     Colts 2         12.75     12.3
1     Colts 3         12.5      12.95
1     Colts 4         12.55     12.15

For each of the 15 balls that were inspected, the two officials got different results. It is not uncommon that repeated measurements on the same object yield different results, especially when the measurements are performed by different people. So we will assign to each ball the average of the two measurements made on that ball.


football = football.with_column(
    'Combined', (football.column('Blakeman') + football.column('Prioleau')) / 2
)

football.show()

Team  Ball         Blakeman  Prioleau  Combined
0     Patriots 1      11.5      11.8     11.65
0     Patriots 2      10.85     11.2     11.025
0     Patriots 3      11.15     11.5     11.325
0     Patriots 4      10.7      11       10.85
0     Patriots 5      11.1      11.45    11.275
0     Patriots 6      11.6      11.95    11.775
0     Patriots 7      11.85     12.3     12.075
0     Patriots 8      11.1      11.55    11.325
0     Patriots 9      10.95     11.35    11.15
0     Patriots 10     10.5      10.9     10.7
0     Patriots 11     10.9      11.35    11.125
1     Colts 1         12.7      12.35    12.525
1     Colts 2         12.75     12.3     12.525
1     Colts 3         12.5      12.95    12.725
1     Colts 4         12.55     12.15    12.35

At a glance, it seems apparent that the Patriots' footballs were at a lower pressure than the Colts' balls. Because some deflation is normal during the course of a game, the independent analysts decided to calculate the drop in pressure from the start of the game. Recall that the Patriots' balls had all started out at about 12.5 psi, and the Colts' balls at about 13.0 psi. Therefore the drop in pressure for the Patriots' balls was computed as 12.5 minus the pressure at half-time, and the drop in pressure for the Colts' balls was 13.0 minus the pressure at half-time.


football = football.with_column(
    'Drop', np.array([12.5] * 11 + [13.0] * 4) - football.column('Combined')
)

football.show()

Team  Ball         Blakeman  Prioleau  Combined  Drop
0     Patriots 1      11.5      11.8     11.65    0.85
0     Patriots 2      10.85     11.2     11.025   1.475
0     Patriots 3      11.15     11.5     11.325   1.175
0     Patriots 4      10.7      11       10.85    1.65
0     Patriots 5      11.1      11.45    11.275   1.225
0     Patriots 6      11.6      11.95    11.775   0.725
0     Patriots 7      11.85     12.3     12.075   0.425
0     Patriots 8      11.1      11.55    11.325   1.175
0     Patriots 9      10.95     11.35    11.15    1.35
0     Patriots 10     10.5      10.9     10.7     1.8
0     Patriots 11     10.9      11.35    11.125   1.375
1     Colts 1         12.7      12.35    12.525   0.475
1     Colts 2         12.75     12.3     12.525   0.475
1     Colts 3         12.5      12.95    12.725   0.275
1     Colts 4         12.55     12.15    12.35    0.65

It is apparent that the drop was larger, on average, for the Patriots' footballs. But could the difference be just due to chance?

To answer this, we must first examine how chance might enter the analysis. This is not a situation in which there is a random sample of data from a large population. It is also not clear how to create a justifiable abstract chance model, as the balls were all different, inflated by different people, and maintained under different conditions.

One way to introduce chance is to ask whether the drops in pressures of the 11 Patriots balls and the 4 Colts balls resemble a random permutation of the 15 drops. Then the 4 Colts drops would be a simple random sample of all 15 drops. This gives us a null hypothesis that we can test using random permutations.


Null hypothesis. The drops in the pressures of the 4 Colts balls are like a random sample (without replacement) from all 15 drops.

A new test statistic

The data are quantitative, so we cannot compare the two distributions category by category using the total variation distance. If we try to bin the data in order to use the TVD, the choice of bins can have a noticeable effect on the statistic. So instead, we will work with a simple statistic based on means. We will just compare the average drops in the two groups.

The observed difference between the average drops in the two groups was about 0.7335 psi.

patriots = football.where('Team', 0).column('Drop')

colts = football.where('Team', 1).column('Drop')

observed_difference = np.mean(patriots) - np.mean(colts)

observed_difference

0.73352272727272805

Now the question becomes: If we took a random permutation of the 15 drops, how likely is it that the difference in the means of the first 11 and the last 4 would be at least as large as the difference observed by the officials?

To answer this, we will randomly permute the 15 drops, assigning the first 11 permuted values to the Patriots and the last 4 to the Colts. Then we will find the difference in the means of the two permuted groups.

drops_shuffled = football.sample().column('Drop')

patriots_shuffled = drops_shuffled[:11]

colts_shuffled = drops_shuffled[11:]

shuffled_difference = np.mean(patriots_shuffled) - np.mean(colts_shuffled)

shuffled_difference

0.010000000000000675


This is different from the observed value we calculated earlier. But to get a better sense of the variability under random sampling we must repeat the process many times. Let us try making 5,000 repetitions and drawing a histogram of the 5,000 differences between means.

repetitions = 5000

test_stats = []

for i in range(repetitions):
    drops_shuffled = football.sample().column('Drop')
    patriots_shuffled = drops_shuffled[:11]
    colts_shuffled = drops_shuffled[11:]
    shuffled_difference = np.mean(patriots_shuffled) - np.mean(colts_shuffled)
    test_stats.append(shuffled_difference)

observation = np.mean(patriots) - np.mean(colts)

emp_p_value = np.count_nonzero(np.array(test_stats) >= observation) / repetitions

differences = Table().with_column('Difference in Means', test_stats)

differences.hist(bins=20)

print("Observation:", observation)

print("Empirical P-value:", emp_p_value)

Observation: 0.733522727273

Empirical P-value: 0.0028


The observed difference was roughly 0.7335 psi. According to the empirical distribution above, there is a very small chance that a random permutation would yield a difference that large. So the data support the conclusion that the two groups of pressures were not like a random permutation of all 15 pressures.

The independent investigative team analyzed the data in several different ways, taking into account the laws of physics. The final report said,

"[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls."

-- Investigative report commissioned by the NFL regarding the AFC Championship game on January 18, 2015

Our analysis shows an average pressure drop of about 0.73 psi, which is consistent with the official analysis.

The all-important question in the football world was whether the excess drop of pressure in the Patriots' footballs was deliberate. To that question, the data have no answer. If you are curious about the answer given by the investigators, here is the full report.


A/BTestingInteract

UsingtheBootstrapMethodtoTestHypotheses¶Wehaveusedrandompermutationstoseewhethertwosamplesaredrawnfromthesameunderlyingdistribution.Thebootstrapmethodcanalsobeusedinsuchstatisticaltestsofhypotheses.

The examples in this section are based on data on a random sample of 1,174 pairs of mothers and their newborn infants. The table baby contains the data. Each row represents a mother-baby pair. The variables are:

- the baby's birth weight in ounces
- the number of gestational days
- the mother's age in completed years
- the mother's height in inches
- the mother's pregnancy weight in pounds
- whether or not the mother was a smoker

baby = Table.read_table('baby.csv')
baby


Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker
120          | 284              | 27           | 62              | 100                       | False
113          | 282              | 33           | 64              | 135                       | False
128          | 279              | 28           | 64              | 115                       | True
108          | 282              | 23           | 67              | 125                       | True
136          | 286              | 25           | 62              | 93                        | False
138          | 244              | 33           | 62              | 178                       | False
132          | 245              | 23           | 65              | 140                       | False
120          | 289              | 25           | 62              | 125                       | False
143          | 299              | 30           | 66              | 136                       | True
140          | 351              | 27           | 68              | 120                       | False

... (1164 rows omitted)

Suppose we want to compare the mothers who smoke and the mothers who are non-smokers. Do they differ in any way other than smoking? A key variable of interest is the birth weight of their babies. To study this variable, we begin by noting that there are 715 non-smokers among the women in the sample, and 459 smokers.

baby.group('Maternal Smoker')

Maternal Smoker | count
False           | 715
True            | 459

The first histogram below displays the distribution of birth weights of the babies of the non-smokers in the sample. The second displays the birth weights of the babies of the smokers.

baby.where('Maternal Smoker', False).hist('Birth Weight', bins=np.arange(40, 181, 5))  # bin choice assumed; original bins truncated


baby.where('Maternal Smoker', True).hist('Birth Weight', bins=np.arange(40, 181, 5))  # bin choice assumed; original bins truncated

Both distributions are approximately bell shaped and centered near 120 ounces. The distributions are not identical, of course, which raises the question of whether the difference reflects just chance variation or a difference in the distributions in the population.

This question can be answered by a test of the null and alternative hypotheses below. Because we are testing for the equality of two distributions, one for Category A (mothers who don't smoke) and the other for Category B (mothers who smoke), the method is known rather unimaginatively as A/B testing.

Null hypothesis. In the population, the distribution of birth weights of babies is the same for mothers who don't smoke as for mothers who do. The difference in the sample is due to chance.

Alternative hypothesis. The two distributions are different in the population.


Test statistic. Birth weight is a quantitative variable, so it is reasonable to use the difference between means as the test statistic. There are other reasonable test statistics as well; the difference between means is just one simple choice.

The observed difference between the means of the two groups in the sample is about 9.27 ounces.

nonsmokers_mean = np.mean(baby.where('Maternal Smoker', False).column('Birth Weight'))
smokers_mean = np.mean(baby.where('Maternal Smoker', True).column('Birth Weight'))
nonsmokers_mean - smokers_mean

9.2661425720249184

Implementing the bootstrap method

To see whether such a difference could have arisen due to chance under the null hypothesis, we could use a permutation test just as we did in the previous section. An alternative is to use the bootstrap, which we will do here.

Under the null hypothesis, the distributions of birth weights are the same for babies of women who smoke and for babies of those who don't. To see how much the distributions could vary due to chance alone, we will use the bootstrap method and draw new samples at random with replacement from the entire set of birth weights. The only difference between this and the permutation test of the previous section is that random permutations are based on sampling without replacement.
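The distinction between the two resampling schemes can be sketched with plain NumPy, using hypothetical stand-in values rather than the birth-weight data:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(10)  # hypothetical data values

# Permutation test: shuffle, i.e. sample WITHOUT replacement;
# every original value appears exactly once.
permuted = rng.permutation(values)

# Bootstrap: sample WITH replacement, same size as the original;
# some values may repeat and others may be missing.
resampled = rng.choice(values, size=len(values), replace=True)

print(sorted(permuted) == sorted(values))   # True: a permutation keeps all the values
```

A bootstrap resample, by contrast, generally loses some values and duplicates others, which is exactly the difference described above.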

We will perform 10,000 repetitions of the bootstrap process. There are 1,174 babies, so in each repetition of the bootstrap process we draw 1,174 times at random with replacement. Of these 1,174 draws, 715 are assigned to the non-smoking mothers and the remaining 459 to the mothers who smoke.

The code starts by sorting the rows so that the rows corresponding to the 715 non-smokers appear first. This eliminates the need to use conditional expressions to identify the rows corresponding to smokers and non-smokers in each replication of the sampling process.


"""Bootstrap test for the difference in mean birth weight

Category A: non-smoker; Category B: smoker"""

# Sort so that the 715 non-smokers (False) appear first
sorted_by_smoking = baby.select(['Maternal Smoker', 'Birth Weight']).sort('Maternal Smoker')

# Calculate the observed difference in means
meanA = np.mean(sorted_by_smoking.column('Birth Weight')[:715])
meanB = np.mean(sorted_by_smoking.column('Birth Weight')[715:])
observed_difference = meanA - meanB

repetitions = 10000
diffs = []

for i in range(repetitions):
    # Sample WITH replacement, same number as original sample size
    resample = sorted_by_smoking.sample(with_replacement=True)
    # Compute the difference of the means of the resampled values,
    # between Categories A and B
    diff_means = np.mean(resample.column('Birth Weight')[:715]) - \
                 np.mean(resample.column('Birth Weight')[715:])
    diffs.append(diff_means)

# Display results
differences = Table().with_column('diff_in_means', diffs)
differences.hist(bins=20)
plots.xlabel('Approx null distribution of difference in means')
plots.title('Bootstrap A/B Test')
plots.xlim(-4, 4)
plots.ylim(0, 0.4)
print('Observed difference in means:', observed_difference)

Observed difference in means: 9.26614257202


The figure shows that under the null hypothesis of equal distributions in the population, the bootstrap empirical distribution of the difference between the sample means of the two groups is roughly bell shaped, centered at 0, stretching from about -4 ounces to 4 ounces. The observed difference in the original sample is about 9.27 ounces, which is inconsistent with this distribution. So the conclusion of the test is that in the population, the distributions of birth weights of the babies of non-smokers and smokers are different.

Bootstrap A/B testing

Let us define a function bootstrap_AB_test that performs an A/B test using the bootstrap method and the difference in sample means as the test statistic. The null hypothesis is that the two underlying distributions in the population are equal; the alternative is that they are not.

The arguments of the function are:

- the name of the table that contains the data in the original sample
- the label of the column containing the code 0 for Category A and 1 for Category B
- the label of the column containing the values of the response variable (that is, the variable whose distribution is of interest)
- the number of repetitions of the resampling procedure

The function returns the observed difference in means, the bootstrap empirical distribution of the difference in means, and the bootstrap empirical P-value. Because the alternative simply says that the two underlying distributions are different, the P-value is computed as the proportion of sampled differences that are at least as large in absolute value as the absolute value of the observed difference.
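That two-sided P-value computation can be sketched on its own, with hypothetical resampled differences standing in for the bootstrap output:

```python
import numpy as np

# Hypothetical resampled differences and an observed difference
diffs = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.5, 3.0])
observed_difference = 1.5

# Two-sided empirical P-value: the proportion of resampled differences
# at least as extreme (in absolute value) as the observed difference
empirical_p = np.count_nonzero(np.abs(diffs) >= abs(observed_difference)) / len(diffs)
print(empirical_p)   # 3 of the 7 values qualify, so about 0.429
```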


"""Bootstrap A/B test for the difference in the mean response

Assumes A = 0, B = 1"""

def bootstrap_AB_test(table, categories, values, repetitions):

    # Sort the sample table according to the A/B column;
    # then select only the column of effects.
    response = table.sort(categories).select(values)

    # Find the number of entries in Category A.
    n_A = table.where(categories, 0).num_rows

    # Calculate the observed value of the test statistic.
    meanA = np.mean(response.column(values)[:n_A])
    meanB = np.mean(response.column(values)[n_A:])
    observed_difference = meanA - meanB

    # Run the bootstrap procedure and get a list of resampled differences in means
    diffs = []
    for i in range(repetitions):
        resample = response.sample(with_replacement=True)
        d = np.mean(resample.column(values)[:n_A]) - \
            np.mean(resample.column(values)[n_A:])
        diffs.append(d)

    # Compute the bootstrap empirical P-value
    diff_array = np.array(diffs)
    empirical_p = np.count_nonzero(abs(diff_array) >= abs(observed_difference)) / repetitions

    # Display results
    differences = Table().with_column('diff_in_means', diffs)
    differences.hist(bins=20, normed=True)
    plots.xlabel('Approx null distribution of difference in means')
    plots.title('Bootstrap A/B Test')
    print("Observed difference in means:", observed_difference)
    print("Bootstrap empirical P-value:", empirical_p)

We can now use the function bootstrap_AB_test to compare the smokers and non-smokers with respect to several different response variables. The tests show a statistically significant difference between the two groups in birth weight (as shown earlier), gestational days, maternal age, and maternal pregnancy weight. It comes as no surprise that the two groups do not differ significantly in their mean heights.

bootstrap_AB_test(baby, 'Maternal Smoker', 'Birth Weight', 10000)

Observed difference in means: 9.26614257202
Bootstrap empirical P-value: 0.0

bootstrap_AB_test(baby, 'Maternal Smoker', 'Gestational Days', 10000)

Observed difference in means: 1.97652238829
Bootstrap empirical P-value: 0.0367


bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Age', 10000)

Observed difference in means: 0.80767250179
Bootstrap empirical P-value: 0.0192

bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Pregnancy Weight', 10000)

Observed difference in means: 2.56033030151
Bootstrap empirical P-value: 0.0374


bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Height', 10000)

Observed difference in means: -0.0905891494127
Bootstrap empirical P-value: 0.5411


Regression Inference

Regression: a Review

Earlier in the course, we developed the concepts of correlation and regression as ways to describe relations between quantitative variables. We will now see how these concepts can become powerful tools for inference, when used appropriately.

First, let us briefly review the main ideas of correlation and regression, along with code for calculations.

It is often convenient to measure variables in standard units. When you convert a value to standard units, you are measuring how many standard deviations above average the value is.

The function standard_units takes an array as its argument and returns the array converted to standard units.

def standard_units(x):
    return (x - np.mean(x)) / np.std(x)

The correlation between two variables measures the degree to which a scatter plot of the two variables is clustered around a straight line. It has no units and is a number between -1 and 1. It is calculated as the average of the product of the two variables, when both variables are measured in standard units.

The function correlation takes three arguments – the name of a table and the labels of two columns of the table – and returns the correlation between the two columns.

def correlation(table, x, y):
    x_in_standard_units = standard_units(table.column(x))
    y_in_standard_units = standard_units(table.column(y))
    return np.mean(x_in_standard_units * y_in_standard_units)
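As a sanity check, the same definition applied to hypothetical plain arrays agrees with NumPy's built-in correlation coefficient:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

# Correlation as defined above: average product of standard units
r = np.mean(standard_units(x) * standard_units(y))

print(np.isclose(r, np.corrcoef(x, y)[0, 1]))   # True
```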

Here is the scatter plot of birth weight (the y-variable) versus gestational days (the x-variable) of the babies in the table baby. The plot shows a positive association, and correlation returns a value of just over 0.4.


correlation(baby, 'Gestational Days', 'Birth Weight')

0.40754279338885108

baby.scatter('Gestational Days', 'Birth Weight')

When scatter is called with the option fit_line=True, the regression line is drawn on the scatter plot.

baby.scatter('Gestational Days', 'Birth Weight', fit_line=True)


The regression line is the best among all straight lines that could be used to estimate y based on x, in the sense that it minimizes the mean squared error of estimation.

The slope of the regression line is given by:

    slope = r * (SD of y) / (SD of x)

where r is the correlation.

The intercept of the regression line is given by:

    intercept = (mean of y) - slope * (mean of x)

The functions slope and intercept calculate these two quantities. Each takes the same three arguments as correlation: the name of the table and the labels of columns x and y, in that order.

def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y)) / np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))
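These formulas give the same line as a direct least-squares fit. A quick check with hypothetical arrays:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

# Slope and intercept from the formulas above
r = np.mean(((x - x.mean()) / x.std()) * ((y - y.mean()) / y.std()))
a = r * y.std() / x.std()
b = y.mean() - a * x.mean()

# Least-squares line of degree 1, fit directly by NumPy
a_np, b_np = np.polyfit(x, y, 1)
print(np.allclose([a, b], [a_np, b_np]))   # True
```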


We can now calculate the slope and the intercept of the regression line drawn above. The slope is about 0.47 ounces per gestational day.

slope(baby, 'Gestational Days', 'Birth Weight')

0.46655687694921522

The intercept is about -10.75 ounces.

intercept(baby, 'Gestational Days', 'Birth Weight')

-10.754138914450252

Thus the equation of the regression line for estimating birth weight based on gestational days is approximately:

    estimated birth weight = 0.47 * (gestational days) - 10.75 ounces

The fitted value at a given value of x is the estimate of y based on that value of x. In other words, the fitted value at a given value of x is the height of the regression line at that x.

The function fitted_value computes this height. Like the functions correlation, slope, and intercept, its arguments include the name of the table and the labels of the x and y columns. But it also requires a fourth argument, which is the value of x at which the estimate will be made.

def fitted_value(table, x, y, given_x):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * given_x + b

The fitted value at 300 gestational days is about 129.2 ounces. For a pregnancy that has length 300 gestational days, our estimate for the baby's weight is about 129.2 ounces.

fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)


129.2129241703143

The function fit returns an array consisting of the fitted values at all of the values of x in the table.

def fit(table, x, y):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

As we know, the Table method scatter can be used to draw a scatter plot with a regression line through it. Here is another way, using fit. The function scatter_fit takes the same three arguments as fit and returns a scatter diagram along with the straight line of fitted values.

def scatter_fit(table, x, y):
    plots.scatter(table.column(x), table.column(y), s=15)
    plots.plot(table.column(x), fit(table, x, y), lw=2, color='darkblue')
    plots.xlabel(x)
    plots.ylabel(y)

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

# The next two lines just make the plot easier to read
plots.ylim([0, 200])
None


Assumptions of randomness: a "regression model"

Thus far in this section, our analysis has been purely descriptive. But recall that the data are a random sample of all the births in a system of hospitals. What do the data say about the whole population of births? What could they say about a new birth at one of the hospitals?

Questions of inference may arise if we believe that a scatter plot reflects the underlying relation between the two variables being plotted but does not specify the relation completely. For example, a scatter plot of birth weight versus gestational days shows us the precise relation between the two variables in our sample; but we might wonder whether that relation holds true, or almost true, for all babies in the population from which the sample was drawn, or indeed among babies in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as models. Sets of assumptions about randomness in roughly linear scatter plots are called regression models.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the signal that we would like to identify. However, we are not able to see the line clearly. What we see are points that are scattered around the line. In each of the points, the signal has been contaminated by random noise. Our inferential goal, therefore, is to separate the signal from the noise.

In greater detail, the regression model specifies that the points in the scatter plot are generated at random as follows.

The relation between x and y is perfectly linear. We cannot see this "true line" but Tyche can. She is the Goddess of Chance. Tyche creates the scatter plot by taking points on the line and pushing them off the line vertically, either above or below, as follows:

For each x, Tyche finds the corresponding point on the true line, and then adds an error. The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0. Tyche creates a point whose horizontal coordinate is x and whose vertical coordinate is "the height of the true line at x, plus the error".

Finally, Tyche erases the true line from the scatter, and shows us just the scatter plot of her points.

Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below shows how close the regression line is to the true line. The first panel shows how Tyche generates the scatter plot from the true line; the second shows the scatter plot that we see; the third shows the regression line through the plot; and the fourth shows both the regression line and the true line.

To run the simulation, call the function draw_and_compare with three arguments: the slope of the true line, the intercept of the true line, and the sample size.

Run the simulation a few times, with different values for the slope and intercept of the true line, and varying sample sizes. Because all the points are generated according to the model, you will see that the regression line is a good estimate of the true line if the sample size is moderately large.


def draw_and_compare(true_slope, true_int, sample_size):
    x = np.random.normal(50, 5, sample_size)
    xlims = np.array([np.min(x), np.max(x)])
    eps = np.random.normal(0, 6, sample_size)
    y = (true_slope * x + true_int) + eps
    tyche = Table([x, y], ['x', 'y'])

    plots.figure(figsize=(6, 16))
    plots.subplot(4, 1, 1)
    plots.scatter(tyche['x'], tyche['y'], s=15)
    plots.plot(xlims, true_slope * xlims + true_int, lw=2, color='green')
    plots.title('What Tyche draws')

    plots.subplot(4, 1, 2)
    plots.scatter(tyche['x'], tyche['y'], s=15)
    plots.title('What we get to see')

    plots.subplot(4, 1, 3)
    scatter_fit(tyche, 'x', 'y')
    plots.xlabel("")
    plots.ylabel("")
    plots.title('Regression line: our estimate of true line')

    plots.subplot(4, 1, 4)
    scatter_fit(tyche, 'x', 'y')
    xlims = np.array([np.min(tyche['x']), np.max(tyche['x'])])
    plots.plot(xlims, true_slope * xlims + true_int, lw=2, color='green')
    plots.title("Regression line and true line")

# Tyche's true line,
# the points she creates,
# and our estimate of the true line.
# Arguments: true slope, true intercept, number of points
draw_and_compare(3, -5, 25)


Prediction using Regression

In reality, of course, we are not Tyche, and we will never see the true line. What the simulation shows is that if the regression model looks plausible, and if we have a large sample, then the regression line is a good approximation to the true line.

The scatter diagram of birth weights versus gestational days looks roughly linear. Let us assume that the regression model holds. Suppose now that at the hospital there is a new baby who has 300 gestational days. We can use the regression line to predict the birth weight of this baby.

As we saw earlier, the fitted value at 300 gestational days was about 129.2 ounces. That is our prediction for the birth weight of the new baby.

fit_300 = fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)
fit_300

129.2129241703143

The figure below shows where the prediction lies on the regression line. The red line is at x = 300.

scatter_fit(baby, 'Gestational Days', 'Birth Weight')
plots.scatter(300, fit_300, color='red', s=20)
plots.plot([300, 300], [0, fit_300], color='red', lw=1)
plots.ylim([0, 200])
None


The Variability of the Prediction

We have developed a method for predicting a new baby's birth weight based on the number of gestational days. But as data scientists, we know that the sample might have been different. Had the sample been different, the regression line would have been different too, and so would our prediction. To see how good our prediction is, we must get a sense of how variable the prediction can be.

One way to do this would be by generating new random samples of points and making a prediction based on each new sample. To generate new samples, we can bootstrap the scatter plot.

Specifically, we can simulate new samples by random sampling with replacement from the original scatter plot, as many times as there are points in the scatter.

Here is the original scatter diagram from the sample, and four replications of the bootstrap resampling procedure. Notice how the resampled scatter plots are in general a little more sparse than the original. That is because some of the original points do not get selected in the samples.
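How much sparser? In a resample of size n drawn with replacement, the expected fraction of distinct original points is 1 - (1 - 1/n)^n, which is close to 1 - 1/e, or about 63%, for large n. A quick simulation with plain NumPy, using the sample size of the baby table:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1174                                         # number of points in the scatter
resample = rng.choice(n, size=n, replace=True)   # indices drawn with replacement
distinct_fraction = len(np.unique(resample)) / n
print(round(distinct_fraction, 2))               # typically close to 0.63
```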


plots.figure(figsize=(8, 18))
plots.subplot(5, 1, 1)
plots.scatter(baby[1], baby[0], s=10)
plots.xlim([150, 400])
plots.title('Original sample')

for i in np.arange(1, 5, 1):
    plots.subplot(5, 1, i + 1)
    rep = baby.sample(with_replacement=True)
    plots.scatter(rep[1], rep[0], s=10)
    plots.xlim([150, 400])
    plots.title('Bootstrap sample ' + str(i))


The next step is to fit the regression line to the scatter plot in each replication, and make a prediction based on each line. The figure below shows 10 such lines, and the corresponding predicted birth weight at 300 gestational days.

x = 300
lines = Table(['slope', 'intercept'])

for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

lines['prediction at x=' + str(x)] = lines.column('slope') * x + lines.column('intercept')

xlims = np.array([291, 309])
left = xlims[0] * lines[0] + lines[1]
right = xlims[1] * lines[0] + lines[1]
fit_x = x * lines['slope'] + lines['intercept']

for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x, fit_x[i], s=20)

The predictions vary from one line to the next. The table below shows the slope and intercept of each of the 10 lines, along with the prediction.


lines

slope    | intercept | prediction at x=300
0.446731 | -5.08136  | 128.938
0.417925 | 3.65604   | 129.034
0.395636 | 8.95328   | 127.644
0.512511 | -22.9981  | 130.755
0.477034 | -13.4303  | 129.68
0.515773 | -24.5881  | 130.144
0.48213  | -15.7025  | 128.936
0.492106 | -18.7225  | 128.909
0.434989 | -1.72409  | 128.773
0.486076 | -16.5506  | 129.272

Bootstrap Prediction Interval

If we increase the number of repetitions of the resampling process, we can generate an empirical histogram of the predictions. This will allow us to create a prediction interval, using methods like those we used earlier to create bootstrap confidence intervals for numerical parameters.

Let us define a function called bootstrap_prediction to do this. The function takes five arguments:

- the name of the table
- the column labels of the predictor and response variables, in that order
- the value of x at which to make the prediction
- the desired number of bootstrap repetitions

In each repetition, the function bootstraps the original scatter plot and calculates the equation of the regression line through the bootstrapped plot. It then uses this line to find the predicted value of y based on the specified value of x. Specifically, it calls the function fitted_value that we defined earlier in this section, to calculate the slope and intercept of the regression line and then calculate the fitted value at the specified x.


Finally, it collects all of the predicted values into a column of a table, draws the empirical histogram of these values, and prints the interval consisting of the "middle 95%" of the predicted values. It also prints the predicted value based on the regression line through the original scatter plot.

# Bootstrap prediction of variable y at new_x
# Data contained in table; prediction by regression of y based on x
# repetitions = number of bootstrap replications of the original scatter plot

def bootstrap_prediction(table, x, y, new_x, repetitions):

    # For each repetition:
    # Bootstrap the scatter;
    # get the regression prediction at new_x;
    # augment the predictions list
    predictions = []
    for i in range(repetitions):
        bootstrap_sample = table.sample(with_replacement=True)
        predicted_value = fitted_value(bootstrap_sample, x, y, new_x)
        predictions.append(predicted_value)

    # Prediction based on original sample
    original = fitted_value(table, x, y, new_x)

    # Display results
    pred = Table().with_column('Prediction', predictions)
    pred.hist(bins=20)
    plots.xlabel('predictions at x=' + str(new_x))
    print('Height of regression line at x=' + str(new_x) + ':', original)
    print('Approximate 95%-confidence interval:')
    print((pred.percentile(2.5).rows[0][0], pred.percentile(97.5).rows[0][0]))

bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 300, 5000)

Height of regression line at x=300: 129.21292417
Approximate 95%-confidence interval:
(127.25683835787581, 131.33042912205815)


The figure above shows a bootstrap empirical histogram of the predicted birth weight of a baby at 300 gestational days, based on 5,000 repetitions of the bootstrap process. The empirical distribution is roughly normal.

An approximate 95% prediction interval has been constructed by taking the "middle 95%" of the predictions, that is, the interval from the 2.5th percentile to the 97.5th percentile of the predictions. The interval ranges from about 127 to about 131. The prediction based on the original sample was about 129, which is close to the center of the interval.

The figure below shows the histogram of 5,000 bootstrap predictions at 285 gestational days. The prediction based on the original sample is about 122 ounces, and the interval ranges from about 121 ounces to about 123 ounces.

bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 285, 5000)

Height of regression line at x=285: 122.214571016
Approximate 95%-confidence interval:
(121.17047827189757, 123.29506281889765)


Notice that this interval is narrower than the prediction interval at 300 gestational days. Let us investigate the reason for this.

The mean number of gestational days is about 279 days:

np.mean(baby['Gestational Days'])

279.10136286201021

So 285 is nearer to the center of the distribution than 300 is. Typically, the regression lines based on the bootstrap samples are closer to each other near the center of the distribution of the predictor variable. Therefore all of the predicted values are closer together as well. This explains the narrower width of the prediction interval.

You can see this in the figure below, which shows predictions at x = 300 and x = 285 for each of ten bootstrap replications. Typically, the lines are farther apart at x = 300 than at x = 285, and therefore the predictions at x = 300 are more variable.


x1 = 300
x2 = 285
lines = Table(['slope', 'intercept'])

for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

xlims = np.array([260, 310])
left = xlims[0] * lines[0] + lines[1]
right = xlims[1] * lines[0] + lines[1]
fit_x1 = x1 * lines['slope'] + lines['intercept']
fit_x2 = x2 * lines['slope'] + lines['intercept']

plots.xlim(xlims)
for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x1, fit_x1[i], s=20)
    plots.scatter(x2, fit_x2[i], s=20)

Is the observed slope real?


We have developed a method of predicting the birth weight of a new baby in the hospital system, based on the baby's number of gestational days. Our prediction makes use of the positive association that we observed between birth weight and gestational days.

But what if that observation is spurious? In other words, what if Tyche's true line was flat, and the positive association that we observed was just due to randomness in the points that she generated?

Here is a simulation that illustrates why this question arises. We will once again call the function draw_and_compare, this time requiring Tyche's true line to have slope 0. Our goal is to see whether our regression line shows a slope that is not 0.

Remember that the arguments to the function draw_and_compare are the slope and the intercept of Tyche's true line, and the number of points that she generates.

draw_and_compare(0, 10, 25)


Run the simulation a few times, keeping the slope of Tyche's line 0 each time. You will notice that while the slope of the true line is 0, the slope of the regression line typically is not 0. The regression line sometimes slopes upwards, and sometimes downwards, each time giving us a false impression that the two variables are correlated.

Testing whether the slope of the true line is 0

These simulations show that before we put too much faith in our regression predictions, it is worth checking whether the true line does indeed have a slope that is not 0.

We are in a good position to do this, because we know how to bootstrap the scatter diagram and draw a regression line through each bootstrapped plot. Each of those lines has a slope, so we can simply collect all the slopes and draw their empirical histogram. We can then construct an approximate 95% confidence interval for the slope of the true line, using the bootstrap percentile method that is now so familiar.

If the confidence interval for the true slope does not contain 0, then we can be about 95% confident that the true slope is not 0. If the interval does contain 0, then we cannot conclude that there is a genuine linear association between the variable that we are trying to predict and the variable that we have chosen to use as a predictor. We might therefore want to look for a different predictor variable on which to base our predictions.

The function bootstrap_slope executes our plan. Its arguments are the name of the table and the labels of the predictor and response variables, and the desired number of bootstrap replications. In each replication, the function bootstraps the original scatter plot and calculates the slope of the resulting regression line. It then draws the histogram of all the generated slopes and prints the interval consisting of the "middle 95%" of the slopes.

Notice that the code for bootstrap_slope is almost identical to that for bootstrap_prediction, except that now all we need from each replication is the slope, not a predicted value.


def bootstrap_slope(table, x, y, repetitions):

    # For each repetition:
    # Bootstrap the scatter, get the slope of the regression line,
    # augment the list of generated slopes
    slopes = []
    for i in range(repetitions):
        bootstrap_scatter = table.sample(with_replacement=True)
        slopes.append(slope(bootstrap_scatter, x, y))

    # Slope of the regression line from the original sample
    observed_slope = slope(table, x, y)

    # Display results
    slopes = Table().with_column('slopes', slopes)
    slopes.hist(bins=20)
    print('Slope of regression line:', observed_slope)
    print('Approximate 95%-confidence interval for the true slope:')
    print((slopes.percentile(2.5).rows[0][0], slopes.percentile(97.5).rows[0][0]))

Let us call bootstrap_slope to see if the true linear relation between birth weight and gestational days has slope 0; in other words, if Tyche's true line is flat.

bootstrap_slope(baby, 'Gestational Days', 'Birth Weight', 5000)

Slope of regression line: 0.466556876949
Approximate 95%-confidence interval for the true slope:
(0.37981209488027268, 0.55753912103512426)


The approximate 95% confidence interval for the true slope runs from about 0.38 ounces per day to about 0.56 ounces per day. This interval does not contain 0. Therefore we can conclude, with about 95% confidence, that the slope of the true line is not 0.

This conclusion supports our choice of using gestational days to predict birth weight.

Suppose now we try to estimate the birth weight of the baby based on the mother's age. Based on the sample, the slope of the regression line for estimating birth weight based on maternal age is positive, about 0.08 ounces per year:

slope(baby, 'Maternal Age', 'Birth Weight')

0.085007669415825132

But an approximate 95% bootstrap confidence interval for the true slope has a negative left endpoint and a positive right endpoint – in other words, the interval contains 0. Therefore we cannot reject the hypothesis that the slope of the true linear relation between maternal age and baby's birth weight is 0. Based on this observation, it would be unwise to predict birth weight based on the regression model with maternal age as the predictor.

bootstrap_slope(baby, 'Maternal Age', 'Birth Weight', 5000)

Slope of regression line: 0.0850076694158
Approximate 95%-confidence interval for the true slope:
(-0.10015674488514391, 0.26974979944992733)


Using confidence intervals to test hypotheses

Notice how we have used confidence intervals for the true slope to test the hypothesis that the true slope is 0. Constructing a confidence interval for a parameter is one way to test the null hypothesis that the parameter has a specified value. If that value is not in the confidence interval, we can reject the null hypothesis that the parameter has that value. If the value is in the confidence interval, we do not have enough evidence to reject the null hypothesis.

If the sample size is large and if the empirical distribution of the estimate of the parameter is roughly normal, then this method of testing via confidence intervals is justifiable. Using a 95% confidence interval corresponds to using a 5% cutoff for the P-value. The justifications of these statements are rather technical, and we will leave them to a more advanced course. For now, it is enough to note that confidence intervals can be used to test hypotheses in this way.
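The decision rule can be sketched in a few lines, using hypothetical bootstrapped slopes in place of real resampling output:

```python
import numpy as np

# Hypothetical bootstrapped slopes, clustered away from 0
boot_slopes = np.array([0.38, 0.41, 0.44, 0.47, 0.50, 0.53, 0.56])

left = np.percentile(boot_slopes, 2.5)
right = np.percentile(boot_slopes, 97.5)

# Test at the 5% level: reject "true slope is 0" iff 0 lies outside the interval
reject_zero_slope = not (left <= 0 <= right)
print(reject_zero_slope)   # True: the interval is entirely above 0
```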

Indeed, the method is more constructive than other methods of testing hypotheses, because not only does it result in a decision about whether or not to reject the null hypothesis, it also provides an estimate of the parameter. For example, in the example of using gestational days to estimate birth weights, we didn't just conclude that the true slope was not 0 – we also estimated that the true slope is between 0.38 and 0.56 ounces per day.

Words of caution

All of the predictions and tests that we have performed in this section assume that the regression model holds. Specifically, the methods assume that the scatter plot resembles points generated by starting with points that are on a straight line and then pushing them off the line by adding random noise.


If the scatter plot does not look like that, then perhaps the model does not hold for the data. If the model does not hold, then calculations that assume the model to be true are not valid.

Therefore, we must first decide whether the regression model holds for our data, before we start making predictions based on the model or testing hypotheses about parameters of the model. A simple way is to do what we did in this section, which is to draw the scatter diagram of the two variables and see whether it looks roughly linear and evenly spread out around a line. In the next section, we will study some more tools for assessing the model.


Regression Diagnostics

The methods that we have developed for inference in regression are valid if the regression model holds for our data. If the data do not satisfy the assumptions of the model, then it can be hard to justify calculations that assume that the model is good.

In this section we will develop some simple descriptive methods that can help us decide whether our data satisfy the regression model.

We will start with a review of the details of the model, which specifies that the points in the scatter plot are generated at random as follows.

Tyche creates the scatter plot by starting with points that lie perfectly on a straight line and then pushing them off the line vertically, either above or below, as follows:

For each $x$, Tyche finds the corresponding point on the true line, and then adds an error. The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0 and an unknown standard deviation. Tyche creates a point whose horizontal coordinate is $x$ and whose vertical coordinate is "the height of the true line at $x$, plus the error". If the error is positive, the point is above the line. If the error is negative, the point is below the line.

We get to see the resulting points, but not the true line on which the points started out, nor the random errors that Tyche added.
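The model is easy to simulate; a minimal sketch in numpy, where the slope, intercept, and error SD are made-up values:

```python
import numpy as np

rng = np.random.default_rng(7)
true_slope, true_intercept, error_sd = 0.5, 10.0, 2.0  # hidden in practice

x = rng.uniform(0, 20, 300)
errors = rng.normal(0, error_sd, 300)             # mean-0 normal errors
y = true_intercept + true_slope * x + errors      # true line height + error

# We observe only the pairs (x, y); the true line and the individual
# errors stay hidden, exactly as in the model described above.
```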

First Diagnostic: The Scatter Plot

It is always helpful to start with a visualization. Continuing the example of estimating birth weight based on gestational days, let us look at a scatter plot of the two variables.

baby.scatter('Gestational Days', 'Birth Weight')


The points do appear to be roughly clustered around a straight line. There is no strikingly noticeable curve in the scatter.

A plot of the regression line through the scatter plot shows that about half the points are above the line and half are below the line, almost symmetrically, supporting the model's assumption that Tyche is adding random errors that are normally distributed with mean 0.

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

Second Diagnostic: A Residual Plot


We cannot see the true line that is the basis of the regression model, but the regression line that we calculate acts as a substitute for it. So also the vertical distances of the points from the regression line act as substitutes for the random errors that Tyche uses to push the points vertically off the true line.

As we have seen earlier, the vertical distances of the points from the regression line are called residuals. There is one residual for each point in the scatter plot. The residual is the difference between the observed value of $y$ and the fitted value of $y$:

For the point $(x, y)$,

$$\text{residual} = y - \hat{y}$$

where $\hat{y}$ is the fitted value at $x$.

The table baby_regression contains the data, the fitted values, and the residuals:

baby2 = baby.select(['Gestational Days', 'Birth Weight'])

fitted = fit(baby, 'Gestational Days', 'Birth Weight')

residuals = baby.column('Birth Weight') - fitted

baby_regression = baby2.with_columns([
    'Fitted Value', fitted,
    'Residual', residuals
])

baby_regression

Gestational Days  Birth Weight  Fitted Value  Residual
284  120  121.748  -1.74801
282  113  120.815  -7.8149
279  128  119.415  8.58477
282  108  120.815  -12.8149
286  136  122.681  13.3189
244  138  103.086  34.9143
245  132  103.552  28.4477
289  120  124.081  -4.0808
299  143  128.746  14.2536
351  140  153.007  -13.0073
... (1164 rows omitted)
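The same fitted values and residuals can be computed without the book's fit helper; a sketch in plain numpy, on a few made-up rows standing in for the gestational days and birth weights above:

```python
import numpy as np

def fitted_values(x, y):
    """Heights of the least-squares regression line at each x."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope * x + intercept

# Made-up values in the spirit of the table above.
x = np.array([284., 282., 279., 282., 286., 244., 245., 289., 299., 351.])
y = np.array([120., 113., 128., 108., 136., 138., 132., 120., 143., 140.])

fitted = fitted_values(x, y)
residuals = y - fitted        # one residual per point
```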


Once again, it helps to visualize. The figure below shows four of the residuals; for each of the four points, the length of the red segment is the residual for that point. There is one such residual for each of the other 1,170 points as well.

points = Table().with_columns([
    'Gestational Days', [148, 244, 338, 351],
    'Birth Weight', [116, 138, 102, 140]
])

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

for i in range(4):
    f = fitted_value(baby, 'Gestational Days', 'Birth Weight', points.column(0)[i])
    plots.plot([points.column(0)[i]]*2, [f, points.column(1)[i]], color='r')

According to the regression model, Tyche generates errors by drawing at random with replacement from a population of errors that is normally distributed with mean 0. To see whether this looks plausible for our data, we will examine the residuals, which are our substitutes for Tyche's errors.

A residual plot can be drawn by plotting the residuals against $x$. The function residual_plot does just that. The function regression_diagnostic_plots draws the original scatter plot as well as the residual plot for ease of comparison.


def residual_plot(table, x, y):
    fitted = fit(table, x, y)
    residuals = table[y] - fitted
    t = Table().with_columns([
        x, table[x],
        'residuals', residuals
    ])
    t.scatter(x, 'residuals', color='r')
    xlims = [min(table[x]), max(table[x])]
    plots.plot(xlims, [0, 0], color='darkblue', lw=2)
    plots.title('Residual Plot')

def regression_diagnostic_plots(table, x, y):
    scatter_fit(table, x, y)
    residual_plot(table, x, y)

regression_diagnostic_plots(baby, 'Gestational Days', 'Birth Weight')


The height of each red point in the plot above is a residual. Notice how the residuals are distributed fairly symmetrically above and below the horizontal line at 0. Notice also that the vertical spread of the plot is fairly even across the most common values of $x$, in about the 200-300 range of gestational days. In other words, apart from a few outlying points, the plot isn't narrower in some places and wider in others.

All of this is consistent with the model's assumptions that the errors are drawn at random with replacement from a normally distributed population with mean 0. Indeed, a histogram of the residuals is centered at 0 and looks roughly bell shaped.

baby_regression.select('Residual').hist()


These observations show us that the data are fairly consistent with the assumptions of the regression model. Therefore it makes sense to do inference based on the model, as we did in the previous section.

Using the Residual Plot to Detect Nonlinearity

Residual plots can be used to detect whether the regression model does not hold. The most serious departure from the model is non-linearity.

If two variables have a non-linear relation, the non-linearity is sometimes visible in the scatter plot. Often, however, it is easier to spot non-linearity in a residual plot than in the original scatter plot. This is usually because of the scales of the two plots: the residual plot allows us to zoom in on the errors and hence makes it easier to spot patterns.
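The effect is easy to reproduce on synthetic data (all values here are made up): fit a straight line to points generated from a curve, and the residuals show a systematic sign pattern.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 200))
y = (x - 5) ** 2 + rng.normal(0, 1, 200)   # quadratic relation plus noise

slope, intercept = np.polyfit(x, y, 1)     # force a straight-line fit
residuals = y - (slope * x + intercept)

# A line through a U-shaped cloud underestimates at the ends and
# overestimates in the middle, so the average residual is positive
# at the extremes of x and negative in the middle.
low  = residuals[x < 2].mean()
mid  = residuals[(x > 4) & (x < 6)].mean()
high = residuals[x > 8].mean()
```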

As an example, we will study a data set on the age and length of dugongs, which are marine mammals related to manatees and sea cows. The data are in a table called dugong. Age is measured in years and length in meters. The ages are estimated; most dugongs don't keep track of their birthdays.

dugong = Table.read_table('http://www.statsci.org/data/oz/dugongs.txt')

dugong


Age   Length
1     1.8
1.5   1.85
1.5   1.87
1.5   1.77
2.5   2.02
4     2.27
5     2.15
5     2.26
7     2.35
8     2.47
... (17 rows omitted)

Here is a regression of length (the response) on age (the predictor). The correlation between the two variables is about 0.83, which is quite high.

correlation(dugong, 'Age', 'Length')

0.82964745549057139

scatter_fit(dugong, 'Age', 'Length')


High correlation notwithstanding, the plot shows a curved pattern that is much more visible in the residual plot.

regression_diagnostic_plots(dugong, 'Age', 'Length')


At the low end of ages, the residuals are almost all negative; then they are almost all positive; then negative again at the high end of ages. Such a discernible pattern would have been unlikely had Tyche been picking errors at random with replacement and using them to push points off a straight line. Therefore the regression model does not hold for these data.

As another example, the data set us_women contains the average weights of U.S. women of different heights in the 1970s. Weights are measured in pounds and heights in inches. The correlation is spectacularly high – more than 0.99.

us_women = Table.read_table('us_women.csv')

correlation(us_women, 'height', 'ave weight')

0.99549476778421608

The points seem to hug the regression line pretty closely, but the residual plot flashes a warning sign: it is U-shaped, indicating that the relation between the two variables is not linear.

regression_diagnostic_plots(us_women, 'height', 'ave weight')


These examples demonstrate that even when correlation is high, fitting a straight line might not be the right thing to do. Residual plots help us decide whether or not to use linear regression.

Using the Residual Plot to Detect Heteroscedasticity

Heteroscedasticity is a word that will surely be of interest to those who are preparing for Spelling Bees. For data scientists, its interest lies in its meaning, which is "uneven spread".

Recall the table hybrid that contains data on hybrid cars in the U.S. Here is a regression of fuel efficiency on the rate of acceleration. The association is negative: cars that accelerate quickly tend to be less efficient.


regression_diagnostic_plots(hybrid, 'acceleration', 'mpg')

Notice how the residual plot flares out towards the low end of the accelerations. In other words, the variability in the size of the errors is greater for low values of acceleration than for high values. Such a pattern would be unlikely under the regression model, because the model assumes that Tyche picks the errors at random with replacement from the same normally distributed population.

Uneven vertical variation in a residual plot can indicate that the regression model does not hold. Uneven variation is often more easily noticed in a residual plot than in the original scatter plot.
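A rough numerical check for uneven spread, sketched on made-up data (not a formal test): split the points at the median of $x$ and compare the SD of the residuals in the two halves.

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.sort(rng.uniform(0, 10, 400))
noise_sd = 0.5 + 0.5 * x                 # error spread grows with x
y = 2 * x + rng.normal(0, noise_sd)      # heteroscedastic errors

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

left_sd  = residuals[x < np.median(x)].std()
right_sd = residuals[x >= np.median(x)].std()
ratio = right_sd / left_sd   # well above 1 here: the spread is uneven
```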


The Accuracy of the Regression Estimate

We will end our discussion of regression by pointing out some interesting mathematical facts that lead to a measure of the accuracy of regression. We will leave the proofs to another course, but we will demonstrate the facts by examples. Note that all of the facts hold for all scatter plots, whether or not the regression model is good.

Fact 1. No matter what the shape of the scatter diagram, the residuals have an average of 0.

In all the residual plots above, you have seen the horizontal line at 0 going through the center of the plot. That is a visualization of this fact.

As a numerical example, here is the average of the residuals in the regression of birth weight on gestational days of babies:

np.mean(baby_regression.column('Residual'))

-6.3912532279613784e-15

That doesn't look like 0, but in fact it is a very small number that is 0 apart from rounding error.

Here is the average of the residuals in the regression of the length of dugongs on their age:

dugong_residuals = dugong.column('Length') - fit(dugong, 'Age', 'Length')

np.mean(dugong_residuals)

-2.8783559897689243e-16

Once again the mean of the residuals is 0 apart from rounding error.

This is analogous to the fact that if you take any list of numbers and calculate the list of deviations from average, the average of the deviations is 0.

Fact 2. No matter what the shape of the scatter diagram, the SD of the fitted values is $|r|$ times the SD of the observed values of $y$.

To understand this result, it is worth noting that the fitted values are all on the regression line, whereas the observed values of $y$ are the heights of all the points in the scatter plot and are more variable.


scatter_fit(baby, 'Gestational Days', 'Birth Weight')

The fitted values of birth weight are, therefore, less variable than the observed values of birth weight. But how much less variable?

The scatter plot shows a correlation of about 0.4:

correlation(baby, 'Gestational Days', 'Birth Weight')

0.40754279338885108

Here is the ratio of the SD of the fitted values and the SD of the observed values of birth weight:

fitted_birth_weight = baby_regression.column('Fitted Value')

observed_birth_weight = baby_regression.column('Birth Weight')

np.std(fitted_birth_weight) / np.std(observed_birth_weight)

0.40754279338885108

The ratio is equal to $r$. Thus the SD of the fitted values is $r$ times the SD of the observed values of $y$.

The example of fuel efficiency and acceleration is one where the correlation is negative:


correlation(hybrid, 'acceleration', 'mpg')

-0.5060703843771186

fitted_mpg = fit(hybrid, 'acceleration', 'mpg')

observed_mpg = hybrid.column('mpg')

np.std(fitted_mpg) / np.std(observed_mpg)

0.5060703843771186

The ratio of the two SDs is $|r|$. Thus the SD of the fitted values is $|r|$ times the SD of the observed values of fuel efficiency.

The relation says that the closer the scatter diagram is to a straight line, the closer the fitted values will be to the observed values of $y$, and therefore the closer the variance of the fitted values will be to the variance of the observed values.

Therefore, the closer the scatter diagram is to a straight line, the closer $|r|$ will be to 1, and the closer $r$ will be to 1 or -1.

Fact 3. No matter what the shape of the scatter diagram, the SD of the residuals is $\sqrt{1-r^2}$ times the SD of the observed values of $y$. In other words, the root mean squared error of regression is $\sqrt{1-r^2}$ times the SD of $y$.

This fact gives a measure of the accuracy of the regression estimate.
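All three facts can be checked numerically on any data set; here is a sketch on made-up data, with np.polyfit standing in for the book's fit function:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 3 * x + rng.normal(0, 4, 500)     # any scatter of points would do

slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted
r = np.corrcoef(x, y)[0, 1]

fact1 = np.mean(residuals)              # Fact 1: equals 0
fact2 = np.std(fitted) / np.std(y)      # Fact 2: equals |r|
fact3 = np.std(residuals) / np.std(y)   # Fact 3: equals sqrt(1 - r^2)
```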

Again, we will demonstrate this by example. In the case of gestational days and birth weights, $r$ is about 0.4 and $\sqrt{1-r^2}$ is about 0.91:

r = correlation(baby, 'Gestational Days', 'Birth Weight')

np.sqrt(1 - r**2)

0.91318611003278638


residuals = baby_regression.column('Residual')

observed_birth_weight = baby_regression.column('Birth Weight')

np.std(residuals) / np.std(observed_birth_weight)

0.9131861100327866

Thus the SD of the residuals is $\sqrt{1-r^2}$ times the SD of the observed birth weights.

It will come as no surprise that the same relation is true for the regression of fuel efficiency on acceleration:

r = correlation(hybrid, 'acceleration', 'mpg')

np.sqrt(1 - r**2)

0.862492183185677

residuals = hybrid.column('mpg') - fitted_mpg

observed_mpg = hybrid.column('mpg')

np.std(residuals) / np.std(observed_mpg)

0.862492183185677

Let us examine the extreme cases of the general fact that the SD of the residuals is $\sqrt{1-r^2}$ times the SD of the observed values of $y$. Remember that the residuals always average out to 0.

If $r = 1$ or $r = -1$, then $\sqrt{1-r^2} = 0$. Therefore the residuals have an average of 0 and an SD of 0 as well; therefore, the residuals are all equal to 0. This is consistent with the observation that if $r = \pm 1$, the scatter plot is a perfect straight line and there is no error in the regression estimate.

If $r = 0$, then $\sqrt{1-r^2} = 1$ and the SD of the regression estimate is equal to the SD of $y$. This is consistent with the observation that if $r = 0$ then the regression line is a flat line at the average of $y$, and the root mean squared error of regression is the root mean squared deviation from the average of $y$, which is the SD of $y$.


For every other value of $r$, the rough size of the error of the regression estimate is somewhere between 0 and the SD of $y$.


Chapter 5

Probability


Conditional Probability

Inspired by Peter Norvig's A Concrete Introduction to Probability

Probability is the study of the observations that are generated by sampling from a known distribution. The statistical methods we have studied so far allow us to reason about the world from data. Probability allows us to reason in the opposite direction: what observations result from known facts about the world.

In practice, we rarely know precisely how random processes in the world work. Even the simplest random process, such as flipping a coin, is not as simple as one might assume. Nonetheless, our ability to reason about what data will result from a known random process is useful in many aspects of data science. Fundamentally, the rules of probability allow us to reason about the consequences of assumptions we make.

Probability Distributions

Probability uses the following vocabulary to describe the relationship between known distributions and their observed outcomes.

- Experiment: An occurrence with an uncertain outcome that we can observe. For example, the result of rolling two dice in sequence.
- Outcome: The result of an experiment; one particular state of the world. For example: the first die comes up 4 and the second comes up 2. This outcome could be summarized as (4, 2).
- Sample Space: The set of all possible outcomes for the experiment. For example, (1, 1), (1, 2), (1, 3), ..., (1, 6), (2, 1), (2, 2), ... (6, 4), (6, 5), (6, 6) are all possible outcomes of rolling two dice in sequence. There are 6 * 6 = 36 different outcomes.
- Event: A subset of possible outcomes that together have some property we are interested in. For example, the event "the two dice sum to 5" is the set of outcomes (1, 4), (2, 3), (3, 2), (4, 1).
- Probability: The proportion of experiments for which the event occurs. For example, the probability that the two dice sum to 5 is 4/36, or 1/9.
- Distribution: The probability of all events.
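These definitions translate directly into code; a sketch in plain Python, enumerating the 36-outcome sample space for two dice rolled in sequence:

```python
from fractions import Fraction

# Sample space: all 36 outcomes of rolling two dice in sequence.
sample_space = [(first, second) for first in range(1, 7)
                                for second in range(1, 7)]

# Event: the subset of outcomes where the two dice sum to 5.
sums_to_5 = [o for o in sample_space if o[0] + o[1] == 5]

# Probability: the proportion of the sample space in the event.
chance = Fraction(len(sums_to_5), len(sample_space))   # 4/36 = 1/9
```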


The important part of this terminology is that the sample space is the set of all outcomes, and an event is a subset of the sample space. For an outcome that is an element of an event, we say that it is an outcome for which that event occurs.

Multinomial Distributions

There are many kinds of distributions, but we will focus on the case where the sample space is a fixed, finite set of mutually exclusive outcomes, called a multinomial distribution.

For example, either the two dice come up (2, 1) or they come up (4, 5) in a single roll (but both cannot occur), and there are exactly 36 outcome possibilities that each occur with equal chance.

two_dice = Table(['First', 'Second', 'Chance'])

for first in np.arange(1, 7):
    for second in np.arange(1, 7):
        two_dice.append([first, second, 1/36])

two_dice.set_format('Chance', PercentFormatter(1))

First  Second  Chance
1  1  2.8%
1  2  2.8%
1  3  2.8%
1  4  2.8%
1  5  2.8%
1  6  2.8%
2  1  2.8%
2  2  2.8%
2  3  2.8%
2  4  2.8%
... (26 rows omitted)

The outcomes are not always equally likely. In the example of two dice rolled in sequence, there are 36 different outcomes that each have a 1/36 chance of occurring. If two dice are rolled together and only the sum of the result is observed, instead there are 11 outcomes that have various different chances.


two_dice_sums = Table(['Sum', 'Chance']).with_rows([
    [2, 1/36], [3, 2/36], [4, 3/36], [5, 4/36], [6, 5/36], [7, 6/36],
    [12, 1/36], [11, 2/36], [10, 3/36], [9, 4/36], [8, 5/36],
]).sort(0)

two_dice_sums.set_format('Chance', PercentFormatter(1))

Sum  Chance
2  2.8%
3  5.6%
4  8.3%
5  11.1%
6  13.9%
7  16.7%
8  13.9%
9  11.1%
10  8.3%
11  5.6%
... (1 row omitted)

Outcomes and Events

These two different dice distributions are closely related to each other. Let's see how.

In a multinomial distribution, each outcome alone is an event with its own chance (or probability). The chances of all outcomes sum to 1. The probability of any event is the sum of the probabilities of the outcomes in that event. Therefore, by writing down the chance of each outcome, we implicitly describe the probability of every event for the whole distribution.

Let's consider the event that the sum of the dice in two sequential rolls is 5. We can represent that event as the matching rows in the two_dice table.

dice_sums = two_dice.column('First') + two_dice.column('Second')

sum_of_5 = two_dice.where(dice_sums == 5)

sum_of_5


First  Second  Chance
1  4  2.8%
2  3  2.8%
3  2  2.8%
4  1  2.8%

The probability of the event is the sum of the resulting chances.

sum(sum_of_5.column('Chance'))

0.1111111111111111

If we consistently store the probabilities of individual outcomes in a column called Chance, then we can define the probability of any event using the P function.

def P(event):
    return sum(event.column('Chance'))

P(sum_of_5)

0.1111111111111111

One distribution's events can be another distribution's outcomes, as is the case with the two_dice and two_dice_sums distributions. There are 11 different possible sums in the two_dice table, and these are 11 different mutually exclusive events.

with_sums = two_dice.with_column('Sum', dice_sums)

with_sums


First  Second  Chance  Sum
1  1  2.8%  2
1  2  2.8%  3
1  3  2.8%  4
1  4  2.8%  5
1  5  2.8%  6
1  6  2.8%  7
2  1  2.8%  3
2  2  2.8%  4
2  3  2.8%  5
2  4  2.8%  6
... (26 rows omitted)

Grouping these outcomes by their sum shows how many different outcomes appear in each event.

with_sums.group('Sum')

Sum  count
2  1
3  2
4  3
5  4
6  5
7  6
8  5
9  4
10  3
11  2
... (1 row omitted)

Finally, the chance of each sum event is the total chance of all outcomes that match the sum event. In this way, we can generate the two_dice_sums distribution via addition.


grouped = with_sums.select(['Sum', 'Chance']).group('Sum', sum)

grouped.relabeled(1, 'Chance').set_format('Chance', PercentFormatter(1))

Sum  Chance
2  2.8%
3  5.6%
4  8.3%
5  11.1%
6  13.9%
7  16.7%
8  13.9%
9  11.1%
10  8.3%
11  5.6%
... (1 row omitted)

Both distributions will assign the same probability to certain events. For example, the probability of the event that the sum of the dice is 8 can be computed from either distribution.

P(with_sums.where('Sum', 8))

0.1388888888888889

P(two_dice_sums.where('Sum', 8))

0.1388888888888889

U.S. Birth Times

Here's an example based on data collected in the world.


More babies in the United States are born in the morning than at night. In 2013, the Centers for Disease Control measured the following proportions of babies for each hour of the day. The Time column is a description of each hour, and the Hour column is its numeric equivalent.

birth = Table.read_table('birth_time.csv').select(['Time', 'Hour', 'Chance'])

birth.set_format('Chance', PercentFormatter(1)).show()


Time  Hour  Chance
6 a.m.  6  2.9%
7 a.m.  7  4.5%
8 a.m.  8  6.3%
9 a.m.  9  5.0%
10 a.m.  10  5.0%
11 a.m.  11  5.0%
Noon  12  6.0%
1 p.m.  13  5.7%
2 p.m.  14  5.1%
3 p.m.  15  4.8%
4 p.m.  16  4.9%
5 p.m.  17  5.0%
6 p.m.  18  4.5%
7 p.m.  19  4.0%
8 p.m.  20  4.0%
9 p.m.  21  3.7%
10 p.m.  22  3.5%
11 p.m.  23  3.3%
Midnight  0  2.9%
1 a.m.  1  2.9%
2 a.m.  2  2.8%
3 a.m.  3  2.7%
4 a.m.  4  2.7%
5 a.m.  5  2.8%

If we assume that this distribution describes the world correctly, what is the chance that a baby will be born between 8 a.m. and 6 p.m.? It turns out that more than half of all babies are born during "business hours", even though this time interval only covers 5/12 of the day.

business_hours = birth.where('Hour', are.between(8, 18))

business_hours


Time  Hour  Chance
8 a.m.  8  6.3%
9 a.m.  9  5.0%
10 a.m.  10  5.0%
11 a.m.  11  5.0%
Noon  12  6.0%
1 p.m.  13  5.7%
2 p.m.  14  5.1%
3 p.m.  15  4.8%
4 p.m.  16  4.9%
5 p.m.  17  5.0%

P(business_hours)

0.52800000000000002

On the other hand, births late at night are uncommon. The chance of having a baby between midnight and 6 a.m. is much less than 25%, even though this time interval covers 1/4 of the day.

P(birth.where('Hour', are.between(0, 6)))

0.16799999999999998

Conditional Distributions

A common way to transform one distribution into another is to condition on an event. Conditioning means assuming that the event occurs. The conditional distribution given an event takes the chances of all outcomes for which the event occurs and scales up their chances so that they sum to 1.

To scale up the chances, we divide the chance of each outcome in the event by the probability of the event. For example, if we condition on the event that the sum of two dice is above 8, we arrive at the following conditional distribution.
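In plain Python the same scaling looks like this (a sketch, with the distribution stored in a dict rather than a Table):

```python
# Distribution of the sum of two dice: chance of each outcome.
sums = {s: sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == s) / 36
        for s in range(2, 13)}

# Condition on the event "sum is above 8": keep those outcomes and
# divide each chance by the probability of the event, so they sum to 1.
event = {s: c for s, c in sums.items() if s > 8}
p_event = sum(event.values())                  # 10/36
given_above_8 = {s: c / p_event for s, c in event.items()}
# given_above_8 assigns 40%, 30%, 20%, 10% to sums 9, 10, 11, 12.
```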


above_8 = two_dice_sums.where('Sum', are.above(8))

given_8 = above_8.with_column('Chance', above_8.column('Chance') / P(above_8))

given_8

Sum  Chance
9  40.0%
10  30.0%
11  20.0%
12  10.0%

A conditional distribution describes the chances under the original distribution, updated to reflect a new piece of information. In this case, we see the chances under the original two_dice_sums distribution, given that the sum is above 8.

The conditioning operation can be expressed by the given function, which takes an event table and returns the corresponding conditional distribution.

def given(event):
    return event.with_column('Chance', event.column('Chance') / P(event))

given(two_dice_sums.where('Sum', are.above(8)))

Sum  Chance
9  40.0%
10  30.0%
11  20.0%
12  10.0%

Given that a baby is born in the US during business hours (between 8 a.m. and 6 p.m.), what are the chances that it is born before noon? To answer this question, we first form the conditional distribution given business hours, then compute the probability of the event that the baby is born before noon.

given(business_hours)


Time  Hour  Chance
8 a.m.  8  11.9%
9 a.m.  9  9.5%
10 a.m.  10  9.5%
11 a.m.  11  9.5%
Noon  12  11.4%
1 p.m.  13  10.8%
2 p.m.  14  9.7%
3 p.m.  15  9.1%
4 p.m.  16  9.3%
5 p.m.  17  9.5%

P(given(business_hours).where('Hour', are.below(12)))

0.40340909090909094

This result is not the same as the chance that a baby is born between 8 a.m. and noon in general.

morning = birth.where('Hour', are.between(8, 12))

P(morning)

0.21300000000000002

Nor is it the chance that a baby is born before noon in general.

P(birth.where('Hour', are.below(12)))

0.45500000000000007

Instead, it is the probability that both the event and the conditioning event are true (i.e., that the birth is in the morning), divided by the probability of the conditioning event (i.e., that the birth is during business hours).


P(morning) / P(business_hours)

0.40340909090909094

The morning event is a joint event that can be described as both during business hours and before noon. A joint event is just an event, but one that is described by the intersection of two other events.

business_hours.where('Hour', are.below(12))

Time  Hour  Chance
8 a.m.  8  6.3%
9 a.m.  9  5.0%
10 a.m.  10  5.0%
11 a.m.  11  5.0%

In general, the probability of an event B (e.g., before noon) conditioned on an event A (e.g., during business hours) is the probability of the joint event of A and B divided by the probability of the conditioning event A.

P(business_hours.where('Hour', are.below(12))) / P(business_hours)

0.40340909090909094

The standard notation for this relationship uses a vertical bar for conditioning and a comma for a joint event:

$$P(B \mid A) = \frac{P(A, B)}{P(A)}$$

Joint Distributions

A joint event is one in which two different events both occur. Similarly, a joint outcome is an outcome that is composed of two different outcomes. For example, the outcomes of the two_dice distribution are each joint outcomes of the first and second dice rolls.


two_dice

First  Second  Chance
1  1  2.8%
1  2  2.8%
1  3  2.8%
1  4  2.8%
1  5  2.8%
1  6  2.8%
2  1  2.8%
2  2  2.8%
2  3  2.8%
2  4  2.8%
... (26 rows omitted)

Another common way to transform distributions is to form a joint distribution from conditional distributions. The definition of a conditional probability provides a formula for computing a joint probability, simply by multiplying both sides of the equation above by $P(A)$:

$$P(A, B) = P(B \mid A) \, P(A)$$

The chance of each outcome of a joint distribution can be computed from the probability of the first outcome and the conditional probability of the second outcome given the first outcome. In the two_dice table above, the chance of any first outcome is 1/6, and the chance of any second outcome given any first outcome is 1/6, so the chance of any joint outcome is 1/36, or 2.8%.
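As a sketch in plain Python, multiplying the chance of the first roll by the conditional chance of the second roll given the first reproduces the two_dice chances:

```python
from fractions import Fraction

p_first = {d: Fraction(1, 6) for d in range(1, 7)}
# For fair dice, the second roll does not depend on the first,
# so every conditional chance P(second | first) is also 1/6.
p_second_given_first = {d: Fraction(1, 6) for d in range(1, 7)}

# Joint chance of each outcome: P(first) * P(second | first).
joint = {(a, b): p_first[a] * p_second_given_first[b]
         for a in p_first for b in p_second_given_first}
```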

U.S. Birth Days

The CDC also published that in 2013, 78.25% of babies were born on weekdays (Monday through Friday) and 21.75% of babies were born on weekends (Saturday and Sunday). Notably, this distribution is not 5/7 (71.4%) to 2/7 (28.6%), but instead favors weekdays.

Moreover, the CDC published the conditional distributions of 2013 birth times, given a weekday birth (green) and given a weekend birth (gray). The black line is the distribution of all births.


All of the proportions used to generate this chart appear in the birth_day table.

birth_day = Table.read_table('birth_time.csv').drop('Chance')

birth_day.set_format([2, 3], PercentFormatter(1))

Time  Hour  Weekday  Weekend
6 a.m.  6  2.7%  3.6%
7 a.m.  7  4.7%  3.8%
8 a.m.  8  6.7%  4.6%
9 a.m.  9  5.1%  5.0%
10 a.m.  10  5.0%  5.0%
11 a.m.  11  5.0%  4.9%
Noon  12  6.3%  5.0%
1 p.m.  13  5.9%  4.7%
2 p.m.  14  5.3%  4.6%
3 p.m.  15  4.9%  4.6%
... (14 rows omitted)

The Weekday and Weekend columns are the chances of two different conditional distributions.


weekday = birth_day.select(['Hour', 'Weekday']).relabeled(1, 'Chance')

weekend = birth_day.select(['Hour', 'Weekend']).relabeled(1, 'Chance')

The joint distribution is created by multiplying each conditional chance column by the corresponding probability of the conditioning event. In this case, we multiply weekday chances by the chance that the baby was born on a weekday (78.25%); likewise, we multiply weekend chances by the chance that the baby was born on a weekend (21.75%).

birth_joint = Table(['Day', 'Hour', 'Chance'])

for row in weekday.rows:
    birth_joint.append(['Weekday', row.item('Hour'), row.item('Chance') * 0.7825])

for row in weekend.rows:
    birth_joint.append(['Weekend', row.item('Hour'), row.item('Chance') * 0.2175])

birth_joint.set_format('Chance', PercentFormatter(1))

Day  Hour  Chance
Weekday  6  2.1%
Weekday  7  3.7%
Weekday  8  5.2%
Weekday  9  4.0%
Weekday  10  3.9%
Weekday  11  3.9%
Weekday  12  4.9%
Weekday  13  4.6%
Weekday  14  4.1%
Weekday  15  3.8%
... (38 rows omitted)

This joint distribution has 48 joint outcomes, and the total chance sums to 1.

P(birth_joint)

1.0


Now, we can compute the probability of any event that involves days and hours. For example, the chance of a weekday morning birth is quite high.

P(birth_joint.where('Day', 'Weekday').where('Hour', are.between(8, 12)))

0.17058499999999999

We have seen that less than 22% of babies are born on weekends. However, if we know that a baby was born between 5 a.m. and 6 a.m., then the chances that it is a weekend baby are quite a bit higher than 22%.

early_morning = birth_joint.where('Hour', 5)

early_morning

Day  Hour  Chance
Weekday  5  2.0%
Weekend  5  0.8%

P(given(early_morning).where('Day', 'Weekend'))

0.28584466551063248

Bayes' Rule

In the final example above, we began with a distribution over weekday and weekend births, $P(\text{day})$, along with conditional distributions over hours, $P(\text{hour} \mid \text{day})$. We were able to compute the probability of a weekend birth conditioned on an hour, an outcome in the distribution $P(\text{day} \mid \text{hour})$. Bayes' rule is the formula that expresses the relationship between all of these quantities:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$


The numerator on the right-hand side is just the joint probability of A and B. Bayes' rule writes this probability in its expanded form because so often we are given these two components and must form the joint distribution through multiplication.

The denominator on the right-hand side is the probability of an event that can be computed from the joint distribution. For example, the early_morning event in the example above is an event just about hours, but it includes all joint outcomes where the correct hour occurs.

Each probability in this equation has a name. The names are derived from the most typical application of Bayes' rule, which is to update one's beliefs based on new evidence.

- $P(A)$ is called the prior; it is the probability of event A before any evidence is observed.
- $P(B \mid A)$ is called the likelihood; it is the conditional probability of the evidence event B given the event A.
- $P(B)$ is called the evidence; it is the probability of the evidence event B for any outcome.
- $P(A \mid B)$ is called the posterior; it is the probability of event A after evidence event B is observed.

Depending on the joint distribution of A and B, observing some B can make A more or less likely.

Diagnostic Example

In a population, there is a rare disease. Researchers have developed a medical test for the disease. Mostly, the test correctly identifies whether or not the tested person has the disease. But sometimes, the test is wrong. Here are the relevant proportions.

- 1% of the population has the disease.
- If a person has the disease, the test returns the correct result with chance 99%.
- If a person does not have the disease, the test returns the correct result with chance 99.5%.

One person is picked at random from the population. Given that the person tests positive, what is the chance that the person has the disease?

We begin by partitioning the population into four categories in the tree diagram below.


By Bayes' Rule, the chance that the person has the disease given that he or she has tested positive is the chance of the top "Test Positive" branch relative to the total chance of the two "Test Positive" branches. The answer is:

# The person is picked at random from the population.
# By Bayes' Rule:
# Chance that the person has the disease, given that test was +
(0.01 * 0.99) / (0.01 * 0.99 + 0.99 * 0.005)

0.6666666666666666

This is 2/3, and it seems rather small. The test has very high accuracy, 99% or higher. Yet is our answer saying that if a patient gets tested and the test result is positive, there is only a 2/3 chance that he or she has the disease?

To understand our answer, it is important to recall the chance model: our calculation is valid for a person picked at random from the population. Among all the people in the population, the people who test positive split into two groups: those who have the disease and test positive, and those who don't have the disease and test positive. The latter group is called the group of false positives. The proportion of true positives is twice as high as that of the false positives – 0.0099 compared to 0.00495 – and hence the chance of a true positive given a positive test result is 2/3. The chance is affected both by the accuracy of the test and also by the probability that the sampled person has the disease.
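The same update can be packaged as a small plain-Python function (a sketch, not the book's code), using the proportions listed above:

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses.

    prior:      {hypothesis: P(A)}
    likelihood: {hypothesis: P(evidence | A)} for the observed evidence
    Returns     {hypothesis: P(A | evidence)}.
    """
    joint = {a: prior[a] * likelihood[a] for a in prior}   # P(A, B)
    evidence = sum(joint.values())                         # P(B)
    return {a: joint[a] / evidence for a in joint}

prior = {'Diseased': 0.01, 'Not Diseased': 0.99}
test_positive = {'Diseased': 0.99, 'Not Diseased': 0.005}  # P(+ | A)

after_positive = posterior(prior, test_positive)
# after_positive['Diseased'] is 2/3, matching the tree-diagram answer.
```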

The same result can be computed using a table. Below, we begin with the joint distribution over true health and test results.

rare = Table(['Health', 'Test', 'Chance']).with_rows([
    ['Diseased', 'Positive', 0.01 * 0.99],
    ['Diseased', 'Negative', 0.01 * 0.01],
    ['Not Diseased', 'Positive', 0.99 * 0.005],
    ['Not Diseased', 'Negative', 0.99 * 0.995]
])

rare

Health        Test      Chance
Diseased      Positive  0.0099
Diseased      Negative  0.0001
Not Diseased  Positive  0.00495
Not Diseased  Negative  0.98505

The chance that a person selected at random is diseased, given that they tested positive, is computed from the following expression.

positive = rare.where('Test', 'Positive')

P(given(positive).where('Health', 'Diseased'))

0.66666666666666663
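The same conditioning can be sketched without the table helpers, using a plain Python dictionary for the joint distribution: restrict to the positive outcomes, then take the diseased chance relative to the total positive chance. Names here are illustrative only.

```python
# Joint distribution over (health, test result), as in the table above.
joint = {
    ('Diseased', 'Positive'): 0.01 * 0.99,
    ('Diseased', 'Negative'): 0.01 * 0.01,
    ('Not Diseased', 'Positive'): 0.99 * 0.005,
    ('Not Diseased', 'Negative'): 0.99 * 0.995,
}

# Keep only the outcomes where the test is positive ...
positive = {k: v for k, v in joint.items() if k[1] == 'Positive'}

# ... and take the diseased chance relative to the total positive chance.
p_disease_given_positive = (
    positive[('Diseased', 'Positive')] / sum(positive.values())
)
print(p_disease_given_positive)
```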

But a patient who goes to get tested for a disease is not well modeled as a random member of the population. People get tested because they think they might have the disease, or because their doctor thinks so. In such a case, saying that their chance of having the disease is 1% is not justified; they are not picked at random from the population.

So, while our answer is correct for a random member of the population, it does not answer the question for a person who has walked into a doctor's office to be tested.


To answer the question for such a person, we must first ask ourselves what is the probability that the person has the disease. It is natural to think that it is larger than 1%, as the person has some reason to believe that he or she might have the disease. But how much larger?

This cannot be decided based on relative frequencies. The probability that a particular individual has the disease has to be based on a subjective opinion, and is therefore called a subjective probability. Some researchers insist that all probabilities must be relative frequencies, but subjective probabilities abound. The chance that a candidate wins the next election, the chance that a big earthquake will hit the Bay Area in the next decade, the chance that a particular country wins the next soccer World Cup: none of these are based on relative frequencies or long-run frequencies. Each one contains a subjective element.

It is fine to work with subjective probabilities as long as you keep in mind that there will be a subjective element in your answer. Be aware also that different people can have different subjective probabilities of the same event. For example, the patient's subjective probability that he or she has the disease could be quite different from the doctor's subjective probability of the same event. Here we will work from the patient's point of view.

Suppose the patient assigned a number to his/her degree of uncertainty about whether he/she had the disease, before seeing the test result. This number is the patient's subjective prior probability of having the disease.

If that probability were 10%, then the probabilities on the left side of the tree diagram would change accordingly, with the 0.1 and 0.9 now interpreted as subjective probabilities:

The change has a noticeable effect on the answer, as you can see by running the cell below.


# Subjective prior probability of 10% that the person has the disease

# By Bayes' Rule:
# Chance that the person has the disease, given that the test was +
(0.1 * 0.99) / (0.1 * 0.99 + 0.9 * 0.005)

0.9565217391304347

If the patient's prior probability of having the disease is 10%, then after a positive test result the patient must update that probability to over 95%. This updated probability is called a posterior probability. It is calculated after learning the test result.

If the patient's prior probability of having the disease is 50%, then the result changes yet again.


# Subjective prior probability of 50% that the person has the disease

# By Bayes' Rule:
# Chance that the person has the disease, given that the test was +
(0.5 * 0.99) / (0.5 * 0.99 + 0.5 * 0.005)

0.9949748743718593

Starting out with a 50-50 subjective chance of having the disease, the patient must update that chance to about 99.5% after getting a positive test result.

Computational Note. In the calculation above, the factor of 0.5 is common to all the terms and cancels out. Hence arithmetically it is the same as a calculation where the prior probabilities are apparently missing:

0.99 / (0.99 + 0.005)

0.9949748743718593

But in fact, they are not missing. They have just canceled out. When the prior probabilities are not all equal, then they are all visible in the calculation, as we have seen earlier.
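The dependence of the answer on the prior is easy to see if the Bayes' Rule calculation is written as a function of the prior; the sketch below is a plain Python function (not part of the book's toolkit) evaluated at the three priors used in this section.

```python
def posterior(prior):
    """Chance of disease given a positive test, by Bayes' Rule,
    for a given prior chance of having the disease."""
    true_positive = prior * 0.99            # diseased, test correct
    false_positive = (1 - prior) * 0.005    # not diseased, test errs
    return true_positive / (true_positive + false_positive)

# Random member of the population, then the 10% and 50-50 priors.
for prior in [0.01, 0.1, 0.5]:
    print(prior, posterior(prior))
```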

In machine learning applications such as spam detection, Bayes' Rule is used to update probabilities of messages being spam, based on new messages being labeled Spam or Not Spam. You will need more advanced mathematics to carry out all the calculations. But the fundamental method is the same as what you have seen in this section.
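As an illustration only, the same one-step update can be sketched for spam: all of the numbers below are invented for the example, not taken from any real data. Suppose 20% of messages are spam, a particular word appears in 10% of spam messages, and in 0.1% of non-spam messages.

```python
# Invented numbers, for illustration only.
prior_spam = 0.20            # assumed chance a message is spam
p_word_given_spam = 0.10     # assumed chance the word appears in spam
p_word_given_not_spam = 0.001  # assumed chance it appears otherwise

# By Bayes' Rule: posterior chance that a message containing
# the word is spam.
posterior_spam = (prior_spam * p_word_given_spam) / (
    prior_spam * p_word_given_spam
    + (1 - prior_spam) * p_word_given_not_spam
)
print(posterior_spam)
```

Just as with the disease test, a modest prior (20%) combined with the evidence of the word pushes the posterior above 96%.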

