Computational and Inferential Thinking


Table of Contents

0 Introduction
1 Data Science
1.1 Introduction
1.1.1 Computational Tools
1.1.2 Statistical Techniques
1.2 Why Data Science?
1.2.1 Example: Plotting the Classics
1.3 Causality and Experiments
1.3.1 Observation and Visualization: John Snow and the Broad Street Pump
1.3.2 Snow’s “Grand Experiment”
1.3.3 Establishing Causality
1.3.4 Randomization
1.3.5 Endnote
1.4 Expressions
1.4.1 Names
1.4.2 Example: Growth Rates
1.4.3 Call Expressions
1.4.4 Data Types
1.4.5 Comparisons
1.5 Sequences
1.5.1 Arrays
1.5.2 Ranges
1.6 Tables
1.6.1 Functions
1.6.2 Functions and Tables
2 Visualization
2.1 Bar Charts
2.2 Histograms
3 Sampling
3.1 Sampling
3.2 Iteration
3.3 Estimation
3.4 Center
3.5 Spread
3.6 The Normal Distribution
3.7 Exploration: Privacy
4 Prediction
4.1 Correlation
4.2 Regression
4.3 Prediction
4.4 Higher-Order Functions
4.5 Errors
4.6 Multiple Regression
4.7 Classification
4.8 Features
5 Inference
5.1 Confidence Intervals
5.2 Distance Between Distributions
5.3 Hypothesis Testing
5.4 Permutation
5.5 A/B Testing
5.6 Regression Inference
5.7 Regression Diagnostics
6 Probability
6.1 Conditional Probability

Computational and Inferential Thinking

The Foundations of Data Science

By Ani Adhikari and John DeNero

Contributions by David Wagner

This is the textbook for the Foundations of Data Science class at UC Berkeley.

View this textbook online on Gitbooks.

What is Data Science

Data Science is about drawing useful conclusions from large and diverse data sets through exploration, prediction, and inference. Exploration involves identifying patterns in information. Prediction involves using information we know to make informed guesses about values we wish we knew. Inference involves quantifying our degree of certainty: will those patterns we found also appear in new observations? How accurate are our predictions? Our primary tools for exploration are visualizations, for prediction are machine learning and optimization, and for inference are statistical tests and models.

Statistics is a central component of data science because statistics studies how to make robust conclusions with incomplete information. Computing is a central component because programming allows us to apply analysis techniques to the large and diverse data sets that arise in real-world applications: not just numbers, but text, images, videos, and sensor readings. Data science is all of these things, but it is more than the sum of its parts because of the applications. Through understanding a particular domain, data scientists learn to ask appropriate questions about their data and correctly interpret the answers provided by our inferential and computational tools.

Chapter 1: Introduction

Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference. This text develops a consistent approach to all three, introducing statistical ideas and fundamental ideas in computer science concurrently. We focus on a minimal set of core techniques that apply to a vast range of real-world applications. A foundation in data science requires not only understanding statistical and computational techniques, but also recognizing how they apply to real scenarios.

For whatever aspect of the world we wish to study---whether it's the Earth's weather, the world's markets, political polls, or the human mind---data we collect typically offer an incomplete description of the subject at hand. The central challenge of data science is to make reliable conclusions using this partial information.

In this endeavor, we will combine two essential tools: computation and randomization. For example, we may want to understand climate change trends using temperature observations. Computers will allow us to use all available information to draw conclusions. Rather than focusing only on the average temperature of a region, we will consider the whole range of temperatures together to construct a more nuanced analysis. Randomness will allow us to consider the many different ways in which incomplete information might be completed. Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a way to imagine many possible scenarios that are all consistent with the data we observe.

Applying this approach requires learning to program a computer, and so this text interleaves a complete introduction to programming that assumes no prior knowledge. Readers with programming experience will find that we cover several topics in computation that do not appear in a typical introductory computer science curriculum. Data science also requires careful reasoning about quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, techniques are described to readers in the same language in which they are described to the computers that execute them---a programming language.

Computational Tools

This text uses the Python 3 programming language, along with a standard set of numerical and data visualization tools that are used widely in commercial applications, scientific experiments, and open-source projects. Python has recruited enthusiasts from many professions that use data to draw conclusions. By learning the Python language, you will join a million-person-strong community of software developers and data scientists.

Getting Started. The easiest and recommended way to start writing programs in Python is to log into the companion site for this text, data8.berkeley.edu. If you are enrolled in a course offering that uses this text, you should have full access to the programming environment hosted on that site. The first time you log in, you will find a brief tutorial describing how to proceed.

You are not at all restricted to using this web-based programming environment. A Python program can be executed by any computer, regardless of its manufacturer or operating system, provided that support for the language is installed. If you wish to install the version of Python and its accompanying libraries that will match this text, we recommend the Anaconda distribution that packages together the Python 3 language interpreter, IPython libraries, and the Jupyter notebook environment.

This text includes a complete introduction to all of these computational tools. You will learn to write programs, generate images from data, and work with real-world data sets that are published online.

Statistical Techniques

The discipline of statistics has long addressed the same fundamental challenge as data science: how to draw conclusions about the world using incomplete information. One of the most important contributions of statistics is a consistent and precise vocabulary for describing the relationship between observations and conclusions. This text continues in the same tradition, focusing on a set of core inferential problems from statistics: testing hypotheses, estimating confidence, and predicting unknown quantities.

Data science extends the field of statistics by taking full advantage of computing, data visualization, and access to information. The combination of fast computers and the Internet gives anyone the ability to access and analyze vast data sets: millions of news articles, full encyclopedias, databases for any domain, and massive repositories of music, photos, and video.

Applications to real data sets motivate the statistical techniques that we describe throughout the text. Real data often do not follow regular patterns or match standard equations. The interesting variation in real data can be lost by focusing too much attention on simplistic summaries such as average values. Computers enable a family of methods based on resampling that apply to a wide range of different inference problems, take into account all available information, and require few assumptions or conditions. Although these techniques have often been reserved for graduate courses in statistics, their flexibility and simplicity are a natural fit for data science applications.

Why Data Science?

Most important decisions are made with only partial information and uncertain outcomes. However, the degree of uncertainty for many decisions can be reduced sharply by public access to large data sets and the computational tools required to analyze them effectively. Data-driven decision making has already transformed a tremendous breadth of industries, including finance, advertising, manufacturing, and real estate. At the same time, a wide range of academic disciplines are evolving rapidly to incorporate large-scale data analysis into their theory and practice.

Studying data science enables individuals to bring these techniques to bear on their work, their scientific endeavors, and their personal decisions. Critical thinking has long been a hallmark of a rigorous education, but critiques are often most effective when supported by data. A critical analysis of any aspect of the world, be it business or social science, involves inductive reasoning---conclusions can rarely be proven outright, only supported by the available evidence. Data science provides the means to make precise, reliable, and quantitative arguments about any set of observations. With unprecedented access to information and computing, critical thinking about any aspect of the world that can be measured would be incomplete without the inferential techniques that are core to data science.

The world has too many unanswered questions and difficult challenges to leave this critical reasoning to only a few specialists. However, all educated adults can build the capacity to reason about data. The tools, techniques, and data sets are all readily available; this text aims to make them accessible to everyone.

Example: Plotting the Classics

In this example, we will explore statistics for two classic novels: The Adventures of Huckleberry Finn by Mark Twain, and Little Women by Louisa May Alcott. The text of any book can be read by a computer at great speed. Books published before 1923 are currently in the public domain, meaning that everyone has the right to copy or use the text in any way. Project Gutenberg is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.

The features of Python used in this example will be explained in detail later in the course. This example is meant to illustrate some of the broad themes of this text. Don't worry if the details of the program don't yet make sense. Instead, focus on interpreting the images generated below. The "Expressions" section later in this chapter will describe most of the features of the Python programming language used below.

First, we read the text of both books into lists of chapters, called huck_finn_chapters and little_women_chapters. In Python, a name cannot contain any spaces, and so we will often use an underscore _ to stand in for a space. The = in the lines below gives a name on the left to the result of some computation described on the right. A uniform resource locator or URL is an address on the Internet for some content; in this case, the text of a book.

# Read two books, fast!
huck_finn_url = 'http://www.gutenberg.org/cache/epub/76/pg76.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

little_women_url = 'http://www.gutenberg.org/cache/epub/514/pg514.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]

While a computer cannot understand the text of a book, we can still use it to provide us with some insight into the structure of the text. The name huck_finn_chapters is currently bound to a list of all the chapters in the book. We can place those chapters into a table to see how each begins.
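The chapter slicing above relies on how Python's str.split works: splitting on a marker yields a list whose first element is everything before the marker's first occurrence. A minimal sketch on a made-up string (toy_text is not from either book):

```python
# Splitting on a chapter marker: the first piece is the front matter,
# so it is dropped with a slice, as in the book-loading code above.
toy_text = "Front matter CHAPTER I. It begins. CHAPTER II. It continues."
parts = toy_text.split('CHAPTER ')
print(parts[0])    # 'Front matter '
print(parts[1:])   # ['I. It begins. ', 'II. It continues.']
```

The larger slice [44:] used for Huckleberry Finn, rather than [1:], presumably skips earlier occurrences of the marker in the Gutenberg file itself; the exact index depends on that file's layout.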


Table([huck_finn_chapters], ['Chapters'])

Chapters
I. YOU don't know about me without you have read a book ...
II. WE went tiptoeing along a path amongst the trees bac ...
III. WELL, I got a good going-over in the morning from o ...
IV. WELL, three or four months run along, and it was wel ...
V. I had shut the door to. Then I turned around and ther ...
VI. WELL, pretty soon the old man was up and around agai ...
VII. "GIT up! What you 'bout?" I opened my eyes and look ...
VIII. THE sun was up so high when I waked that I judged ...
IX. I wanted to go and look at a place right about the m ...
X. AFTER breakfast I wanted to talk about the dead man a ...
... (33 rows omitted)

Each chapter begins with a chapter number in Roman numerals, followed by the first sentence of the chapter. Project Gutenberg has printed the first word of each chapter in uppercase.

The book describes a journey that Huck and Jim take along the Mississippi River. Tom Sawyer joins them towards the end as the action heats up. We can quickly visualize how many times these characters have each been mentioned at any point in the book.

counts = Table([np.char.count(huck_finn_chapters, "Jim"),
                np.char.count(huck_finn_chapters, "Tom"),
                np.char.count(huck_finn_chapters, "Huck")],
               ["Jim", "Tom", "Huck"])
counts.cumsum().plot(overlay=True)
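The counting step uses NumPy's vectorized string routines: np.char.count returns, for each string in an array, the number of non-overlapping occurrences of a substring. A small sketch on made-up strings:

```python
import numpy as np

# Count occurrences of "Jim" in each element of an array of strings.
toy_chapters = np.array(["Tom met Jim.", "Jim spoke, and Jim laughed."])
mentions = np.char.count(toy_chapters, "Jim")
print(mentions)  # [1 2]
```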


In the plot above, the horizontal axis shows chapter numbers and the vertical axis shows how many times each character has been mentioned so far. You can see that Jim is a central character by the large number of times his name appears. Notice how Tom is hardly mentioned for much of the book until after Chapter 30. His curve and Jim's rise sharply at that point, as the action involving both of them intensifies. As for Huck, his name hardly appears at all, because he is the narrator.

Little Women is a story of four sisters growing up together during the Civil War. In this book, chapter numbers are spelled out and chapter titles are written in all capital letters.

Table([little_women_chapters], ['Chapters'])

Chapters
ONE PLAYING PILGRIMS "Christmas won't be Christmas witho ...
TWO A MERRY CHRISTMAS Jo was the first to wake in the gr ...
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me ...
FOUR BURDENS "Oh, dear, how hard it does seem to take up ...
FIVE BEING NEIGHBORLY "What in the world are you going t ...
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr ...
SEVEN AMY'S VALLEY OF HUMILIATION "That boy is a perfect ...
EIGHT JO MEETS APOLLYON "Girls, where are you going?" as ...
NINE MEG GOES TO VANITY FAIR "I do think it was the most ...
TEN THE P.C. AND P.O. As spring came on, a new set of am ...
... (37 rows omitted)


We can track the mentions of main characters to learn about the plot of this book as well. The protagonist Jo interacts with her sisters Meg, Beth, and Amy regularly, up until Chapter 27 when she moves to New York alone.

people = ["Meg", "Jo", "Beth", "Amy", "Laurie"]
people_counts = Table([np.char.count(little_women_chapters, pp) for pp in people], people)
people_counts.cumsum().plot(overlay=True)

Laurie is a young man who marries one of the girls in the end. See if you can use the plots to guess which one.

In addition to visualizing data to find patterns, data science is often concerned with whether two similar patterns are really the same or different. In the context of novels, the word "character" has a second meaning: a printed symbol such as a letter or number. Below, we compare the counts of this type of character in the first chapter of Little Women.

from collections import Counter

chapter_one = little_women_chapters[0]
counts = Counter(chapter_one.lower())
letters = Table([counts.keys(), counts.values()], ['Letters', 'Chapter 1 Count'])
letters.sort('Letters').show(20)
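Counter, from Python's standard library, builds a dictionary-like tally of how often each element appears in a sequence; lowercasing first merges upper- and lowercase letters. A minimal sketch on a short made-up phrase:

```python
from collections import Counter

# Tally each character in a made-up phrase, ignoring case.
tallies = Counter("Little Women".lower())
print(tallies['t'])  # the letter 't' appears 2 times
print(tallies['l'])  # 'L' and 'l' merge after lowercasing: 2
```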


Letters    Chapter 1 Count
           4102
!          28
"          190
'          125
,          423
-          21
.          189
;          4
?          16
_          6
a          1352
b          290
c          366
d          788
e          1981
f          339
g          446
h          1069
i          1071
j          51
... (16 rows omitted)

How do these counts compare with the character counts in Chapter 2? By plotting both sets of counts together, we can see slight variations for every character. Is that difference meaningful? How sure can we be? Such questions will be addressed precisely in this text.

counts = Counter(little_women_chapters[1].lower())
two_letters = Table([counts.keys(), counts.values()], ['Letters', 'Chapter 2 Count'])
compare = letters.join('Letters', two_letters)
compare.barh('Letters', overlay=True)


In some cases, the relationships between quantities allow us to make predictions. This text will explore how to make accurate predictions based on incomplete information and develop methods for combining multiple sources of uncertain information to make decisions.

As an example, let us first use the computer to get us some information that would be really tedious to acquire by hand: the number of characters and the number of periods in each chapter of both books.

chlen_per_hf = Table([[len(s) for s in huck_finn_chapters],
                      np.char.count(huck_finn_chapters, '.')],
                     ["HF Chapter Length", "Number of periods"])

chlen_per_lw = Table([[len(s) for s in little_women_chapters],
                      np.char.count(little_women_chapters, '.')],
                     ["LW Chapter Length", "Number of periods"])

Here are the data for Huckleberry Finn. Each row of the table below corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods, in general; the shorter the chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths and can involve other punctuation such as question marks.

chlen_per_hf

HF Chapter Length    Number of periods
7026                 66
11982                117
8529                 72
6799                 84
8166                 91
14550                125
13218                127
22208                249
8081                 71
7036                 70
... (33 rows omitted)


Here are the corresponding data for Little Women.

chlen_per_lw

LW Chapter Length    Number of periods
21759                189
22148                188
20558                231
25526                195
23395                255
14622                140
14431                131
22476                214
33767                337
18508                185
... (37 rows omitted)

You can see that the chapters of Little Women are in general longer than those of Huckleberry Finn. Let us see if these two simple variables – the length and number of periods in each chapter – can tell us anything more about the two authors. One way for us to do this is to plot both sets of data on the same axes.

In the plot below, there is a dot for each chapter in each book. Blue dots correspond to Huckleberry Finn and yellow dots to Little Women. The horizontal axis represents the number of characters and the vertical axis represents the number of periods.

plots.figure(figsize=(10, 10))
plots.scatter([len(s) for s in huck_finn_chapters],
              np.char.count(huck_finn_chapters, '.'))
plots.scatter([len(s) for s in little_women_chapters],
              np.char.count(little_women_chapters, '.'))

<matplotlib.collections.PathCollection at 0x109ebee10>


The plot shows us that many but not all of the chapters of Little Women are longer than those of Huckleberry Finn, as we had observed by just looking at the numbers. But it also shows us something more. Notice how the blue points are roughly clustered around a straight line, as are the yellow points. Moreover, it looks as though both colors of points might be clustered around the same straight line.

Indeed, it appears from looking at the plot that on average both books tend to have somewhere between 100 and 150 characters between periods. Perhaps these two great 19th century novels were signaling something so very familiar to us now: the 140-character limit of Twitter.
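That estimate can be checked directly against the tables above: dividing a chapter's character count by its period count gives a rough average sentence length for that chapter. A quick sketch using the first five Huckleberry Finn rows shown earlier:

```python
# Characters and periods for the first five Huckleberry Finn chapters,
# copied from the table above.
chars = [7026, 11982, 8529, 6799, 8166]
periods = [66, 117, 72, 84, 91]

# Characters per period: a rough average sentence length per chapter.
ratios = [round(c / p) for c, p in zip(chars, periods)]
print(ratios)  # [106, 102, 118, 81, 90]
```

Individual chapters vary around the 100-150 range; the claim in the text describes the overall trend across all chapters of both books.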


Causality and Experiments

"These problems are, and will probably ever remain, among the inscrutable secrets of nature. They belong to a class of questions radically inaccessible to the human intelligence." —The Times of London, September 1849, on how cholera is contracted and spread

Does the death penalty have a deterrent effect? Is chocolate good for you? What causes breast cancer?

All of these questions attempt to assign a cause to an effect. A careful examination of data can help shed light on questions like these. In this section you will learn some of the fundamental concepts involved in establishing causality.

Observation is a key to good science. An observational study is one in which scientists make conclusions based on data that they have observed but had no hand in generating. In data science, many such studies involve observations on a group of individuals, a factor of interest called a treatment, and an outcome measured on each individual.

It is easiest to think of the individuals as people. In a study of whether chocolate is good for the health, the individuals would indeed be people, the treatment would be eating chocolate, and the outcome might be a measure of blood pressure. But individuals in observational studies need not be people. In a study of whether the death penalty has a deterrent effect, the individuals could be the 50 states of the union. A state law allowing the death penalty would be the treatment, and an outcome could be the state's murder rate.

The fundamental question is whether the treatment has an effect on the outcome. Any relation between the treatment and the outcome is called an association. If the treatment causes the outcome to occur, then the association is causal. Causality is at the heart of all three questions posed at the start of this section. For example, one of the questions was whether chocolate directly causes improvements in health, not just whether there is a relation between chocolate and health.

The establishment of causality often takes place in two stages. First, an association is observed. Next, a more careful analysis leads to a decision about causality.


Observation and Visualization: John Snow and the Broad Street Pump

One of the earliest examples of astute observation eventually leading to the establishment of causality dates back more than 150 years. To get your mind into the right time frame, try to imagine London in the 1850's. It was the world's wealthiest city but many of its people were desperately poor. Charles Dickens, then at the height of his fame, was writing about their plight. Disease was rife in the poorer parts of the city, and cholera was among the most feared. It was not yet known that germs cause disease; the leading theory was that "miasmas" were the main culprit. Miasmas manifested themselves as bad smells, and were thought to be invisible poisonous particles arising out of decaying matter. Parts of London did smell very bad, especially in hot weather. To protect themselves against infection, those who could afford to held sweet-smelling things to their noses.

For several years, a doctor by the name of John Snow had been following the devastating waves of cholera that hit England from time to time. The disease arrived suddenly and was almost immediately deadly: people died within a day or two of contracting it, hundreds could die in a week, and the total death toll in a single wave could reach tens of thousands. Snow was skeptical of the miasma theory. He had noticed that while entire households were wiped out by cholera, the people in neighboring houses sometimes remained completely unaffected. As they were breathing the same air—and miasmas—as their neighbors, there was no compelling association between bad smells and the incidence of cholera.

Snow had also noticed that the onset of the disease almost always involved vomiting and diarrhea. He therefore believed that the infection was carried by something people ate or drank, not by the air that they breathed. His prime suspect was water contaminated by sewage.

At the end of August 1854, cholera struck in the overcrowded Soho district of London. As the deaths mounted, Snow recorded them diligently, using a method that went on to become standard in the study of how diseases spread: he drew a map. On a street map of the district, he recorded the location of each death.

Here is Snow's original map. Each black bar represents one death. The black discs mark the locations of water pumps. The map displays a striking revelation – the deaths are roughly clustered around the Broad Street pump.


Snow studied his map carefully and investigated the apparent anomalies. All of them implicated the Broad Street pump. For example:

- There were deaths in houses that were nearer the Rupert Street pump than the Broad Street pump. Though the Rupert Street pump was closer as the crow flies, it was less convenient to get to because of dead ends and the layout of the streets. The residents in those houses used the Broad Street pump instead.
- There were no deaths in two blocks just east of the pump. That was the location of the Lion Brewery, where the workers drank what they brewed. If they wanted water, the brewery had its own well.
- There were scattered deaths in houses several blocks away from the Broad Street pump. Those were children who drank from the Broad Street pump on their way to school. The pump's water was known to be cool and refreshing.

The final piece of evidence in support of Snow's theory was provided by two isolated deaths in the leafy and genteel Hampstead area, quite far from Soho. Snow was puzzled by these until he learned that the deceased were Mrs. Susannah Eley, who had once lived in Broad Street, and her niece. Mrs. Eley had water from the Broad Street pump delivered to her in Hampstead every day. She liked its taste.

Later it was discovered that a cesspit that was just a few feet away from the well of the Broad Street pump had been leaking into the well. Thus the pump's water was contaminated by sewage from the houses of cholera victims.

Snow used his map to convince local authorities to remove the handle of the Broad Street pump. Though the cholera epidemic was already on the wane when he did so, it is possible that the disabling of the pump prevented many deaths from future waves of the disease.

The removal of the Broad Street pump handle has become the stuff of legend. At the Centers for Disease Control (CDC) in Atlanta, when scientists look for simple answers to questions about epidemics, they sometimes ask each other, "Where is the handle to this pump?"

Snow's map is one of the earliest and most powerful uses of data visualization. Disease maps of various kinds are now a standard tool for tracking epidemics.

Towards Causality

Though the map gave Snow a strong indication that the cleanliness of the water supply was the key to controlling cholera, he was still a long way from a convincing scientific argument that contaminated water was causing the spread of the disease. To make a more compelling case, he had to use the method of comparison.

Scientists use comparison to identify an association between a treatment and an outcome. They compare the outcomes of a group of individuals who got the treatment (the treatment group) to the outcomes of a group who did not (the control group). For example, researchers today might compare the average murder rate in states that have the death penalty with the average murder rate in states that don't.

If the results are different, that is evidence for an association. To determine causation, however, even more care is needed.


Snow's "Grand Experiment"

Encouraged by what he had learned in Soho, Snow completed a more thorough analysis of cholera deaths. For some time, he had been gathering data on cholera deaths in an area of London that was served by two water companies. The Lambeth water company drew its water upriver from where sewage was discharged into the River Thames. Its water was relatively clean. But the Southwark and Vauxhall (S&V) company drew its water below the sewage discharge, and thus its supply was contaminated.

Snow noticed that there was no systematic difference between the people who were supplied by S&V and those supplied by Lambeth. "Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies … there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded …"

The only difference was in the water supply, "one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from impurity."

Confident that he would be able to arrive at a clear conclusion, Snow summarized his data in the table below.

Supply Area       Number of houses    Cholera deaths    Deaths per 10,000 houses
S&V               40,046              1,263             315
Lambeth           26,107              98                37
Rest of London    256,423             1,422             59

The numbers pointed accusingly at S&V. The death rate from cholera in the S&V houses was almost ten times the rate in the houses supplied by Lambeth.
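The rate column follows from simple division; a quick check for the first two rows, truncating to whole numbers as the table does:

```python
# Recompute deaths per 10,000 houses from the counts in Snow's table.
sv_rate = int(1263 / 40046 * 10000)      # S&V: 1,263 deaths in 40,046 houses
lambeth_rate = int(98 / 26107 * 10000)   # Lambeth: 98 deaths in 26,107 houses
print(sv_rate, lambeth_rate)             # 315 37
```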


Establishing Causality

In the language developed earlier in the section, you can think of the people in the S&V houses as the treatment group, and those in the Lambeth houses as the control group. A crucial element in Snow's analysis was that the people in the two groups were comparable to each other, apart from the treatment.

In order to establish whether it was the water supply that was causing cholera, Snow had to compare two groups that were similar to each other in all but one aspect – their water supply. Only then would he be able to ascribe the differences in their outcomes to the water supply. If the two groups had been different in some other way as well, it would have been difficult to point the finger at the water supply as the source of the disease. For example, if the treatment group consisted of factory workers and the control group did not, then differences between the outcomes in the two groups could have been due to the water supply, or to factory work, or both, or to any other characteristic that made the groups different from each other. The final picture would have been much more fuzzy.

Snow's brilliance lay in identifying two groups that would make his comparison clear. He had set out to establish a causal relation between contaminated water and cholera infection, and to a great extent he succeeded, even though the miasmatists ignored and even ridiculed him. Of course, Snow did not understand the detailed mechanism by which humans contract cholera. That discovery was made in 1883, when the German scientist Robert Koch isolated the Vibrio cholerae, the bacterium that enters the human small intestine and causes cholera.

In fact the Vibrio cholerae had been identified in 1854 by Filippo Pacini in Italy, just about when Snow was analyzing his data in London. Because of the dominance of the miasmatists in Italy, Pacini's discovery languished unknown. But by the end of the 1800's, the miasma brigade was in retreat. Subsequent history has vindicated Pacini and John Snow. Snow's methods led to the development of the field of epidemiology, which is the study of the spread of diseases.

Confounding

Let us now return to more modern times, armed with an important lesson that we have learned along the way:

In an observational study, if the treatment and control groups differ in ways other than the treatment, it is difficult to make conclusions about causality.

An underlying difference between the two groups (other than the treatment) is called a confounding factor, because it might confound you (that is, mess you up) when you try to reach a conclusion.


Example: Coffee and lung cancer. Studies in the 1960's showed that coffee drinkers had higher rates of lung cancer than those who did not drink coffee. Because of this, some people identified coffee as a cause of lung cancer. But coffee does not cause lung cancer. The analysis contained a confounding factor – smoking. In those days, coffee drinkers were also likely to have been smokers, and smoking does cause lung cancer. Coffee drinking was associated with lung cancer, but it did not cause the disease.

Confounding factors are common in observational studies. Good studies take great care to reduce confounding.


RandomizationAnexcellentwaytoavoidconfoundingistoassignindividualstothetreatmentandcontrolgroupsatrandom,andthenadministerthetreatmenttothosewhowereassignedtothetreatmentgroup.Randomizationkeepsthetwogroupssimilarapartfromthetreatment.

Ifyouareabletorandomizeindividualsintothetreatmentandcontrolgroups,youarerunningarandomizedcontrolledexperiment,alsoknownasarandomizedcontrolledtrial(RCT).Sometimes,people’sresponsesinanexperimentareinfluencedbytheirknowingwhichgrouptheyarein.Soyoumightwanttorunablindexperimentinwhichindividualsdonotknowwhethertheyareinthetreatmentgrouporthecontrolgroup.Tomakethiswork,youwillhavetogivethecontrolgroupaplacebo,whichissomethingthatlooksexactlylikethetreatmentbutinfacthasnoeffect.

Randomizedcontrolledexperimentshavelongbeenagoldstandardinthemedicalfield,forexampleinestablishingwhetheranewdrugworks.Theyarealsobecomingmorecommonlyusedinotherfieldssuchaseconomics.

Example: Welfare subsidies in Mexico. In Mexican villages in the 1990s, children in poor families were often not enrolled in school. One of the reasons was that the older children could go to work and thus help support the family. Santiago Levy, a minister in the Mexican Ministry of Finance, set out to investigate whether welfare programs could be used to increase school enrollment and improve health conditions. He conducted an RCT on a set of villages, selecting some of them at random to receive a new welfare program called PROGRESA. The program gave money to poor families if their children went to school regularly and the family used preventive health care. More money was given if the children were in secondary school than in primary school, to compensate for the children's lost wages, and more money was given for girls attending school than for boys. The remaining villages did not get this treatment, and formed the control group. Because of the randomization, there were no confounding factors and it was possible to establish that PROGRESA increased school enrollment. For boys, the enrollment increased from 73% in the control group to 77% in the PROGRESA group. For girls, the increase was even greater, from 67% in the control group to almost 75% in the PROGRESA group. Due to the success of this experiment, the Mexican government supported the program under the new name OPORTUNIDADES, as an investment in a healthy and well educated population.

In some situations it might not be possible to carry out a randomized controlled experiment, even when the aim is to investigate causality. For example, suppose you want to study the effects of alcohol consumption during pregnancy, and you randomly assign some pregnant


women to your "alcohol" group. You should not expect cooperation from them if you present them with a drink. In such situations you will almost invariably be conducting an observational study, not an experiment. Be alert for confounding factors.


Endnote

In the terminology that we have developed, John Snow conducted an observational study, not a randomized experiment. But he called his study a "grand experiment" because, as he wrote, "No fewer than three hundred thousand people … were divided into two groups without their choice, and in most cases, without their knowledge …"

Studies such as Snow's are sometimes called "natural experiments." However, true randomization does not simply mean that the treatment and control groups are selected "without their choice."

The method of randomization can be as simple as tossing a coin. It may also be quite a bit more complex. But every method of randomization consists of a sequence of carefully defined steps that allow chances to be specified mathematically. This has two important consequences.

1. It allows us to account, mathematically, for the possibility that randomization produces treatment and control groups that are quite different from each other.

2. It allows us to make precise mathematical statements about differences between the treatment and control groups. This in turn helps us make justifiable conclusions about whether the treatment has any effect.
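The coin-toss idea above can be sketched in a few lines of Python. This is an illustration only; the participant names and the even split are hypothetical, not from the text.

```python
import random

random.seed(20)  # fix the seed so this illustration is reproducible

# Hypothetical participants; in a real study these would be actual individuals.
participants = ['Ana', 'Ben', 'Cal', 'Dee', 'Eve', 'Fay']

# Shuffle, then split in half: chance alone decides who gets the treatment.
shuffled = participants[:]
random.shuffle(shuffled)
treatment = shuffled[:len(shuffled) // 2]
control = shuffled[len(shuffled) // 2:]

print(treatment, control)
```

Because the assignment procedure is fully specified, the chance that any particular person lands in the treatment group can be stated exactly (here, one half).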

In this course, you will learn how to conduct and analyze your own randomized experiments. That will involve more detail than has been presented in this section. For now, just focus on the main idea: to try to establish causality, run a randomized controlled experiment if possible. If you are conducting an observational study, you might be able to establish association but not causation. Be extremely careful about confounding factors before making conclusions about causality based on an observational study.

Terminology

observational study
treatment
outcome
association
causal association
causality
comparison
treatment group
control group
epidemiology


confounding
randomization
randomized controlled experiment
randomized controlled trial (RCT)
blind
placebo

Fun facts

1. John Snow is sometimes called the father of epidemiology, but he was an anesthesiologist by profession. One of his patients was Queen Victoria, who was an early recipient of anesthetics during childbirth.

2. Florence Nightingale, the originator of modern nursing practices and famous for her work in the Crimean War, was a die-hard miasmatist. She had no time for theories about contagion and germs, and was not one for mincing her words. "There is no end to the absurdities connected with this doctrine," she said. "Suffice it to say that in the ordinary sense of the word, there is no proof such as would be admitted in any scientific enquiry that there is any such thing as contagion."

3. A later RCT established that the conditions on which PROGRESA insisted (children going to school, preventive health care) were not necessary to achieve increased enrollment. Just the financial boost of the welfare payments was sufficient.

Good reads

The Strange Case of the Broad Street Pump: John Snow and the Mystery of Cholera by Sandra Hempel, published by our own University of California Press, reads like a whodunit. It was one of the main sources for this section's account of John Snow and his work. A word of warning: some of the contents of the book are stomach-churning.

Poor Economics, the bestseller by Abhijit V. Banerjee and Esther Duflo of MIT, is an accessible and lively account of ways to fight global poverty. It includes numerous examples of RCTs, including the PROGRESA example in this section.


Expressions

Programming can dramatically improve our ability to collect and analyze information about the world, which in turn can lead to discoveries through the kind of careful reasoning demonstrated in the previous section. In data science, the purpose of writing a program is to instruct a computer to carry out the steps of an analysis. Computers cannot study the world on their own. People must describe precisely what steps the computer should take in order to collect and analyze data, and those steps are expressed through programs.

To learn to program, we will begin by learning some of the grammar rules of a programming language. Programs are made up of expressions, which describe to the computer how to combine pieces of data. For example, a multiplication expression consists of a * symbol between two numerical expressions. Expressions, such as 3 * 4, are evaluated by the computer. The value (the result of evaluation) of the last expression in each cell, 12 in this case, is displayed below the cell.

3 * 4

12

The rules of a programming language are rigid. In Python, the * symbol cannot appear twice in a row. The computer will not try to interpret an expression that differs from its prescribed expression structures. Instead, it will show a SyntaxError. The syntax of a language is its set of grammar rules, and a SyntaxError indicates that some rule has been violated.

3 * * 4

File "<ipython-input-4-d90564f70db7>", line 1

3 * * 4

^

SyntaxError: invalid syntax

ComputationalandInferentialThinking

29Expressions

Page 30: Computational and inferential thinking

Small changes to an expression can change its meaning entirely. Below, the space between the *'s has been removed. Because ** appears between two numerical expressions, the expression is a well-formed exponentiation expression (the first number raised to the power of the second: 3 times 3 times 3 times 3). The symbols * and ** are called operators, and the values they combine are called operands.

3 ** 4

81

Common Operators. Data science often involves combining numerical values, and the set of operators in a programming language is designed so that expressions can be used to express any sort of arithmetic. In Python, the following operators are essential.

Expression Type    Operator    Example     Value
Addition           +           2 + 3       5
Subtraction        -           2 - 3       -1
Multiplication     *           2 * 3       6
Division           /           7 / 3       2.66667
Remainder          %           7 % 3       1
Exponentiation     **          2 ** 0.5    1.41421

Python expressions obey the same familiar rules of precedence as in algebra: multiplication and division occur before addition and subtraction. Parentheses can be used to group together smaller expressions within a larger expression.

1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8

16.555555555555557

1 + 2 * ((3 * 4 * 5 / 6) ** 3) + 7 + 8

2016.0


This chapter introduces many types of expressions. Learning to program involves trying out everything you learn in combination, investigating the behavior of the computer. What happens if you divide by zero? What happens if you divide twice in a row? You don't always need to ask an expert (or the Internet); many of these details can be discovered by trying them out yourself.
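For instance, here is a small sketch of two such experiments; the outcomes are standard Python 3 behavior:

```python
# Dividing twice in a row just divides the result again: (8 / 2) / 2.
print(8 / 2 / 2)      # 2.0

# Dividing by zero does not produce a value; Python reports an error.
try:
    1 / 0
except ZeroDivisionError as err:
    print('Error:', err)
```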


Names

Names are given to values in Python using an assignment statement. In an assignment, a name is followed by =, which is followed by another expression. The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions.

a = 10

b = 20

a + b

30

A previously assigned name can be used in the expression to the right of =.

quarter = 2

whole = 4 * quarter

whole

8

However, only the current value of an expression is assigned to a name. If that value changes later, names that were defined in terms of that value will not change automatically.

quarter = 4

whole

8

Names must start with a letter, but can contain both letters and numbers. A name cannot contain a space; instead, it is common to use an underscore character _ to replace each space. Names are only as useful as you make them; it's up to the programmer to choose


names that are easy to interpret. Typically, more meaningful names can be invented than a and b. For example, to describe the sales tax on a $5 purchase in Berkeley, CA, the following names clarify the meaning of the various quantities involved.

purchase_price = 5

state_tax_rate = 0.075

county_tax_rate = 0.02

city_tax_rate = 0

sales_tax_rate = state_tax_rate + county_tax_rate + city_tax_rate

sales_tax = purchase_price * sales_tax_rate

sales_tax

0.475


Example: Growth Rates

The relationship between two measurements of the same quantity taken at different times is often expressed as a growth rate. For example, the United States federal government employed 2,766,000 people in 2002 and 2,814,000 people in 2012 (http://www.bls.gov/opub/mlr/2013/article/industry-employment-and-output-projections-to-2022-1.htm). To compute a growth rate, we must first decide which value to treat as the initial amount. For values over time, the earlier value is a natural choice. Then, we divide the difference between the changed and initial amounts by the initial amount.

initial = 2766000

changed = 2814000

(changed - initial) / initial

0.01735357917570499

It is more typical to subtract one from the ratio of the two measurements, which yields the same value.

(changed / initial) - 1

0.017353579175704903

This value is the growth rate over 10 years. A useful property of growth rates is that they don't change even if the values are expressed in different units. So, for example, we can express the same relationship between thousands of people in 2002 and 2012.

initial = 2766

changed = 2814

(changed / initial) - 1

0.017353579175704903


In 10 years, the number of employees of the US Federal Government has increased by only 1.74%. In that time, the total expenditures of the US Federal Government increased from $2.37 trillion to $3.38 trillion in 2012.

initial = 2.37

changed = 3.38

(changed / initial) - 1

0.4261603375527425

A 42.6% increase in the federal budget is much larger than the 1.74% increase in federal employees. In fact, the number of federal employees has grown much more slowly than the population of the United States, which increased 9.21% in the same time period from 287.6 million people in 2002 to 314.1 million in 2012.

initial = 287.6

changed = 314.1

(changed / initial) - 1

0.09214186369958277

A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3 million in 2002 to 11.9 million in 2012, a -22.2% growth rate.

initial = 15.3

changed = 11.9

(changed / initial) - 1

-0.2222222222222222

An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives a much larger ten-year growth rate of 0.41 (or 41%).


1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 - 1

0.410598760621121

This same computation can be expressed using names and exponents for clarity.

annual_growth_rate = 0.035

ten_year_growth_rate = (1 + annual_growth_rate) ** 10 - 1

ten_year_growth_rate

0.410598760621121

Likewise, a ten-year growth rate can be used to compute an equivalent annual growth rate. Below, t is the number of years that have passed between measurements. The following computes the annual growth rate of federal expenditures over the last 10 years.

initial = 2.37

changed = 3.38

t = 10

(changed / initial) ** (1/t) - 1

0.03613617208346853

The total growth over 10 years is equivalent to a 3.6% increase each year.


Call Expressions

Call expressions invoke functions, which are named operations. The name of the function appears first, followed by expressions in parentheses.

abs(-12)

12

round(5 - 1.3)

4

max(2, 2 + 3, 4)

5

In this last example, the max function is called on three arguments: 2, 5, and 4. The value of each expression within parentheses is passed to the function, and the function returns the final value of the full call expression. The max function can take any number of arguments and returns the maximum.

A few functions are available by default, such as abs and round, but most functions that are built into the Python language are stored in a collection of functions called a module. An import statement is used to provide access to a module, such as math or operator.

import math

import operator

math.sqrt(operator.add(4, 5))

3.0


An equivalent expression could be expressed using the + and ** operators instead.

(4 + 5) ** 0.5

3.0

Operators and call expressions can be used together in an expression. The percent difference between two values is used to compare values for which neither one is obviously initial or changed. For example, in 2014 Florida farms produced 2.72 billion eggs while Iowa farms produced 16.25 billion eggs (http://quickstats.nass.usda.gov/). The percent difference is 100 times the absolute value of the difference between the values, divided by their average. In this case, the difference is larger than the average, and so the percent difference is greater than 100.

florida = 2.72

iowa = 16.25

100 * abs(florida - iowa) / ((florida + iowa) / 2)

142.6462836056932

Learning how different functions behave is an important part of learning a programming language. A Jupyter notebook can assist in remembering the names and effects of different functions. When editing a code cell, press the tab key after typing the beginning of a name to bring up a list of ways to complete that name. For example, press tab after math. to see all of the functions available in the math module. Typing will narrow down the list of options. To learn more about a function, place a ? after its name. For example, typing math.log? will bring up a description of the log function in the math module.

math.log?

log(x[, base])

Return the logarithm of x to the given base.

If the base is not specified, returns the natural logarithm (base e) of x.

The square brackets in the example call indicate that an argument is optional. That is, log can be called with either one or two arguments.


math.log(16, 2)

4.0

math.log(16) / math.log(2)

4.0


Data Types

Python includes several types of values that will allow us to compute about more than just numbers.

Type Name                Python Class    Example 1        Example 2
Integer                  int             1                -2
Floating-Point Number    float           1.5              -2.0
Boolean                  bool            True             False
String                   str             'hello world'    "False"

The first two types represent numbers. A "floating-point" number describes its representation by a computer, which is a topic for another text. Practically, the difference between an integer and a floating-point number is that the latter can represent fractions in addition to whole values.
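A small sketch of the difference, using Python's built-in type function:

```python
# The type function reports the type of a value.
print(type(2))       # <class 'int'>
print(type(2.0))     # <class 'float'>

# Division always yields a float, even when the answer is a whole number.
print(7 / 2)         # 3.5
print(6 / 3)         # 2.0
```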

Boolean values, named for the logician George Boole, represent truth and take only two possible values: True and False.

Strings

Strings are the most flexible type of all. They can contain arbitrary text. A string can describe a number or a truth value, as well as any word or sentence. Much of the world's data is text, and text is represented in computers as strings.

The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings together produces another string. This expression is still an addition expression, but it is combining a different type of value.

"data" + "science"

'datascience'

Addition is completely literal; it combines these two strings together without regard for their contents. It doesn't add a space because these are different words; that's up to the programmer (you) to specify.


"data" + "sc" + "ienc" + "e"

'datascience'

Single and double quotes can both be used to create strings: 'hi' and "hi" are identical expressions. Double quotes are often preferred because they allow you to include apostrophes inside of strings.

"This won't work with a single-quoted string!"

"This won't work with a single-quoted string!"

Why not? Try it out.
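One hint, sketched here: inside single quotes, the apostrophe in "won't" would end the string early, making the rest a syntax error. A backslash escape prevents that.

```python
# 'This won't work...' fails because the apostrophe closes the string.
# Escaping the apostrophe with a backslash makes the same text valid.
s = 'This won\'t work with a single-quoted string!'
print(s)
```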

The str function returns a string representation of any value. Using this function, strings can be constructed that have embedded values.

"That's " + str(1 + 1) + ' ' + str(True)

"That's 2 True"

String Methods

From an existing string, related strings can be constructed using string methods, which are functions that operate on strings. These methods are called by placing a dot after the string, then calling the function.

For example, the following method generates an uppercased version of a string.

"loud".upper()

'LOUD'

Perhaps the most important method is replace, which replaces all instances of a substring within the string. The replace method takes two arguments, the text to be replaced and its replacement.


'hitchhiker'.replace('hi', 'ma')

'matchmaker'

String methods can also be invoked using variable names, as long as those names are bound to strings. So, for instance, the following two-step process generates the word "degrade" starting from "train" by first creating "ingrain" and then applying a second replacement.

s = "train"

t = s.replace('t', 'ing')

u = t.replace('in', 'de')

u

'degrade'


Comparisons

Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, 3 is larger than 1 + 1.

3 > 1 + 1

True

The value True indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between 3 and 1 + 1. The full set of common comparison operators are listed below.

Comparison            Operator    True example    False example
Less than             <           2 < 3           2 < 2
Greater than          >           3 > 2           3 > 3
Less than or equal    <=          2 <= 2          3 <= 2
Greater or equal      >=          3 >= 3          2 >= 3
Equal                 ==          3 == 3          3 == 2
Not equal             !=          3 != 2          2 != 2

An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express that 1 + 1 is between 1 and 3 using the following expression.

1 < 1 + 1 < 3

True

The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers x and y below. You can try different values of x and y to confirm this relationship.


x = 12

y = 5

min(x, y) <= (x + y) / 2 <= max(x, y)

True

Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.

"Dog" > "Catastrophe" > "Cat"

True


Sequences

Values can be grouped together into collections, which allows programmers to organize those values and refer to all of them with a single name. Separating expressions by commas within square brackets places the values of those expressions into a list, which is a built-in sequential collection. After a comma, it's okay to start a new line. Below, we collect four different temperatures into a list called highs. These are the estimated average daily high temperatures over all land on Earth (in degrees Celsius) for the decades surrounding 1850, 1900, 1950, and 2000, respectively, expressed as deviations from the average absolute high temperature between 1951 and 1980, which was 14.48 degrees. (http://berkeleyearth.lbl.gov/regions/global-land)

baseline_high = 14.48

highs = [baseline_high - 0.880, baseline_high - 0.093,
         baseline_high + 0.105, baseline_high + 0.684]

highs

[13.6, 14.387, 14.585, 15.164]

Lists allow us to pass multiple values into a function using a single name. For instance, the sum function computes the sum of all values in a list, and the len function computes the length of the list. Using them together, we can compute the average of the list.

sum(highs) / len(highs)

14.434000000000001

Tuples. Values can also be grouped together into tuples in addition to lists. A tuple is expressed by separating elements using commas within parentheses: (1, 2, 3). In many cases, lists and tuples are interchangeable. We will use a tuple to hold the average daily low temperatures for the same four time ranges. The only change in the code is the switch to parentheses instead of brackets.


baseline_low = 3.00

lows = (baseline_low - 0.872, baseline_low - 0.629,
        baseline_low - 0.126, baseline_low + 0.728)

lows

(2.128, 2.371, 2.874, 3.7279999999999998)

The complete charts of daily high and low temperatures appear below.

[Chart: Mean of Daily High Temperature]

[Chart: Mean of Daily Low Temperature]


Arrays

Many experiments and datasets involve multiple values of the same type. An array is a collection of values that all have the same type. The numpy package, abbreviated np in programs, provides Python programmers with convenient and powerful functions for creating and manipulating arrays.

import numpy as np

An array is created using the np.array function, which takes a list or tuple as an argument. Here, we create arrays of average daily high and low temperatures for the decades surrounding 1850, 1900, 1950, and 2000.

baseline_high = 14.48

highs = np.array([baseline_high - 0.880,
                  baseline_high - 0.093,
                  baseline_high + 0.105,
                  baseline_high + 0.684])

highs

array([13.6, 14.387, 14.585, 15.164])

baseline_low = 3.00

lows = np.array((baseline_low - 0.872, baseline_low - 0.629,
                 baseline_low - 0.126, baseline_low + 0.728))

lows

array([2.128, 2.371, 2.874, 3.728])

Arrays differ from lists and tuples because they can be used in arithmetic expressions to compute over their contents. When an array is combined with a single number, that number is combined with each element of the array. Therefore, we can convert all of these temperatures to Fahrenheit by writing the familiar conversion formula.


(9/5) * highs + 32

array([56.48, 57.8966, 58.253, 59.2952])

When two arrays are combined together using an arithmetic operator, their individual values are combined. For example, we can compute the daily temperature ranges by subtracting.

highs - lows

array([11.472, 12.016, 11.711, 11.436])

Arrays also have methods, which are functions that operate on the array values. The mean of a collection of numbers is its average value: the sum divided by the length. Each pair of parentheses in the example below is part of a call expression; it's calling a function with no arguments to perform a computation on the array called highs.

[highs.size, highs.sum(), highs.mean()]

[4, 57.736000000000004, 14.434000000000001]

It's common that small rounding errors appear in arithmetic on a computer. For instance, the mean (average) above is in fact 14.434, but an extra 0.0000000000000001 has been added during the computation of the average. Arithmetic rounding errors are an important topic in scientific computing, but won't be a focus of this text. Everyone who works with data must be aware that some operations will be approximated so that values can be stored and manipulated efficiently.
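A classic illustration of such rounding, easy to verify in any Python session:

```python
# 0.1 and 0.2 have no exact binary representation, so their sum is not
# exactly 0.3.
print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False

# Compare with a small tolerance instead of exact equality.
print(abs((0.1 + 0.2) - 0.3) < 1e-9)   # True
```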

Functions on Arrays

In addition to basic arithmetic and methods such as min and max, manipulating arrays often involves calling functions that are part of the np package. For example, the diff function computes the difference between each adjacent pair of elements in an array. The first element of the diff is the second element minus the first.

np.diff(highs)


array([0.787, 0.198, 0.579])

The full NumPy reference lists these functions exhaustively, but only a small subset are used commonly for data processing applications. These are grouped into different packages within np. Learning this vocabulary is an important part of learning the Python language, so refer back to this list often as you work through examples and problems.

Each of these functions takes an array as an argument and returns a single value.

Function            Description
np.prod             Multiply all elements together
np.sum              Add all elements together
np.all              Test whether all elements are true values (non-zero numbers are true)
np.any              Test whether any elements are true values (non-zero numbers are true)
np.count_nonzero    Count the number of non-zero elements

Each of these functions takes an array as an argument and returns an array of values.

Function      Description
np.diff       Difference between adjacent elements
np.round      Round each number to the nearest integer (whole number)
np.cumprod    A cumulative product: for each element, multiply all elements so far
np.cumsum     A cumulative sum: for each element, add all elements so far
np.exp        Exponentiate each element
np.log        Take the natural logarithm of each element
np.sqrt       Take the square root of each element
np.sort       Sort the elements

Each of these functions takes an array of strings and returns an array.


Function             Description
np.char.lower        Lowercase each element
np.char.upper        Uppercase each element
np.char.strip        Remove spaces at the beginning or end of each element
np.char.isalpha      Whether each element is only letters (no numbers or symbols)
np.char.isnumeric    Whether each element is only numeric (no letters)

Each of these functions takes both an array of strings and a search string; each returns an array.

Function              Description
np.char.count         Count the number of times a search string appears among the elements of an array
np.char.find          The position within each element that a search string is found first
np.char.rfind         The position within each element that a search string is found last
np.char.startswith    Whether each element starts with the search string
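A brief sketch showing a few of the listed functions in action; the sample values are the temperatures from earlier in this section.

```python
import numpy as np

temps = np.array([13.6, 14.387, 14.585, 15.164])
print(np.sum(temps))      # the total of the four temperatures
print(np.round(temps))    # each value rounded to the nearest integer
print(np.cumsum(temps))   # running totals

words = np.array(['data', 'science'])
print(np.char.upper(words))            # uppercased copies
print(np.char.startswith(words, 'd'))  # which elements start with 'd'
```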


Ranges

A range is an array of numbers in increasing or decreasing order, each separated by a regular interval. Ranges are defined using the np.arange function, which takes either one, two, or three arguments.

np.arange(end): An array starting with 0 of increasing integers up to end

np.arange(start, end): An array of increasing integers from start up to end

np.arange(start, end, step): A range with step between each pair of consecutive values

A range always includes its start value, but does not include its end value. The step can be either positive or negative and may be a whole number or a fraction.
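A sketch of the three forms side by side, with illustrative values:

```python
import numpy as np

print(np.arange(5))            # start defaults to 0: [0 1 2 3 4]
print(np.arange(3, 8))         # from 3 up to (but not including) 8
print(np.arange(10, 0, -2))    # a negative step counts down: [10 8 6 4 2]
print(np.arange(0, 1, 0.25))   # a fractional step: 0., 0.25, 0.5, 0.75
```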

by_four = np.arange(1, 100, 4)

by_four

array([ 1,  5,  9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, 65,
       69, 73, 77, 81, 85, 89, 93, 97])

Ranges have many uses. For instance, a range can be used to compute part of the Leibniz formula for π, which is typically written as

π = 4 · (1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + ...)

4 * sum(1/by_four - 1/(by_four + 2))

3.1215946525910097

The Birthday Problem

There are k students in a class. What is the chance that at least two of the students have the same birthday?

Assumptions


1. No leap years; every year has 365 days
2. Births are distributed evenly throughout the year
3. No student's birthday is affected by any other (e.g. twins)

Let's start with an easy case: k is 4. We'll first find the chance that all four people have different birthdays.

all_different = (364/365) * (363/365) * (362/365)

The chance that there is at least one pair with the same birthday is equivalent to the chance that the birthdays are not all different. Given the chance of some event occurring, the chance that the event does not occur is one minus the chance that it does. With only 4 people, the chance that any two have the same birthday is less than 2%.

1 - all_different

0.016355912466550215

Using a range, we can express this same computation compactly. We begin with the numerators of each factor.

k = 4

numerators = np.arange(364, 365 - k, -1)

numerators

array([364, 363, 362])

Then, we divide each numerator by 365 to form the factors, multiply them all together, and subtract from 1.

1 - np.prod(numerators / 365)

0.016355912466550215

If k is 40, the chance of two birthdays being the same is much higher: almost 90%!


k = 40

numerators = np.arange(364, 365 - k, -1)

1 - np.prod(numerators / 365)

0.89123180981794903

Using ranges, we can investigate how this chance changes as k increases. The np.cumprod function computes the cumulative product of an array. That is, it computes a new array that has the same length as its input, but the ith element is the product of the first i terms. Below, the fourth term of the result is 1 * 2 * 3 * 4.

ten = np.arange(1, 9)

np.cumprod(ten)

array([1, 2, 6, 24, 120, 720, 5040, 40320])

The following cell computes the chance of matching birthdays for every class size from 2 up to 365. Scrolling through the result, you will see that as k increases, the chance of matching birthdays reaches 1 long before the end of the array. In fact, for any k smaller than 365, there is a chance that all k students in a class can have different birthdays. The chance is so small, however, that the difference from 1 is rounded away by the computer.

numerators = np.arange(364, 0, -1)

chances = 1 - np.cumprod(numerators / 365)

chances

array([0.00273973, 0.00820417, 0.01635591, 0.02713557, 0.04046248,
       0.0562357, 0.07433529, 0.09462383, 0.11694818, 0.14114138,
       0.16702479, 0.19441028, 0.22310251, 0.25290132, 0.28360401,
       0.31500767, 0.34691142, 0.37911853, 0.41143838, 0.44368834,
       0.47569531, 0.50729723, 0.53834426, 0.5686997, 0.59824082,
       0.62685928, 0.65446147, 0.68096854, 0.70631624, 0.73045463,
       0.75334753, 0.77497185, 0.79531686, 0.81438324, 0.83218211,
       0.84873401, 0.86406782, 0.87821966, 0.89123181, 0.90315161,
       0.91403047, 0.92392286, 0.93288537, 0.9409759, 0.94825284,
       0.9547744, 0.96059797, 0.96577961, 0.97037358, 0.97443199,


       0.97800451, 0.98113811, 0.98387696, 0.98626229, 0.98833235,
       0.99012246, 0.99166498, 0.99298945, 0.99412266, 0.9950888,
       0.99590957, 0.99660439, 0.99719048, 0.99768311, 0.9980957,
       0.99844004, 0.99872639, 0.99896367, 0.99915958, 0.99932075,
       0.99945288, 0.99956081, 0.99964864, 0.99971988, 0.99977744,
       0.99982378, 0.99986095, 0.99989067, 0.99991433, 0.99993311,
       0.99994795, 0.99995965, 0.99996882, 0.999976, 0.99998159,
       0.99998593, 0.99998928, 0.99999186, 0.99999385, 0.99999537,
       0.99999652, 0.9999974, 0.99999806, 0.99999856, 0.99999893,
       0.99999922, 0.99999942, 0.99999958, 0.99999969, 0.99999978,
       0.99999984, 0.99999988, 0.99999992, 0.99999994, 0.99999996,
       0.99999997, 0.99999998, 0.99999998, 0.99999999, 0.99999999,

       0.99999999, 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,


       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1.,
       1., 1., 1., 1.])

The item method of an array allows us to select a particular element from the array by its position. The starting position of an array is item(0), so finding the chance of matching birthdays in a 40-person class involves extracting item(40 - 2) (because the starting item is for a 2-person class).

chances.item(40 - 2)

0.891231809817949


Tables

Tables are a fundamental object type for representing data sets. A table can be viewed in two ways. Tables are a sequence of named columns that each describe a single aspect of all entries in a data set. Tables are also a sequence of rows that each contain all information about a single entry in a data set.

In order to use tables, import all of the module called datascience, a module created for this text.

from datascience import *

Empty tables can be created using the Table function, which optionally takes a list of column labels. The with_row and with_rows methods of a table return new tables with one or more additional rows. For example, we could create a table of Odd and Even numbers.

Table(['Odd', 'Even']).with_row([3, 4])

Odd    Even
3      4

Table(['Odd', 'Even']).with_rows([[3, 4], [5, 6], [7, 8]])

Odd    Even
3      4
5      6
7      8

Tables can also be extended with additional columns. The with_column and with_columns methods return new tables with additional labeled columns. Below, we begin each example with an empty table that has no columns.

Table().with_column('Odd',[3,5,7])


Odd

3

5

7

Table().with_columns([

'Odd',[3,5,7],

'Even',[4,6,8]

])

Odd Even

3 4

5 6

7 8

Tables are more typically created from files that contain comma-separated values, called CSV files. The file below contains "Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States."

census_url='http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.csv'

full_census_table=Table.read_table(census_url)

full_census_table

SEX AGE CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010

0 0 3944153 3944160 3951330

0 1 3978070 3978090 3957888

0 2 4096929 4096939 4090862

0 3 4119040 4119051 4111920

0 4 4063170 4063186 4077552

0 5 4056858 4056872 4064653

0 6 4066381 4066412 4073013

0 7 4030579 4030594 4043047

0 8 4046486 4046497 4025604

0 9 4148353 4148369 4125415


... (296 rows omitted)

A description of the table appears online. The SEX column contains numeric codes: 0 stands for the total, 1 for male, and 2 for female. The AGE column contains ages, but the special value 999 represents the total across all ages. The rest of the columns contain estimates of the US population.

Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. We can select only a subset of the columns using the select method.

partial_census_table = full_census_table.select(['SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2014'])

partial_census_table

SEX AGE POPESTIMATE2010 POPESTIMATE2014

0 0 3951330 3948350

0 1 3957888 3962123

0 2 4090862 3957772

0 3 4111920 4005190

0 4 4077552 4003448

0 5 4064653 4004858

0 6 4073013 4134352

0 7 4043047 4154000

0 8 4025604 4119524

0 9 4125415 4106832

... (296 rows omitted)

The relabeled method creates an alternative version of the table that replaces a column label. Using this method, we can simplify the labels of the selected columns.

simple = partial_census_table.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2014', '2014')

simple


SEX AGE 2010 2014

0 0 3951330 3948350

0 1 3957888 3962123

0 2 4090862 3957772

0 3 4111920 4005190

0 4 4077552 4003448

0 5 4064653 4004858

0 6 4073013 4134352

0 7 4043047 4154000

0 8 4025604 4119524

0 9 4125415 4106832

... (296 rows omitted)

The column method returns an array of the values in a particular column. Each column of a table is an array of the same length, and so columns can be combined using arithmetic.

simple.column('2014')-simple.column('2010')

array([ -2980, 4235, -133090, ..., 6717, 13410, 4662996])

A table with an additional column can be created from an existing table using the with_column method, which takes a label and a sequence of values as arguments. There must be either as many values as there are existing rows in the table or a single value (which fills the column with that value).

change=simple.column('2014')-simple.column('2010')

annual_growth_rate = (simple.column('2014') / simple.column('2010')) ** (1/4) - 1

census = simple.with_columns(['Change', change, 'Growth', annual_growth_rate])

census


SEX AGE 2010 2014 Change Growth

0 0 3951330 3948350 -2980 -0.000188597

0 1 3957888 3962123 4235 0.000267397

0 2 4090862 3957772 -133090 -0.00823453

0 3 4111920 4005190 -106730 -0.0065532

0 4 4077552 4003448 -74104 -0.00457471

0 5 4064653 4004858 -59795 -0.00369821

0 6 4073013 4134352 61339 0.00374389

0 7 4043047 4154000 110953 0.00679123

0 8 4025604 4119524 93920 0.00578232

0 9 4125415 4106832 -18583 -0.00112804

... (296 rows omitted)
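The Growth column follows the growth-rate formula from the Growth Rates example earlier in the text: the annual rate g such that four years of compounding, (1 + g)**4, carries a 2010 population to its 2014 value. A quick check against the age-0 row of the table:

```python
# Age-0 row of the census table: US population estimates for 2010 and 2014.
pop_2010, pop_2014 = 3951330, 3948350
# The annual growth rate g satisfies pop_2010 * (1 + g)**4 == pop_2014.
g = (pop_2014 / pop_2010) ** (1 / 4) - 1
print(g)   # matches the Growth entry for age 0, about -0.000189
```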

Although the columns of this table are simply arrays of numbers, the format of those numbers can be changed to improve the interpretability of the table. The set_format method takes Formatter objects, which exist for dates (DateFormatter), currencies (CurrencyFormatter), numbers, and percentages.

census.set_format('Growth',PercentFormatter)

census.set_format(['2010','2014','Change'],NumberFormatter)

SEX AGE 2010 2014 Change Growth

0 0 3,951,330 3,948,350 -2,980 -0.02%

0 1 3,957,888 3,962,123 4,235 0.03%

0 2 4,090,862 3,957,772 -133,090 -0.82%

0 3 4,111,920 4,005,190 -106,730 -0.66%

0 4 4,077,552 4,003,448 -74,104 -0.46%

0 5 4,064,653 4,004,858 -59,795 -0.37%

0 6 4,073,013 4,134,352 61,339 0.37%

0 7 4,043,047 4,154,000 110,953 0.68%

0 8 4,025,604 4,119,524 93,920 0.58%

0 9 4,125,415 4,106,832 -18,583 -0.11%

... (296 rows omitted)


The information in a table can be accessed in many ways. The attributes demonstrated below can be used for any table.

census.labels

('SEX','AGE','2010','2014','Change','Growth')

census.num_columns

6

census.num_rows

306

census.row(5)

Row(SEX=0,AGE=5,2010=4064653,2014=4004858,Change=-59795,Growth=-0.0036982077956316806)

census.column(2)

array([3951330, 3957888, 4090862, ..., 26074, 45058, 157257573])

An individual item in a table can be accessed via a row or a column.

census.row(0).item(2)

3951330


census.column(2).item(0)

3951330

Let's take a look at the growth rates of the total numbers of males and females by selecting only the rows that sum over all ages. This sum is expressed with the special value 999, according to the data set description.

census.where('AGE',999)

SEX AGE 2010 2014 Change Growth

0 999 309,347,057 318,857,056 9,509,999 0.76%

1 999 152,089,484 156,936,487 4,847,003 0.79%

2 999 157,257,573 161,920,569 4,662,996 0.73%

What ages of males are driving this rapid growth? We can first filter the census table to keep only the male entries, then sort by growth rate in decreasing order.

males=census.where('SEX',1)

males.sort('Growth',descending=True)

SEX AGE 2010 2014 Change Growth

1 99 6,104 9,037 2,933 10.31%

1 100 9,351 13,729 4,378 10.08%

1 98 9,504 13,649 4,145 9.47%

1 93 60,182 85,980 25,798 9.33%

1 96 22,022 31,235 9,213 9.13%

1 94 43,828 62,130 18,302 9.12%

1 97 14,775 20,479 5,704 8.50%

1 95 31,736 42,824 11,088 7.78%

1 91 104,291 138,080 33,789 7.27%

1 92 83,462 109,873 26,411 7.12%

... (92 rows omitted)


The fact that there are more men with AGE of 100 than 99 looks suspicious; shouldn't there be fewer? A careful look at the description of the data set reveals that the 100 category actually includes all men who are 100 or older. The growth rates in men at these very old ages could have several explanations, such as a large influx from another country, but the most natural explanation is that people are simply living longer in 2014 than in 2010.

The where method can also take an array of boolean values, constructed by applying some comparison operator to a column of the table. For example, we can find all of the age groups among males for which the absolute Change is substantial. The show method displays all rows without abbreviating.

males.where(males.column('Change')>200000).show()

SEX AGE 2010 2014 Change Growth

1 23 2,151,095 2,399,883 248,788 2.77%

1 24 2,161,380 2,391,398 230,018 2.56%

1 34 1,908,761 2,192,455 283,694 3.52%

1 57 1,910,028 2,110,149 200,121 2.52%

1 59 1,779,504 2,006,900 227,396 3.05%

1 64 1,291,843 1,661,474 369,631 6.49%

1 65 1,272,693 1,607,688 334,995 6.02%

1 66 1,239,805 1,589,127 349,322 6.40%

1 67 1,270,148 1,653,257 383,109 6.81%

1 71 903,263 1,169,356 266,093 6.67%

1 999 152,089,484 156,936,487 4,847,003 0.79%

By far the largest changes are clumped together in the 64-67 range. In 2014, these people would have been born between 1947 and 1951, the height of the post-WWII baby boom in the United States.
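The comparison males.column('Change') > 200000 above works elementwise: comparing an array to a number produces an array of booleans, one per row, which where then uses to decide which rows to keep. A small numpy sketch, using a few Change values taken from the table:

```python
import numpy as np

# A few values from the Change column: ages 0, 1, and males aged 23 and 7.
change = np.array([-2980, 4235, 248788, 110953])
mask = change > 200000   # elementwise comparison yields a boolean array
print(mask)
```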

It is possible to specify multiple conditions using the functions np.logical_and and np.logical_or. When two conditions are combined with logical_and, both must be true for a row to be retained. When conditions are combined with logical_or, either one of them must be true for a row to be retained. Here are two different ways to select 18 and 19 year olds.


both=census.where(census.column('SEX')!=0)

both.where(np.logical_or(both.column('AGE')==18,

both.column('AGE')==19))

SEX AGE 2010 2014 Change Growth

1 18 2,305,733 2,165,062 -140,671 -1.56%

1 19 2,334,906 2,220,790 -114,116 -1.24%

2 18 2,185,272 2,060,528 -124,744 -1.46%

2 19 2,236,479 2,105,604 -130,875 -1.50%

both.where(np.logical_and(both.column('AGE')>=18,

both.column('AGE')<=19))

SEX AGE 2010 2014 Change Growth

1 18 2,305,733 2,165,062 -140,671 -1.56%

1 19 2,334,906 2,220,790 -114,116 -1.24%

2 18 2,185,272 2,060,528 -124,744 -1.46%

2 19 2,236,479 2,105,604 -130,875 -1.50%

Here, we observe a rather dramatic decrease in the number of 18 and 19 year olds in the United States; the children of the baby boom generation are now in their 20's, leaving fewer teenagers in America.

Aggregation and Grouping. This particular data set includes entries for sums across all ages and sexes, using the special values 999 and 0 respectively. However, if these rows did not exist, we would be able to aggregate the remaining entries in order to compute those same results.

both.where('AGE',999).select(['SEX','2014'])

SEX 2014

1 156,936,487

2 161,920,569


no_sums = both.select(['SEX', 'AGE', '2014']).where(both.column('AGE') != 999)

females=no_sums.where('SEX',2).select(['AGE','2014'])

females

AGE 2014

0 1,930,493

1 1,938,870

2 1,935,270

3 1,956,572

4 1,959,950

5 1,961,391

6 2,024,024

7 2,031,760

8 2,014,402

9 2,009,560

... (91 rows omitted)

sum(females['2014'])

161920569

Some columns express categories, such as the sex or age group in the case of the census table. Aggregation can also be performed on every value for a category using the group and groups methods. The group method takes a single column (or column name) as an argument and collects all values associated with each unique value in that column.

The second argument to group, the name of a function, specifies how to aggregate the resulting values. For example, the sum function can be used to add together the populations for each age. In this result, the SEX sum column is meaningless because the values were simply codes for different categories. However, the 2014 sum column does in fact contain the total number across all sexes for each age category.

no_sums.select(['AGE','2014']).group('AGE',sum)


AGE 2014 sum

0 3948350

1 3962123

2 3957772

3 4005190

4 4003448

5 4004858

6 4134352

7 4154000

8 4119524

9 4106832

... (91 rows omitted)

Joining Tables. There are many situations in data analysis in which two different rows need to be considered together. Two tables can be joined into one, an operation that creates one long row out of two matching rows. These matching rows can be from the same table or from different tables.

For example, we might want to estimate which age categories are expected to change significantly in the future. Someone who is 14 years old in 2014 will be 20 years old in 2020. Therefore, one estimate of the number of 20 year olds in 2020 is the number of 14 year olds in 2014. Between the ages of 1 and 50, annual mortality rates are very low (less than 0.5% for men and less than 0.3% for women [1]). Immigration also affects population changes, but for now we will ignore its influence. Let's consider just females in this analysis.

females['AGE+6']=females['AGE']+6

females


AGE 2014 AGE+6

0 1,930,493 6

1 1,938,870 7

2 1,935,270 8

3 1,956,572 9

4 1,959,950 10

5 1,961,391 11

6 2,024,024 12

7 2,031,760 13

8 2,014,402 14

9 2,009,560 15

... (91 rows omitted)

In order to relate the age in 2014 to the age in 2020, we will join this table with itself, matching values in the AGE column with values in the AGE+6 column.

joined=females.join('AGE',females,'AGE+6')

joined

AGE 2014 AGE+6 AGE_2 2014_2

6 2024024 12 0 1930493

7 2031760 13 1 1938870

8 2014402 14 2 1935270

9 2009560 15 3 1956572

10 2015380 16 4 1959950

11 2001949 17 5 1961391

12 1993547 18 6 2024024

13 2041159 19 7 2031760

14 2068252 20 8 2014402

15 2035299 21 9 2009560

... (85 rows omitted)


The resulting table has five columns. The first three are the same as before. The two new columns are the values for AGE and 2014 that appeared in a different row, the one in which that AGE appeared in the AGE+6 column. For instance, the first row contains the number of 6 year olds in 2014 and an estimate of the number of 6 year olds in 2020 (who were 0 in 2014). Some relabeling and selecting makes this table more clear.

future = joined.select(['AGE', '2014', '2014_2']).relabeled('2014_2', '2020(est)')

future

AGE 2014 2020(est)

6 2024024 1930493

7 2031760 1938870

8 2014402 1935270

9 2009560 1956572

10 2015380 1959950

11 2001949 1961391

12 1993547 2024024

13 2041159 2031760

14 2068252 2014402

15 2035299 2009560

... (85 rows omitted)

Adding a Change column and sorting by that change describes some of the major changes in age categories that we can expect in the United States for people under 50. According to this simplistic analysis, there will be substantially more people in their late 30's by 2020.

predictions=future.where(future['AGE']<50)

predictions['Change'] = predictions['2020(est)'] - predictions['2014']

predictions.sort('Change',descending=True)


AGE 2014 2020(est) Change

40 1940627 2170440 229813

38 1936610 2154232 217622

30 2110672 2301237 190565

37 1995155 2148981 153826

39 1993913 2135416 141503

29 2169563 2298701 129138

35 2046713 2169563 122850

36 2009423 2110672 101249

28 2144666 2244480 99814

41 1977497 2046713 69216

... (34 rows omitted)


Functions

We are building up a useful inventory of techniques for identifying patterns and themes in a data set. Sorting and grouping rows of a table can focus our attention. The next approach to analysis we will consider involves grouping rows of a table by arbitrary criteria. To do so, we will explore two core features of the Python programming language: function definition and conditional statements.

We have used functions extensively already in this text, but never defined a function of our own. The purpose of defining a function is to give a name to a computational process that may be applied multiple times. There are many situations in computing that require repeated computation. For example, it is often the case that we will want to perform the same manipulation on every value in a column.

A function is defined in Python using a def statement, which is a multi-line statement that begins with a header line giving the name of the function and names for the arguments of the function. The rest of the def statement, called the body, must be indented below the header.

A function expresses a relationship between its inputs (called arguments) and its outputs (called return values). The number of arguments required to call a function is the number of names that appear within parentheses in the def statement header. The values that are returned depend on the body. Whenever a function is called, its body is executed. Whenever a return statement within the body is executed, the call to the function completes and the value of the return expression is returned as the value of the function call.

The definition of the percent function below multiplies a number by 100 and rounds the result to two decimal places.

def percent(x):
    return round(100 * x, 2)

The primary difference between defining a percent function and simply evaluating its return expression, round(100 * x, 2), is that when a function is defined, its return expression is not immediately evaluated. It cannot be, because the value for x is not yet defined. Instead, the return expression is evaluated whenever this percent function is called by placing parentheses after the name percent and placing an expression to compute its argument in parentheses.


percent(1/6)

16.67

percent(1/6000)

0.02

percent(1/60000)

0.0

The three expressions above are all call expressions. In the first, the value of 1/6 is computed and then passed as the argument named x to the percent function. When the percent function is called in this way, its body is executed. The body of percent has only a single line: return round(100 * x, 2). Executing this return statement completes execution of the percent function's body and computes the value of the call expression.

The same result is computed by passing a named value as an argument. The percent function does not know or care how its argument is computed or stored; its only job is to execute its own body using the arguments passed to it.

sixth=1/6

percent(sixth)

16.67

Conditional Statements. The body of a function can have more than one line and more than one return statement. A conditional statement is a multi-line statement that allows Python to choose among different alternatives based on the truth value of an expression. While conditional statements can appear anywhere, they appear most often within the body of a function in order to express alternative behavior depending on argument values.


A conditional statement always begins with an if header, which is a single line followed by an indented body. The body is only executed if the expression directly following if (called the if expression) evaluates to a true value. If the if expression evaluates to a false value, then the body of the if is skipped.

For example, we can improve our percent function so that it doesn't round very small numbers to zero so readily. The behavior of percent(1/6) is unchanged, but percent(1/60000) provides a more useful result.

def percent(x):
    if x < 0.00005:
        return 100 * x
    return round(100 * x, 2)

percent(1/6)

16.67

percent(1/6000)

0.02

percent(1/60000)

0.0016666666666666668

A conditional statement can also have multiple clauses with multiple bodies, and only one of those bodies can ever be executed. The general format of a multi-clause conditional statement appears below.


if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>

There is always exactly one if clause, but there can be any number of elif clauses. Python will evaluate the if and elif expressions in the headers in order until one is found that is a true value, then execute the corresponding body. The else clause is optional. When an else header is provided, its else body is executed only if none of the header expressions of the previous clauses are true. The else clause must always come at the end (or not at all).

Let us continue to refine our percent function. Perhaps for some analysis, any value below 1e-8 should be considered close enough to 0 that it can be ignored. The following function definition handles this case as well.

def percent(x):
    if x < 1e-8:
        return 0.0
    elif x < 0.00005:
        return 100 * x
    else:
        return round(100 * x, 2)

percent(1/6)

16.67

percent(1/6000)

0.02


percent(1/60000)

0.0016666666666666668

percent(1/60000000000)

0.0

A well-composed function has a name that evokes its behavior, as well as a docstring — a description of its behavior and expectations about its arguments. The docstring can also show example calls to the function, where the call is preceded by >>>.

A docstring can be any string that immediately follows the header line of a def statement. Docstrings are typically defined using triple quotation marks at the start and end, which allows the string to span multiple lines. The first line is conventionally a complete but short description of the function, while following lines provide further guidance to future users of the function.

A more complete definition of percent that includes a docstring appears below.


def percent(x):
    """Convert x to a percentage by multiplying by 100.

    Percentages are conventionally rounded to two decimal places,
    but precision is retained for any x above 1e-8 that would
    otherwise round to 0.

    >>> percent(1/6)
    16.67
    >>> percent(1/6000)
    0.02
    >>> percent(1/60000)
    0.0016666666666666668
    >>> percent(1/60000000000)
    0.0
    """
    if x < 1e-8:
        return 0.0
    elif x < 0.00005:
        return 100 * x
    else:
        return round(100 * x, 2)


Functions and Tables

In this section, we will continue to use the US Census data set. We will focus only on the 2014 population estimate.

census_2014 = census.select(['SEX', 'AGE', '2014']).where(census.column('AGE') != 999)

census_2014

SEX AGE 2014

0 0 3,948,350

0 1 3,962,123

0 2 3,957,772

0 3 4,005,190

0 4 4,003,448

0 5 4,004,858

0 6 4,134,352

0 7 4,154,000

0 8 4,119,524

0 9 4,106,832

... (293 rows omitted)

Functions can be used to compute new columns in a table based on existing column values. For example, we can transform the codes in the SEX column to strings that are easier to interpret.

def male_female(code):
    if code == 0:
        return 'Total'
    elif code == 1:
        return 'Male'
    elif code == 2:
        return 'Female'


This function takes an individual code — 0, 1, or 2 — and returns a string that describes its meaning.

male_female(0)

'Total'

male_female(1)

'Male'

male_female(2)

'Female'

We could also transform ages into age categories.

def age_group(age):
    if age < 2:
        return 'Baby'
    elif age < 13:
        return 'Child'
    elif age < 20:
        return 'Teen'
    else:
        return 'Adult'

age_group(15)

'Teen'

Apply. The apply method of a table calls a function on each element of a column, forming a new array of return values. To indicate which function to call, just name it (without quotation marks). The name of the column of input values must still appear within quotation marks.

census_2014.apply(male_female,'SEX')

array(['Total', 'Total', 'Total', ..., 'Female', 'Female', 'Female'], dtype='<U6')

This array, which has the same length as the original SEX column of the population table, can be used as the values in a new column called Male/Female alongside the existing AGE and Population columns.

population=Table().with_columns([

'Male/Female',census_2014.apply(male_female,'SEX'),

'AgeGroup',census_2014.apply(age_group,'AGE'),

'Population',census_2014.column('2014')

])

population

Male/Female AgeGroup Population

Total Baby 3948350

Total Baby 3962123

Total Child 3957772

Total Child 4005190

Total Child 4003448

Total Child 4004858

Total Child 4134352

Total Child 4154000

Total Child 4119524

Total Child 4106832

... (293 rows omitted)

Groups. The group method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.


population.group('AgeGroup')

AgeGroup count

Adult 243

Baby 6

Child 33

Teen 21

The optional second argument names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum will sum up the populations in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.

totals = population.where(0, 'Total').select(['AgeGroup', 'Population'])

totals

AgeGroup Population

Baby 3948350

Baby 3962123

Child 3957772

Child 4005190

Child 4003448

Child 4004858

Child 4134352

Child 4154000

Child 4119524

Child 4106832

... (91 rows omitted)

totals.group('AgeGroup',sum)


AgeGroup Population sum

Adult 236721454

Baby 7910473

Child 44755656

Teen 29469473

The groups method behaves in the same way, but accepts a list of columns as its first argument. The resulting table has one row for every unique combination of values that appear together in the grouped columns. Again, a single argument (a list, in this case) gives row counts.

population.groups(['Male/Female','AgeGroup'])

Male/Female AgeGroup count

Female Adult 81

Female Baby 2

Female Child 11

Female Teen 7

Male Adult 81

Male Baby 2

Male Child 11

Male Teen 7

Total Adult 81

Total Baby 2

... (2 rows omitted)

A second argument to groups aggregates all other columns that do not appear in the list of grouped columns.

population.groups(['Male/Female','AgeGroup'],sum)


Male/Female AgeGroup Population sum

Female Adult 121754366

Female Baby 3869363

Female Child 21903805

Female Teen 14393035

Male Adult 114967088

Male Baby 4041110

Male Child 22851851

Male Teen 15076438

Total Adult 236721454

Total Baby 7910473

... (2 rows omitted)

Pivot. The pivot method is closely related to the groups method: it groups together rows that share a combination of values. It differs from groups because it organizes the resulting values in a grid. The first argument to pivot is a column that contains the values that will be used to form new columns in the result. The second argument is a column used for grouping rows. The result gives the count of all rows that share the combination of column and row values.

population.pivot('Male/Female','AgeGroup')

AgeGroup Female Male Total

Adult 81 81 81

Baby 2 2 2

Child 11 11 11

Teen 7 7 7

An optional third argument indicates a column of values that will replace the counts in each cell of the grid. The fourth argument indicates how to aggregate all of the values that match the combination of column and row values.

pivoted = population.pivot('Male/Female', 'AgeGroup', 'Population', sum)

pivoted


AgeGroup Female Male Total

Adult 121754366 114967088 236721454

Baby 3869363 4041110 7910473

Child 21903805 22851851 44755656

Teen 14393035 15076438 29469473

The advantage of pivot is that it places grouped values into adjacent columns, so that they can be combined. For instance, this pivoted table allows us to compute the proportion of each age group that is male. We find the surprising result that younger age groups are predominantly male, but among adults there are substantially more females.

pivoted.with_column('MalePercentage', pivoted.column('Male') / pivoted.column('Total')).set_format('MalePercentage', PercentFormatter)

AgeGroup Female Male Total MalePercentage

Adult 121754366 114967088 236721454 48.57%

Baby 3869363 4041110 7910473 51.09%

Child 21903805 22851851 44755656 51.06%

Teen 14393035 15076438 29469473 51.16%
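The MalePercentage values can be checked by hand from the pivoted counts; for the Adult row:

```python
# Adult row of the pivoted table: male count and total across all sexes.
male, total = 114967088, 236721454
male_share = round(100 * male / total, 2)
print(male_share)   # percentage of adults who are male
```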


Chapter 2

Visualization

While we have been able to make several interesting observations about data by simply running our eyes down the numbers in a table, our task becomes much harder as tables grow larger. In data science, a picture can be worth a thousand numbers.

We saw an example of this earlier in the text, when we examined John Snow's map of cholera deaths in London in 1854.

Snow showed each death as a black mark at the location where the death occurred. In doing so, he plotted three variables — the number of deaths, as well as two coordinates for each location — in a single graph without any ornaments or flourishes. The simplicity of his presentation focuses the viewer's attention on his main point, which was that the deaths were centered around the Broad Street pump.

In 1869, a French civil engineer named Charles Joseph Minard created what is still considered one of the greatest graphs of all time. It shows the decimation of Napoleon's army during its retreat from Moscow. In 1812, Napoleon had set out to conquer Russia, with over 400,000 men in his army. They did reach Moscow, but were plagued by losses along the way; the Russian army kept retreating farther and farther into Russia, deliberately burning fields and destroying villages as it retreated. This left the French army without food or shelter as the brutal Russian winter began to set in. The French army turned back without a decisive victory in Moscow. The weather got colder, and more men died. Only 10,000 returned.

The graph is drawn over a map of eastern Europe. It starts at the Polish-Russian border at the left end. The light brown band represents Napoleon's army marching towards Moscow, and the black band represents the army returning. At each point of the graph, the width of the band is proportional to the number of soldiers in the army. At the bottom of the graph, Minard includes the temperatures on the return journey.

Notice how narrow the black band becomes as the army heads back. The crossing of the Berezina river was particularly devastating; can you spot it on the graph?

The graph is remarkable for its simplicity and power. In a single graph, Minard shows six variables:

the number of soldiers
the direction of the march
the two coordinates of location
the temperature on the return journey
the location on specific dates in November and December


Edward Tufte, Professor at Yale and one of the world's experts on visualizing quantitative information, says that Minard's graph is "probably the best statistical graphic ever drawn."

The technology of our times allows us to include animation and color. Used judiciously, without excess, these can be extremely informative, as in this animation by Gapminder.org of annual carbon dioxide emissions in the world over the past two centuries.

A caption says, "Click Play to see how USA becomes the largest emitter of CO2 from 1900 onwards." The graph of total emissions shows that China has higher emissions than the United States. However, in the graph of per capita emissions, the US is higher than China, because China's population is much larger than that of the United States.

Technology can be a help as well as a hindrance. Sometimes, the ability to create a fancy picture leads to a lack of clarity in what is being displayed. Inaccurate representation of numerical information, in particular, can lead to misleading messages.

Here, from the Washington Post in the early 1980s, is a graph that attempts to compare the earnings of doctors with the earnings of other professionals over a few decades. Do we really need to see two heads (one with a stethoscope) on each bar? Tufte coined the term "chartjunk" for such unnecessary embellishments. He also deplores the "low data-to-ink ratio" which this graph unfortunately possesses.


Most importantly, the horizontal axis of the graph is not drawn to scale. This has a significant effect on the shape of the bar graphs. When drawn to scale and shorn of decoration, the graphs reveal trends that are quite different from the apparently linear growth in the original. The elegant graph below is due to Ross Ihaka, one of the originators of the statistical system R.


Bar Charts

Categorical Data and Bar Charts

Now that we have examined several graphics produced by others, it is time to produce some of our own. We will start with bar charts, a type of graph with which you might already be familiar. A bar chart shows the distribution of a categorical variable, that is, a variable whose values are categories. In human populations, examples of categorical variables include gender, ethnicity, marital status, country of citizenship, and so on.

A bar chart consists of a sequence of rectangular bars, one corresponding to each category. The length of each bar is proportional to the number of entries in the corresponding category.

The Kaiser Family Foundation has compiled Census data on the distribution of race and ethnicity in the United States.

http://kff.org/other/state-indicator/distribution-by-raceethnicity/

http://kff.org/other/state-indicator/children-by-raceethnicity/

Here are some of the data, arranged by state. The table children contains data for people who were younger than 18 years old in 2014; everyone contains the data for people of all ages in 2014. Some missing values in these tables are represented as 0.

everyone=Table.read_table('kaiser_ethnicity_everyone.csv')

children=Table.read_table('kaiser_ethnicity_children.csv')

children.set_format(2,NumberFormatter)

everyone.set_format(2,NumberFormatter)


State Race/Ethnicity Population

Alabama White 3,167,600

Alabama Black 1,269,200

Alabama Hispanic 191,000

Alabama Asian 77,300

Alabama American Indian/Alaska Native 0

Alabama Two Or More Races 56,100

Alabama Total 4,768,000

Alaska White 396,400

Alaska Black 17,000

Alaska Hispanic 59,200

... (347 rows omitted)

The method barh takes two arguments, a column containing categories and a column containing numeric quantities. It generates a horizontal bar chart. This chart focuses on California.

everyone.where('State', 'California').barh('Race/Ethnicity', 'Population')

The table of children also contains information about California, although there are fewer categories of race/ethnicity.

children.where('State','California')


State Race/Ethnicity Population

California White 2,786,000

California Black 484,200

California Hispanic 4,849,400

California Other 1,590,100

California Total 9,709,700

Again, we can display these data in a horizontal bar chart. The barh method can take column indices as well as labels. This chart looks similar, but involves fewer categories because the Kaiser Family Foundation only provides these data about children.

children.where('State', 'California').barh(1, 2)

It's not obvious how to compare these two charts. To help us draw conclusions, we will combine these two tables into one, which contains both everyone and children as separate columns. This transformation will require several steps. First, to ensure that values are comparable, we will divide each population by the total for the state to compute state proportions.

totals = everyone.join('State', everyone.where(1, 'Total').relabeled(2, 'State Total').select([0, 2]))

totals


State Race/Ethnicity Population State Total

Alabama White 3167600 4768000
Alabama Black 1269200 4768000
Alabama Hispanic 191000 4768000
Alabama Asian 77300 4768000
Alabama American Indian/Alaska Native 0 4768000
Alabama Two Or More Races 56100 4768000
Alabama Total 4768000 4768000
Alaska White 396400 695700
Alaska Black 17000 695700
Alaska Hispanic 59200 695700

... (347 rows omitted)

proportions = everyone.select([0, 1]).with_column('Proportion', totals.column(2) / totals.column(3))

proportions.set_format(2, PercentFormatter)

State Race/Ethnicity Proportion

Alabama White 66.43%
Alabama Black 26.62%
Alabama Hispanic 4.01%
Alabama Asian 1.62%
Alabama American Indian/Alaska Native 0.00%
Alabama Two Or More Races 1.18%
Alabama Total 100.00%
Alaska White 56.98%
Alaska Black 2.44%
Alaska Hispanic 8.51%

... (347 rows omitted)

This same transformation can be applied to the children table. Below, we chain together several table methods in order to express the transformation in a single long line. Typically, it's best to express transformations in pieces (above), but programming languages are quite flexible in allowing programmers to combine expressions (below).

children_proportions = children.select([0, 1]).with_column(
    'Children Proportion', children.column(2) / children.join('State',
        children.where(1, 'Total').relabeled(2, 'Children Total').select([0, 2])).column(3))

children_proportions.set_format(2, PercentFormatter)

State Race/Ethnicity Children Proportion

Alabama White 59.99%
Alabama Black 30.50%
Alabama Hispanic 6.41%
Alabama Other 3.09%
Alabama Total 100.00%
Alaska White 46.00%
Alaska Black 2.60%
Alaska Hispanic 11.99%
Alaska Other 39.36%
Alaska Total 100.00%

... (245 rows omitted)

The following function takes the name of a state. It joins together the proportions for everyone and children into a single table. When the join method combines two columns with different values, only the matching elements are retained. In this case, any race/ethnicity not represented in both tables will be excluded.

def compare(state):
    prop_for_state = proportions.where('State', state).drop(0)
    children_prop_for_state = children_proportions.where('State', state).drop(0)
    joined = prop_for_state.join('Race/Ethnicity', children_prop_for_state)
    joined.set_format([1, 2], PercentFormatter)
    return joined.sort('Proportion')

compare('California')


Race/Ethnicity Proportion Children Proportion

Black 5.34% 4.99%
Hispanic 38.21% 49.94%
White 39.20% 28.69%
Total 100.00% 100.00%
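The join's keep-only-matching behavior can be sketched with plain dictionaries. This is a minimal illustration, not the book's code; the numbers loosely echo the California rows, and the categories 'Asian' and 'Other' each appear in only one mapping, so they drop out just as unmatched race/ethnicity rows do.

```python
# Sketch of inner-join semantics: only keys present in BOTH mappings
# survive, mirroring how Table.join drops unmatched rows.
everyone_props = {'White': 0.392, 'Hispanic': 0.382, 'Asian': 0.146, 'Black': 0.053}
children_props = {'White': 0.287, 'Hispanic': 0.499, 'Other': 0.164, 'Black': 0.050}

joined = {k: (everyone_props[k], children_props[k])
          for k in everyone_props if k in children_props}
print(sorted(joined))   # 'Asian' and 'Other' are gone
```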

The barh method can display the data from multiple columns on a single chart. When called with only one argument indicating the column of categories, all other columns are displayed as sets of bars.

compare('California').barh('Race/Ethnicity')

This chart shows the changing composition of the state of California. In 2014, fewer than 40% of Californians were Hispanic. But 50% of the children were, which is the crucial ingredient in predictions that Hispanics will eventually be a majority in California.

What about other states? Our function allows quick visual comparisons for any state.

compare('Minnesota').barh('Race/Ethnicity')


compare('Vermont').barh('Race/Ethnicity')


Histograms

Quantitative Data and Histograms

Many of the variables that data scientists study are quantitative. For instance, we can study the amount of revenue earned by movies in recent decades. Our source is the Internet Movie Database, an online database that contains information about movies, television shows, video games, and so on.

The table top consists of the U.S.A.'s top grossing movies (http://www.boxofficemojo.com/alltime/adjusted.htm) of all time. The first column contains the title of the movie; Star Wars: The Force Awakens has the top rank, with a box office gross amount of more than 900 million dollars in the United States. The second column contains the name of the studio; the third contains the U.S. box office gross in dollars; the fourth contains the gross amount that would have been earned from ticket sales at 2016 prices; and the fifth contains the release year of the movie.

There are 200 movies on the list. Here are the top ten.

top = Table.read_table('top_movies.csv')
top.set_format([2, 3], NumberFormatter)


Title Studio Gross Gross (Adjusted) Year

Star Wars: The Force Awakens Buena Vista (Disney) 906,723,418 906,723,400 2015
Avatar Fox 760,507,625 846,120,800 2009
Titanic Paramount 658,672,302 1,178,627,900 1997
Jurassic World Universal 652,270,625 687,728,000 2015
Marvel's The Avengers Buena Vista (Disney) 623,357,910 668,866,600 2012
The Dark Knight Warner Bros. 534,858,444 647,761,600 2008
Star Wars: Episode I - The Phantom Menace Fox 474,544,677 785,715,000 1999
Star Wars Fox 460,998,007 1,549,640,500 1977
Avengers: Age of Ultron Buena Vista (Disney) 459,005,868 465,684,200 2015
The Dark Knight Rises Warner Bros. 448,139,099 500,961,700 2012

... (190 rows omitted)

Three-digit numbers are easier to work with than nine-digit numbers. So, we will build a table called millions in which the adjusted gross box office revenue is expressed in millions of dollars.

millions = top.select(0).with_column('Gross', np.round(top.column(3) / 1e6, 2))

millions


Title Gross

Star Wars: The Force Awakens 906.72
Avatar 846.12
Titanic 1178.63
Jurassic World 687.73
Marvel's The Avengers 668.87
The Dark Knight 647.76
Star Wars: Episode I - The Phantom Menace 785.72
Star Wars 1549.64
Avengers: Age of Ultron 465.68
The Dark Knight Rises 500.96

... (190 rows omitted)

The hist method generates a histogram of the values in a column. The optional unit argument is used to label the axes.

millions.hist('Gross', unit="Million Dollars")

The figure above shows the distribution of the amounts grossed, in millions of dollars. The amounts have been grouped into contiguous intervals called bins. Although in this dataset no movie grossed an amount that is exactly on the edge between two bins, it is worth noting that hist has an endpoint convention: bins include the data at their left endpoint, but not the data at their right endpoint. Sometimes, adjustments have to be made in the first or last bin, to ensure that the smallest and largest values of the variable are included. You saw an example of such an adjustment in the Census data used in the Tables section, where an age of "100" years actually meant "100 years old or older."
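The endpoint convention can be checked with numpy's digitize function. This is a sketch of the [left, right) rule, not the internals of hist: a value that sits exactly on an edge counts in the bin that starts at that edge.

```python
import numpy as np

# Each bin is [left, right), so a value exactly on an edge belongs
# to the bin that starts there.
edges = np.arange(300, 2001, 100)    # 300, 400, ..., 2000

i = np.digitize(400, edges)          # index i with edges[i-1] <= 400 < edges[i]
left_edge = edges[i - 1]
print(left_edge)                     # 400: the value falls in the [400, 500) bin
```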

We can see that there are 10 bins (some bars are so low that they are hard to see), and that they all have the same width. We can also see that the list contains no movie that grossed less than 300 million dollars; that is because we are considering only the top grossing movies of all time. It is a little harder to see exactly where the edges of the bins are placed. For example, it is not clear exactly where the value 500 lies on the horizontal axis, and so it is hard to judge exactly where the first bar ends and the second begins.

The optional argument bins can be used with hist to specify the edges of the bars. It must consist of a sequence of numbers that includes the left end of the first bar and the right end of the last bar. As the highest gross amount is somewhat over 1750 on the horizontal scale, we will start by setting bins to be the array consisting of the numbers 300, 400, 500, and so on, ending with 2000.

millions.hist('Gross', unit="Million Dollars", bins=np.arange(300, 2001, 100))

This figure is easier to read. On the horizontal axis, the tick labels are centered at the values they mark. The tallest bar is for movies that grossed between 300 million and 400 million (adjusted) dollars; the bar for 400 to 500 million is about 5/8 as high. The height of each bar is proportional to the number of movies that fall within the x-axis range of the bar.

A very small number grossed more than 700 million dollars. This results in the figure being "skewed to the right," or, less formally, having "a long right hand tail." Distributions of variables like income or rent often have this kind of shape.


The counts of values in different ranges can be computed from a table using the bin method, which takes a column label or index and an optional sequence or number of bins. The result is a tabular form of a histogram; it lists the counts of all values in the 'Gross' column that are greater than or equal to the indicated bin value, but less than the next bin's starting value.

bins = millions.bin('Gross', bins=np.arange(300, 2001, 100))

bins

bin Gross count

300 81
400 52
500 28
600 16
700 7
800 5
900 3
1000 1
1100 3
1200 2

... (8 rows omitted)

The vertical axis of a histogram is the density scale. The height of each bar is the proportion of elements that fall into the corresponding bin, divided by the width of the bin. For example, the height of the bin from 300 to 400, which contains 81 of the 200 movies, is

(81/200) / (100 million dollars) = 0.405% per million dollars.

starts = bins.column(0)
counts = bins.column(1)

counts.item(0) / sum(counts) / (starts.item(1) - starts.item(0))

0.0040500000000000006


The total area of this bar is 0.405, 100 times greater because its width is 100. This area is the proportion of all values in the Gross column that fall into the 300-400 bin. The area of all bars in a histogram sums to 1.
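This bookkeeping can be verified with a short computation. The bin counts below are hypothetical (chosen so that all 200 values land in some bin), not read from the table above:

```python
import numpy as np

# Hypothetical counts for bins [300, 400), [400, 600), [600, 1500).
edges = np.array([300, 400, 600, 1500])
counts = np.array([81, 80, 39])

proportions = counts / counts.sum()   # each bar's share of the data
widths = np.diff(edges)               # 100, 200, 900
heights = proportions / widths        # the density scale

areas = heights * widths              # area of each bar
print(areas.sum())                    # 1.0
```

The first bar's height works out to (81/200)/100 = 0.00405, matching the computation shown above.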

A histogram may contain bins of unequal size, but the area of all bars will still sum to 1. Below, the values in the Gross column are binned into four categories.

millions.hist('Gross', unit="Million Dollars", bins=[300, 400, 600, 1500])

Although the ranges 300-400 and 400-600 have nearly identical counts, the former is twice as tall as the latter because it is only half as wide. The density of values in that range is twice as high. Histograms display in visual form where on the number line the values are most concentrated.

millions.bin('Gross', bins=[300, 400, 600, 1500])

bin Gross count

300 81
400 80
600 37
1500 0

Histogram of Counts. It is possible to display counts directly in a chart, using the normed=False option of the hist method. The resulting chart has the same shape when bins all have equal widths, but the bar areas do not sum to 1.


millions.hist('Gross', bins=np.arange(300, 2001, 100), normed=False)

While the count scale is perhaps more natural to interpret than the density scale, the chart becomes highly misleading when bins have different widths. Below, it appears (due to the count scale) that high-grossing movies are quite common, when in fact we have seen that they are relatively rare.

millions.hist('Gross', bins=[300, 400, 500, 600, 1500], normed=False)

Even though the method used is called hist, the figure above is NOT a histogram. It gives the impression that movies are evenly distributed in the 300-600 range, but they are not. Moreover, it exaggerates the density of movies grossing more than 600 million. The height of each bar is simply the number of movies in the bin, without accounting for the difference in the widths of the bins. The picture becomes even more skewed if the last two bins are combined.

millions.hist('Gross', bins=[300, 400, 1500], normed=False)

In this count-based histogram, the shape of the distribution of movies is lost entirely.

So what is a histogram?

The figure above shows that what the eye perceives as "big" is area, not just height. This is particularly important when the bins are of different widths.

That is why a histogram has two defining properties:

1. The bins are contiguous (though some might be empty) and are drawn to scale.
2. The area of each bar is proportional to the number of entries in the bin.

Property 2 is the key to drawing a histogram, and is usually achieved as follows:

height of bar = proportion of entries in bin / width of bin

When drawn using this method, the histogram is said to be drawn on the density scale, and the total area of the bars is equal to 1.

To calculate the height of each bar, use the fact that the bar is a rectangle:

area of bar = height of bar x width of bin

and so

height of bar = area of bar / width of bin = proportion of entries in bin / width of bin
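This is the same rule that numpy's histogram function applies when asked for densities. A cross-check on a handful of hypothetical values (not the movie data):

```python
import numpy as np

data = np.array([310, 350, 390, 450, 700, 1200])   # hypothetical values
edges = [300, 400, 600, 1500]

counts, _ = np.histogram(data, bins=edges)
manual = counts / counts.sum() / np.diff(edges)    # proportion / width
density, _ = np.histogram(data, bins=edges, density=True)

print(np.allclose(manual, density))                # True
```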


The level of detail, and the flat tops of the bars

Even though the density scale correctly represents proportions using area, some detail is lost by placing different values into the same bins.

Take another look at the [300, 400) bin in the figure below. The flat top of the bar, at a level near 0.4%, hides the fact that the movies are somewhat unevenly distributed across the bin.

millions.hist('Gross', bins=[300, 400, 600, 1500])

To see this, let us split the [300, 400) bin into ten narrower bins of width 10 million dollars each:

millions.hist('Gross', bins=[300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400])


Some of the skinny bars are taller than 0.4% and others are shorter. By putting a flat top at 0.4% over the whole bin, we are deciding to ignore the finer detail and use the flat level as a rough approximation. Often, though not always, this is sufficient for understanding the general shape of the distribution.

Notice that because we have the entire dataset, we can draw the histogram in as fine a level of detail as the data and our patience will allow. However, if you are looking at a histogram in a book or on a website, and you don't have access to the underlying dataset, then it becomes important to have a clear understanding of the "rough approximation" of the flat tops.

The density scale

The height of each bar is a proportion divided by a bin width. Thus, for this dataset, the values on the vertical axis are "proportions per million dollars." To understand this better, look again at the [300, 400) bin. The bin is 100 million dollars wide. So we can think of it as consisting of 100 narrow bins that are each 1 million dollars wide. The bar's height of roughly "0.004 per million dollars" means that in each of those 100 skinny bins of width 1 million dollars, the proportion of movies is roughly 0.004.

Thus the height of a histogram bar is a proportion per unit on the horizontal axis, and can be thought of as the density of entries per unit width.

millions.hist('Gross', bins=[300, 350, 400, 450, 1500])


Density Q&A

Look again at the histogram, and this time compare the [400, 450) bin with the [450, 1500) bin.

Q: Which has more movies in it?

A: The [450, 1500) bin. It has 92 movies, compared with 25 movies in the [400, 450) bin.

Q: Then why is the [450, 1500) bar so much shorter than the [400, 450) bar?

A: Because height represents density per unit width, not the number of movies in the bin. The [450, 1500) bin has more movies than the [400, 450) bin, but it is also a whole lot wider. So the density is much lower.

millions.bin('Gross', bins=[300, 350, 400, 450, 1500])

bin Gross count

300 32
350 49
400 25
450 92
1500 0

Bar chart or histogram?

Bar charts display the distributions of categorical variables. All the bars in a bar chart have the same width. The lengths (or heights, if the bars are drawn vertically) of the bars are proportional to the number of entries.


Histograms display the distributions of quantitative variables. The bars can have different widths. The areas of the bars are proportional to the number of entries.

Multiple histograms

In all the examples in this section, we have drawn a single histogram. However, if a data table contains several columns, then barh and hist can be used to draw several charts at once. We will cover this feature in a later section.


Chapter 3

Sampling

Sampling is a powerful computational technique in data analysis. It allows us to investigate both issues of probability and statistics. Probability is the study of the outcomes of random processes: processes that include chance. Statistics is just the opposite: it is the study of random processes given observations of their outcomes. Put another way, probability focuses on what will be observed given what we know about the world. Statistics focuses on what we can infer about the world given what we have observed. Random sampling is a central tool in both directions of inference.


In this section, we will continue to use the top_movies.csv dataset.

top = Table.read_table('top_movies.csv')
top.set_format([2, 3], NumberFormatter)

Title Studio Gross Gross (Adjusted) Year

Star Wars: The Force Awakens Buena Vista (Disney) 906,723,418 906,723,400 2015
Avatar Fox 760,507,625 846,120,800 2009
Titanic Paramount 658,672,302 1,178,627,900 1997
Jurassic World Universal 652,270,625 687,728,000 2015
Marvel's The Avengers Buena Vista (Disney) 623,357,910 668,866,600 2012
The Dark Knight Warner Bros. 534,858,444 647,761,600 2008
Star Wars: Episode I - The Phantom Menace Fox 474,544,677 785,715,000 1999
Star Wars Fox 460,998,007 1,549,640,500 1977
Avengers: Age of Ultron Buena Vista (Disney) 459,005,868 465,684,200 2015
The Dark Knight Rises Warner Bros. 448,139,099 500,961,700 2012

... (190 rows omitted)

Sampling rows of a table

Deterministic Samples

When you simply specify which elements of a set you want to choose, without any chances involved, you create a deterministic sample.


A deterministic sample from the rows of a table can be constructed using the take method. Its argument is a sequence of integers, and it returns a table containing the corresponding rows of the original table.

The code below returns a table consisting of the rows indexed 3, 18, and 100 in the table top. Since the original table is sorted by rank, which begins at 1, the zero-indexed rows always have a rank that is one greater than the index. However, the take method does not inspect the information in the rows to construct its sample.

top.take([3, 18, 100])

Title Studio Gross Gross (Adjusted) Year

Jurassic World Universal 652,270,625 687,728,000 2015
Spider-Man Sony 403,706,375 604,517,300 2002
Gone with the Wind MGM 198,676,459 1,757,788,200 1939

We can select evenly spaced rows by calling np.arange and passing the result to take. In the example below, we start with the first row (index 0) of top, and choose every 40th row after that until we reach the end of the table.

top.take(np.arange(0, top.num_rows, 40))

Title Studio Gross Gross (Adjusted) Year

Star Wars: The Force Awakens Buena Vista (Disney) 906,723,418 906,723,400 2015
Shrek the Third Paramount/Dreamworks 322,719,944 408,090,600 2007
Bruce Almighty Universal 242,829,261 350,350,700 2003
Three Men and a Baby Buena Vista (Disney) 167,780,960 362,822,900 1987
Saturday Night Fever Paramount 94,213,184 353,261,200 1977

Probability Samples

Much of data science consists of making conclusions based on the data in random samples. Correctly interpreting analyses based on random samples requires data scientists to examine exactly what random samples are.


A population is the set of all elements from whom a sample will be drawn.

A probability sample is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.

In a probability sample, all elements need not have the same chance of being chosen. For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:

Person A is chosen with probability 1. One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.

This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:

A: 1
B: 1/2
C: 1/2
AB: 1/2
AC: 1/2
BC: 0
ABC: 0

Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.
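These chances can be confirmed by simulating the scheme directly. This is a sketch (the seed and trial count are arbitrary choices, not from the text):

```python
import numpy as np

# Simulate the scheme: A is always in the sample; a coin toss picks B or C.
rng = np.random.default_rng(0)
trials = 100_000
tosses = rng.random(trials) < 0.5    # True means B was chosen, else C

prop_B = np.count_nonzero(tosses) / trials
prop_C = 1 - prop_B
print(round(prop_B, 2), round(prop_C, 2))   # both near 1/2; A's chance is exactly 1
```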

To draw a probability sample, we need to be able to choose elements according to a process that involves chance. A basic tool for this purpose is a random number generator. There are several in Python. Here we will use one that is part of the module random, which in turn is part of the module numpy.

The method randint, when given two arguments low and high, returns an integer picked uniformly at random between low and high, including low but excluding high. Run the code below several times to see the variability in the integers that are returned.

np.random.randint(3, 8)  # select once at random from 3, 4, 5, 6, 7

7

A Systematic Sample


Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a systematic sample.

Here we will choose a systematic sample of the rows of top. We will start by picking one of the first 10 rows at random, and then we will pick every 10th row after that.

"""Choosearandomstartamongrows0through9;

thentakeevery10throw."""

start=np.random.randint(0,10)

top.take(np.arange(start,top.num_rows,10))

Title Studio Gross Gross (Adjusted) Year

Star Wars Fox 460,998,007 1,549,640,500 1977
The Hunger Games Lionsgate 408,010,692 442,510,400 2012
The Passion of the Christ NM 370,782,930 519,432,100 2004
Alice in Wonderland (2010) Buena Vista (Disney) 334,191,110 365,718,600 2010
Star Wars: Episode II - Attack of the Clones Fox 310,676,740 465,175,700 2002
The Sixth Sense Buena Vista (Disney) 293,506,292 500,938,400 1999
The Hangover Warner Bros. 277,322,503 323,343,300 2009
Raiders of the Lost Ark Paramount 248,159,971 770,183,000 1981
The Lost World: Jurassic Park Universal 229,086,679 434,216,600 1997
Austin Powers: The Spy Who Shagged Me New Line 206,040,086 352,863,900 1999

... (10 rows omitted)

Run the code a few times to see how the output varies.

This systematic sample is a probability sample. In this scheme, all rows do have the same chance of being chosen. But that is not true of other subsets of the rows. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen. The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance 1/10.
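A quick simulation confirms the 1/10 inclusion chance for any single row. This is a sketch of the scheme above; the particular row index and seed are arbitrary:

```python
import numpy as np

# Each run picks a random start in 0..9; row r enters the sample
# exactly when r % 10 == start.
rng = np.random.default_rng(0)
trials = 10_000
row = 57                               # an arbitrary row index

starts = rng.integers(0, 10, size=trials)
prop = np.count_nonzero(starts == row % 10) / trials
print(round(prop, 1))                  # close to 0.1
```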

Random sample with replacement

Some of the simplest probability samples are formed by drawing repeatedly, uniformly at random, from the list of elements of the population. If the draws are made without changing the list between draws, the sample is called a random sample with replacement. You can imagine making the first draw at random, replacing the element drawn, and then drawing again.

In a random sample with replacement, each element in the population has the same chance of being drawn, and each can be drawn more than once in the sample.

If you want to draw a sample of people at random to get some information about a population, you might not want to sample with replacement: drawing the same person more than once can lead to a loss of information. But in data science, random samples with replacement arise in two major areas:

studying probabilities by simulating tosses of a coin, rolls of a die, or gambling games

creating new samples from a sample at hand

The second of these areas will be covered later in the course. For now, let us study some long-run properties of probabilities.

We will start with the table die, which contains the numbers of spots on the faces of a die. All the numbers appear exactly once, as we are assuming that the die is fair.

die = Table().with_column('Face', [1, 2, 3, 4, 5, 6])

die

Face

1

2

3

4

5

6

Drawing the histogram of this simple set of numbers yields an unsettling figure, as hist makes a default choice of bins:


die.hist()

The numbers 1, 2, 3, 4, 5, 6 are integers, so the bins chosen by hist only have entries at the edges. In such a situation, it is a better idea to select bins so that they are centered on the integers. This is often true of histograms of data that are discrete, that is, variables whose successive values are separated by the same fixed amount. The real advantage of this method of bin selection will become more clear when we start imagining smooth curves drawn over histograms.

dice_bins = np.arange(0.5, 7, 1)
die.hist(bins=dice_bins)

Notice how each bin has width 1 and is centered on an integer. Notice also that because the width of each bin is 1, the height of each bar is 1/6, the chance that the corresponding face appears.


The histogram shows the probability with which each face appears. It is called a probability histogram of the result of one roll of a die.

The histogram was drawn without rolling any dice or generating any random numbers. We will now use the computer to mimic actually rolling a die. The process of using a computer program to produce the results of a chance experiment is called simulation.

To roll the die, we will use a method called sample. This method returns a new table consisting of rows selected uniformly at random from a table. Its first argument is the number of rows to be returned. Its second argument is whether or not the sampling should be done with replacement.

The code below simulates 10 rolls of the die. As with all simulations of chance experiments, you should run the code several times and notice the variability in what is returned.

die.sample(10, with_replacement=True)

Face

5

4

2

3

6

5

1

1

6

2

We will now roll the die several times and draw the histogram of the observed results. The histogram of observed results is called an empirical histogram.

The results of rolling the die will all be integers in the range 1 through 6, so we will want to use the same bins as we used for the probability histogram. To avoid writing out the same bin argument every time we draw a histogram, let us define a function called dice_hist that will perform the task for us. The function will take one argument: the number n of dice to roll.


def dice_hist(n):
    rolls = die.sample(n, with_replacement=True)
    rolls.hist(bins=dice_bins)

dice_hist(20)

As we increase the number of rolls in the simulation, the proportions get closer to 1/6.

dice_hist(10000)

The behavior we have observed is an instance of a general rule.

The Law of Averages


If a chance experiment is repeated independently and under identical conditions, then, in the long run, the proportion of times that an event occurs gets closer and closer to the theoretical probability of the event.

For example, in the long run, the proportion of times the face with four spots appears gets closer and closer to 1/6.

Here"independentlyandunderidenticalconditions"meansthateveryrepetitionisperformedinthesamewayregardlessoftheresultsofalltheotherrepetitions.

Convergence of empirical histograms

We have also observed that a random quantity (such as the number of spots on one roll of a die) is associated with two histograms:

a probability histogram, that shows all the possible values of the quantity and all their chances

an empirical histogram, created by simulating the random quantity repeatedly and drawing a histogram of the observed results

We have seen an example of the long-run behavior of empirical histograms:

As the number of repetitions increases, the empirical histogram of a random quantity looks more and more like the probability histogram.

At the Roulette Table

Equipped with our new knowledge about the long-run behavior of chances, let us explore a gambling game. Betting on roulette is popular in gambling centers such as Las Vegas and Monte Carlo, and we will simulate one of the bets here.

The main randomizer in roulette in Nevada is a wheel that has 38 pockets on its rim. Two of the pockets are green, eighteen black, and eighteen red. The wheel is on a spindle, and there is a small ball on it. When the wheel is spun, the ball ricochets around and finally comes to rest in one of the pockets. That is declared to be the winning pocket.

You are allowed to bet on several pre-specified collections of pockets. If you bet on "red," you win if the ball comes to rest in one of the red pockets.

The bet pays even money. That is, it pays 1 to 1. To understand what that means, assume you are going to bet $1 on "red." The first thing that happens, even before the wheel is spun, is that you have to hand over your $1. If the ball lands in a green or black pocket, you never see that dollar again. If the ball lands in a red pocket, you get your dollar back (to bring you back to even), plus another $1 in winnings.

The table wheel represents the pockets of a Nevada roulette wheel. It has 38 rows labeled 1 through 38, one row per pocket.

pockets = np.arange(1, 39)
colors = (['red', 'black']*5 + ['black', 'red']*4)*2 + ['green', 'green']

wheel = Table().with_columns([
    'pocket', pockets,
    'color', colors])

wheel

pocket color

1 red
2 black
3 red
4 black
5 red
6 black
7 red
8 black
9 red
10 black

... (28 rows omitted)

The function bet_on_red takes a numerical argument x and returns the net winnings on a $1 bet on "red," provided x is the number of a pocket.

def bet_on_red(x):
    """The net winnings of betting on red for outcome x."""
    pockets = wheel.where('pocket', x)
    if pockets.column('color').item(0) == 'red':
        return 1
    else:
        return -1


bet_on_red(17)

-1

The function spins takes a numerical argument n and returns a new table consisting of n rows of wheel sampled at random with replacement. In other words, it simulates the results of n spins of the roulette wheel.

def spins(n):
    return wheel.sample(n, with_replacement=True)

We will create a table called results consisting of the outcomes of 500 spins, with an added column that shows the net winnings of $1 placed on "red" for each spin. Recall that the apply method applies a function to each element in a column of a table.

outcomes = spins(500)
results = outcomes.with_column(
    'winnings', outcomes.apply(bet_on_red, 'pocket'))

results

pocket color winnings

13 black -1
22 black -1
38 green -1
11 black -1
29 black -1
19 red 1
38 green -1
37 green -1
19 red 1
31 black -1

... (490 rows omitted)

And here is the net gain on all 500 bets:


sum(results.column('winnings'))

-32

Betting $1 on red hundreds of times seems like a bad idea from a gambler's perspective. But from the casinos' perspective it is excellent. Casinos rely on large numbers of bets being placed. The payoff odds are set so that the more bets that are placed, the more money the casinos are likely to make, even though a few people are likely to go home with winnings.

Simple Random Sample: a Random Sample without Replacement

A random sample without replacement is one in which elements are drawn from a list repeatedly, uniformly at random, at each stage deleting from the list the element that was drawn.

A random sample without replacement is also called a simple random sample. All elements of the population have the same chance of entering a simple random sample. All pairs have the same chance as each other, as do all triples, and so on.

The default action of sample is to draw without replacement. In card games, cards are almost always dealt without replacement. Let us use sample to deal cards from a deck.

A standard deck consists of 13 ranks of cards in each of four suits. The suits are called spades, clubs, diamonds, and hearts. The ranks are Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, and King. Spades and clubs are black; diamonds and hearts are red. The Jacks, Queens, and Kings are called 'face cards.'

The table deck contains all 52 cards, one per row, with columns labeled rank and suit. The abbreviations are:

Ace: A, Jack: J, Queen: Q, King: K

Below, we use the product function to produce all possible pairs of a suit and a rank. The suits are represented by special characters used to print bridge and poker articles in newspapers.


from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['♠', '♥', '♦', '♣']

cards = product(ranks, suits)
deck = Table(['rank', 'suit']).with_rows(cards)

deck

rank suit

A ♠

A ♥

A ♦

A ♣

2 ♠

2 ♥

2 ♦

2 ♣

3 ♠

3 ♥

... (42 rows omitted)

A poker hand is 5 cards dealt at random from the deck. The expression below deals a hand. Execute the cell a few times; how often are you dealt a pair?

deck.sample(5)

rank suit

5 ♥

4 ♣

9 ♥

3 ♣

10 ♣


A hand is a set of five cards, regardless of the order in which they appeared. For example, the hand 9♣ 9♥ Q♣ 7♣ 8♥ is the same as the hand 7♣ 9♣ 8♥ 9♥ Q♣.

This fact can be used to show that simple random sampling can be thought of in two equivalent ways:

drawing elements one by one at random without replacement

randomly permuting (that is, shuffling) the whole list, and then pulling out a set of elements at the same time
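The equivalence can be sketched with numpy, using a hypothetical 52-element population of integers in place of the deck: drawing without replacement and shuffling-then-slicing give any fixed element the same chance (5/52 here) of entering the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(52)
trials = 20_000

in_draw = in_shuffle = 0
for _ in range(trials):
    drawn = rng.choice(population, 5, replace=False)    # draw one by one
    shuffled = rng.permutation(population)[:5]          # shuffle, then slice
    in_draw += 0 in drawn
    in_shuffle += 0 in shuffled

print(round(in_draw / trials, 2), round(in_shuffle / trials, 2))  # both near 5/52
```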


Iteration

Repetition

It is often the case when programming that you will wish to repeat the same operation multiple times, perhaps with slightly different behavior each time. You could copy-paste the code 10 times, but that's tedious and prone to typos, and if you wanted to do it a thousand times (or a million times), forget it.

A better solution is to use a for statement to loop over the contents of a sequence. A for statement begins with the word for, followed by a name for the item in the sequence, followed by the word in, and ending with an expression that evaluates to a sequence. The indented body of the for statement is executed once for each item in that sequence.

for i in np.arange(5):
    print(i)

0

1

2

3

4

A typical use of a for statement is to build up a table by repeating a random computation many times and storing each result as a new row. The append method of a table takes a sequence and adds a new row. It's different from with_row because a new table is not created; instead, the original table is extended. The cell below draws 100 cards, but keeps only the aces.


aces = Table(['Rank', 'Suit'])

for i in np.arange(100):
    card = deck.row(np.random.randint(deck.num_rows))
    if card.item(0) == 'A':
        aces.append(card)

aces

Rank Suit

A ♠

A ♣

A ♥

A ♦

A ♠

A ♦

This pattern can be used to track the results of repeated experiments. For example, perhaps we want to learn about the empirical properties of some randomly drawn poker hands. Below, we track whether the hand contains four-of-a-kind or five cards of the same suit (called a flush).

hands = Table(['Four-of-a-kind', 'Flush'])

for i in np.arange(10000):
    hand = deck.sample(5)
    four_of_a_kind = max(hand.group('rank').column('count')) == 4
    flush = max(hand.group('suit').column('count')) == 5
    hands.append([four_of_a_kind, flush])

hands


Four-of-a-kind Flush

False False

False False

False False

False False

False False

False False

False False

False False

False False

False False

... (9990 rows omitted)

A for statement can also iterate over a sequence of labels. We can use this feature to summarize the results of our experiment. These are rare hands indeed!

for label in hands.labels:
    success = np.count_nonzero(hands.column(label))
    print('A', label, 'was drawn', success, 'of', hands.num_rows, 'times')

A Four-of-a-kind was drawn 2 of 10000 times

A Flush was drawn 22 of 10000 times

Randomized response

Next, we'll look at a technique that was designed several decades ago to help conduct surveys of sensitive subjects. Researchers wanted to ask participants a few questions: Have you ever had an affair? Do you secretly think you are gay? Have you ever shoplifted? Have you ever sung a Justin Bieber song in the shower? They figured that some people might not respond honestly, because of the social stigma associated with answering "yes". So, they came up with a clever way to estimate the fraction of the population who are in the "yes" camp, without violating anyone's privacy.


Here's the idea. We'll instruct the respondent to roll a fair 6-sided die, secretly, where no one else can see it. If the die comes up 1, 2, 3, or 4, then the respondent is supposed to answer honestly. If it comes up 5 or 6, the respondent is supposed to answer the opposite of what their true answer would be. But, they shouldn't reveal what came up on their die.

Notice how clever this is. Even if the person says "yes", that doesn't necessarily mean their true answer is "yes" -- they might very well have just rolled a 5 or 6. So the responses to the survey don't reveal any one individual's true answer. Yet, in aggregate, the responses give enough information that we can get a pretty good estimate of the fraction of people whose true answer is "yes".

Let's try a simulation, so we can see how this works. We'll write some code to perform this operation. First, a function to simulate rolling one die:

def roll_once():
    return np.random.randint(1, 7)

Now we'll use this to write a function to simulate how someone is supposed to respond to the survey. The argument to the function is their true answer (True or False); the function returns what they're supposed to tell the interviewer.

def respond(true_answer):
    if roll_once() >= 5:
        return not true_answer
    else:
        return true_answer

We can try it. Assume our true answer is 'no'; let's see what happens this time:

respond(False)

False

Of course, if you were to run it many times, you might get a different result each time. Below, we build a table of the responses for many responses when the true answer is always False.


responses = Table(['Truth', 'Response'])

for i in np.arange(1000):
    responses.append([False, respond(False)])

responses

Truth Response

False False

False False

False False

False False

False False

False False

False True

False False

False True

False False

... (990 rows omitted)

Let's build a bar chart and look at how many True and False responses we get.

responses.group('Response').barh('Response')

responses.where('Response',False).num_rows


656

responses.where('Response',True).num_rows

344

Exercise for you: If N out of 1000 responses are True, approximately what fraction of the population has truly sung a Justin Bieber song in the shower?

Analysis

This method is called "randomized response". It is one way to poll people about sensitive subjects, while still protecting their privacy. You can see how it is a nice example of randomness at work.

It turns out that randomized response has beautiful generalizations. For instance, your Chrome web browser uses it to anonymously report feedback to Google, in a way that won't violate your privacy. That's all we'll say about it for this semester, but if you take an upper-division course, maybe you'll get to see some generalizations of this beautiful technique.

The steps in the randomized response survey can be visualized using a tree diagram. The diagram partitions all the survey respondents according to their true answer and the answer that they eventually give. It also displays the proportions of respondents whose true answers are 1 ("True") and 0 ("False"), as well as the chances that determine the answers that they give. As in the code above, we have used p to denote the proportion whose true answer is 1.


The respondents who answer 1 split into two groups. The first group consists of the respondents whose true answer and given answer are both 1. If the number of respondents is large, the proportion in this group is likely to be about 2/3 of p. The second group consists of the respondents whose true answer is 0 and given answer is 1. The proportion in this group is likely to be about 1/3 of 1 − p.

We can observe p̂, the proportion of 1's among the given answers. Thus

p̂ ≈ (2/3)p + (1/3)(1 − p) = 1/3 + (1/3)p

This allows us to solve for an approximate value of p:

p ≈ 3p̂ − 1

In this way we can use the observed proportion of 1's to "work backwards" and get an estimate of p, the proportion in which we are interested.
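The inversion formula can be checked by simulation. The sketch below uses plain numpy rather than the respond function above, and assumes a made-up true proportion p_true = 0.3 (a hypothetical value, not from the text): it generates the given answers under the die mechanism, then applies p ≈ 3p̂ − 1.

```python
import numpy as np

rng = np.random.default_rng(42)
p_true = 0.3      # hypothetical true proportion of "yes" answerers
n = 100_000       # number of simulated respondents

truth = rng.random(n) < p_true            # each respondent's true answer
honest = rng.integers(1, 7, size=n) <= 4  # die shows 1-4: answer honestly
given = np.where(honest, truth, ~truth)   # die shows 5 or 6: answer the opposite

p_hat = given.mean()        # observed proportion of "yes" responses
estimate = 3 * p_hat - 1    # invert: p ≈ 3·p_hat − 1
print(estimate)
```

The estimate lands close to the assumed 0.3, even though no individual response reveals a true answer.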

Technical note. It is worth noting the conditions under which this estimate is valid. The calculation of the proportions in the two groups whose given answer is 1 relies on each of the groups being large enough so that the Law of Averages allows us to make estimates about how their dice are going to land. This means that it is not only the total number of respondents that has to be large – the number of respondents whose true answer is 1 has to be large, as does the number whose true answer is 0. For this to happen, p must be neither close to 0 nor close to 1. If the characteristic of interest is either extremely rare or extremely common in the population, the method of randomized response described in this example might not work well.

Let's try out this method on some real data. The chance of drawing a poker hand with no aces is

np.product(np.arange(48, 43, -1) / np.arange(52, 47, -1))

0.65884199833779666

It is quite embarrassing indeed to draw a hand with no aces. The table below contains one column for the truth of whether a hand has no aces and another for the randomized response.

ace_responses = Table(['Truth', 'Response'])

for i in np.arange(10000):
    hand = deck.sample(5)
    no_aces = hand.where('rank', 'A').num_rows == 0
    ace_responses.append([no_aces, respond(no_aces)])

ace_responses


Truth Response

False True

True True

False True

True False

True False

True True

False False

True False

False False

True True

... (9990 rows omitted)

Using our derived formula, we can estimate what fraction of hands have no aces.

3 * np.count_nonzero(ace_responses.column('Response')) / 10000 - 1

0.6644000000000001


Estimation

In the previous section we saw that two kinds of histograms can be associated with a quantity generated by a random process.

The probability histogram shows all the possible values of the quantity and the probabilities of all those values. The probabilities are calculated using math and the rules of chance.

The empirical histogram is based on running the random process repeatedly and keeping track of the value of the quantity each time. It shows all the values of the quantity and the proportion of times each value was observed among all the repetitions.

We noted that by the Law of Averages, the empirical histogram is likely to resemble the probability histogram if the random process is repeated a large number of times.

This means that simulating random processes repeatedly is a way of approximating probability distributions without figuring out the probabilities mathematically. Thus computer simulations become a powerful tool in data science. They can help data scientists understand the properties of random quantities that would be complicated to analyze using math alone.

Here is an example of such a simulation.

Estimating the number of enemy aircraft

In World War II, data analysts working for the Allies were tasked with estimating the number of German warplanes. The data included the serial numbers of the German planes that had been observed by Allied forces. These serial numbers gave the data analysts a way to come up with an answer.

To create an estimate of the total number of warplanes, the data analysts had to make some assumptions about the serial numbers. Here are two such assumptions, greatly simplified to make our calculations easier.

1. There are N planes, numbered 1, 2, ..., N.

2. The observed planes are drawn uniformly at random with replacement from the N planes.

The goal is to estimate the number N.


Suppose you observe some planes and note down their serial numbers. How might you use the data to guess the value of N? A natural and straightforward method would be to simply use the largest serial number observed.

Let us see how well this method of estimation works. First, another simplification: Some historians now estimate that the German aircraft industry produced almost 100,000 warplanes of many different kinds, but here we will imagine just one kind. That makes Assumption 1 above easier to justify.

Suppose there are in fact N = 300 planes of this kind, and that you observe 30 of them. We can construct a table called serialno that contains the serial numbers 1 through N. We can then sample 30 times with replacement (see Assumption 2) to get our sample of serial numbers. Our estimate is the maximum of these 30 numbers.

N = 300

serialno = Table().with_column('serial number', np.arange(1, N + 1))

serialno

serial number

1

2

3

4

5

6

7

8

9

10

... (290 rows omitted)

max(serialno.sample(30, with_replacement=True).column(0))

296


As with all code involving random sampling, run it a few times to see the variation. You will observe that even with just 30 observations from among 300, the largest serial number is typically in the 250-300 range.

In principle, the largest serial number could be as small as 1, if you were unlucky enough to see Plane Number 1 all 30 times. And it could be as large as 300 if you observe Plane Number 300 at least once. But usually, it seems to be in the very high 200's. It appears that if you use the largest observed serial number as your estimate of the total, you will not be very far wrong.

Let us generate some data to see if we can confirm this. We will use iteration to repeat the sampling procedure numerous times, each time noting the largest serial number observed. These would be our estimates of N from all the numerous samples. We will then draw a histogram of all these estimates, and examine by how much they differ from N.

In the code below, we will run 750 repetitions of the following process: Sample 30 times at random with replacement from 1 through 300 and note the largest number observed.

To do this, we will use a for loop. As you have seen before, we will start by setting up an empty table that will eventually hold all the estimates that are generated. As each estimate is the largest number in its sample, we will call this table maxes.

For each integer (called i in the code) in the range 0 through 749 (750 total), the for loop executes the code in the body of the loop. In this example, it generates a random sample of 30 serial numbers, computes the maximum value, and augments the rows of maxes with that value.

sample_size = 30
repetitions = 750

maxes = Table(['i', 'max'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    maxes.append([i, max(sample.column(0))])

maxes


i max

0 300

1 300

2 291

3 300

4 297

5 295

6 283

7 296

8 299

9 273

... (740 rows omitted)

Here is a histogram of the 750 estimates. As you can see, the estimates are all crowded up near 300, even though in theory they could be much smaller. The histogram indicates that as an estimate of the total number of planes, the largest serial number might be too low by about 10 to 25. But it is extremely unlikely to underestimate the true number of planes by more than about 50.

every_ten = np.arange(1, N + 100, 10)

maxes.hist('max', bins=every_ten)

The empirical distribution of a statistic


In the example above, the largest serial number is called a statistic (singular!). A statistic is any number computed using the data in a sample.

The graph above is an empirical histogram. It displays the empirical or observed distribution of the statistic, based on 750 samples.

The statistic could have been different.

A fundamental consideration in using any statistic based on a random sample is that the sample could have come out differently, and therefore the statistic could have come out differently too.

Just how different could the statistic have been?

If you generate all of the possible samples, and compute the statistic for each of them, then you will have a clear picture of how different the statistic might have been. Indeed, you will have a complete enumeration of all the possible values of the statistic and all their probabilities. The resulting distribution is called the probability distribution of the statistic, and its histogram is the same as the probability histogram.

The probability distribution of a statistic is also called the sampling distribution of the statistic, because it is based on all of the possible samples.

The total number of possible samples is often very large. For example, the total number of possible sequences of 30 serial numbers that you could see if there were 300 aircraft is

300 ** 30

205891132094649000000000000000000000000000000000000000000000000000000000000

That's a lot of samples. Fortunately, we don't have to generate all of them. We know that the empirical histogram of the statistic, based on many but not all of the possible samples, is a good approximation to the probability histogram. The empirical distribution of a statistic gives a good idea of how different the statistic could be.

It is true that the probability distribution of a statistic contains more accurate information about the statistic than an empirical distribution does. But often, as in this example, the approximation provided by the empirical distribution is sufficient for data scientists to understand how much a statistic can vary. And empirical distributions are easier to compute. Therefore, data scientists often use empirical distributions instead of exact probability distributions when they are trying to understand random quantities.

Another Estimate.


Here is an example to illustrate this point. Thus far, we have used the largest observed serial number as an estimate of the total number of planes. But there are other possible estimates, and we will now consider one of them.

The idea underlying this estimate is that the average of the observed serial numbers is likely to be about halfway between 1 and N. Thus, if A is the average, then

A ≈ N/2, and so N ≈ 2A

Thus a new statistic can be used to estimate the total number of planes: take the average of the observed serial numbers and double it.

How does this method of estimation compare with using the largest number observed? It is not easy to calculate the probability distribution of the new statistic. But as before, we can simulate it to get the probabilities approximately. Let's take a look at the empirical distribution based on repeated sampling. The number of repetitions is chosen to be the same as it was for the earlier estimate. This will allow us to compare the two empirical distributions.

new_est = Table(['i', '2*avg'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, 2 * np.average(sample.column(0))])

new_est.hist('2*avg', bins=every_ten)

Notice that unlike the earlier estimate, this one can overestimate the number of planes. This will happen when the average of the observed serial numbers is closer to N than to 1.


For ease of comparison, let us run another set of 750 repetitions of the sampling process and calculate both estimates for each sample.

new_est = Table(['i', 'max', '2*avg'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, max(sample.column(0)), 2 * np.average(sample.column(0))])

new_est.select(['max', '2*avg']).hist(bins=every_ten)

You can see that the old method almost always underestimates; formally, we say that it is biased. But it has low variability, and is highly likely to be close to the true total number of planes.

The new method overestimates about as often as it underestimates, and thus is roughly unbiased on average in the long run. However, it is more variable than the old estimate, and thus is prone to larger absolute errors.

This is an instance of a bias-variance tradeoff that is not uncommon among competing estimates. Which estimate you decide to use will depend on the kinds of errors that matter the most to you. In the case of enemy warplanes, underestimating the total number might have grim consequences, in which case you might choose to use the more variable method that overestimates about half the time. On the other hand, if overestimation leads to needlessly high costs of guarding against planes that don't exist, you might be satisfied with the method that underestimates by a modest amount.


Technical note: In fact, twice the average is not unbiased. On average, it overestimates by exactly 1. For example, if N is 3, the average of draws from 1, 2, and 3 will be 2, and 2 times 2 is 4, which is one more than N. Twice the average minus 1 is an unbiased estimator of N.
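This bias can be checked numerically. The sketch below uses plain numpy rather than the Table-based sampling above; it repeats the sampling process many times and compares the long-run averages of 2·avg and 2·avg − 1 to N.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sample_size, reps = 300, 30, 20_000

# Each row is one sample of 30 serial numbers drawn with replacement.
samples = rng.integers(1, N + 1, size=(reps, sample_size))
avgs = samples.mean(axis=1)

# E[avg] = (N + 1) / 2, so E[2*avg] = N + 1: too high by exactly 1.
print((2 * avgs).mean())      # close to 301
print((2 * avgs - 1).mean())  # close to 300
```

With enough repetitions the first average hovers near N + 1 = 301 and the corrected one near N = 300, matching the claim above.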


Center

We have seen that studying the variability of an estimate involves questions such as:

Where is the empirical histogram centered? About how far away from center could the values be?

In order to discuss estimates and their variability in greater detail, it will help to quantify some basic features of distributions.

The mean of a list of numbers

In this course, we will use the words "average" and "mean" interchangeably.

Definition: The average or mean of a list of numbers is the sum of all the numbers in the list, divided by the number of entries in the list.

Here is the definition applied to a list of 20 numbers.

# The data: a list of numbers
x_list = [4, 4, 4, 5, 5, 5, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]

# The average
sum(x_list) / len(x_list)

3.15

The average of the list is 3.15.

You can think of taking the mean as an "equalizing" or "smoothing" operation. For example, imagine the entries in x_list being the amounts of money (in dollars) in the pockets of 20 different people. To get the mean, you first put all of the money into one big pot, and then divide it evenly among all 20 people. They had started out with different amounts of money in their pockets, but now each person has $3.15, the average amount.


Computation: We could instead have found the mean by applying the function np.mean to the list x_list. Differences in rounding lead to answers that differ very slightly.

np.mean(x_list)

3.1499999999999999

Or we could have placed the list in a table, and used the Table method mean:

x_table = Table().with_column('value', x_list)

x_table['value'].mean()

3.1499999999999999

The mean and the histogram

Another way to calculate the mean is to first arrange the list in a distribution table. The columns contain the distinct values in the list, the count of each value, and the corresponding proportion. The sum of the count column is 20, the number of entries in the list.

values = Table(['value', 'count']).with_rows([
    [2, 6],
    [3, 8],
    [4, 3],
    [5, 3]
])

counts = values.column('count')
x_dist = values.with_column('proportion', counts / sum(counts))

# the distribution table
x_dist


value count proportion

2 6 0.3

3 8 0.4

4 3 0.15

5 3 0.15

The sum of the numbers in the list can be calculated by multiplying the first two columns and adding up – six 2's, eight 3's, and so on – and then dividing by 20:

sum(x_dist.column('value') * x_dist.column('count')) / sum(x_dist.column('count'))

3.1499999999999999

This is the same as multiplying the first and third columns – the values and their proportions – and adding up:

sum(x_dist.column('value') * x_dist.column('proportion'))

3.1500000000000004

We now have another way to think about the mean: The mean of a list is the weighted average of the distinct values, where the weights are the proportions in which those values appear.

Physically, this implies that the average is the balance point of the histogram. Here is the histogram of the distribution. Imagine that it is made out of cardboard and that you are trying to balance the cardboard figure at a point on the horizontal axis. The average of 3.15 is where the figure will balance.

To understand why that is, you will have to study some physics. Balance points, or centers of gravity, can be calculated in exactly the way we calculated the mean from the distribution table.

x_dist.select(['value', 'proportion']).hist(counts='value', bins=np.arange(0.5, 5.6, 1))


Notice that the mean of a list depends only on the distinct values and their proportions; you do not need to know how many entries there are in the list.

In other words, the average is a property of the histogram. If two lists have the same histogram, they will also have the same average.

The mean and the median

If a student's score on a test is below average, does that imply that the student is in the bottom half of the class on that test?

Happily for the student, the answer is, "Not necessarily." The reason has to do with the relation between the average, which is the balance point of the histogram, and the median, which is the "half-way point" of the data.

Let's compare the balance point of the histogram to the point that has 50% of the area of the histogram on either side of it.

The relationship is easy to see in a simple example. Here is the list 1, 2, 2, 3, represented as a distribution table called sym for "symmetric".

sym = Table().with_columns([
    'value', [1, 2, 3],
    'dist', [0.25, 0.5, 0.25]
])

sym


value dist

1 0.25

2 0.5

3 0.25

The histogram (or a calculation) shows that the average and median are both 2.

sym.hist(counts='value', bins=np.arange(0.5, 10.6, 1))

In general, for symmetric distributions, the mean and the median are equal.

What if the distribution is not symmetric? We will explore this by making a change in the histogram above: we will take the bar at the value 3 and slide it over to the value 10.

# Average versus median:
# Balance point versus 50% point
slide = Table().with_columns([
    'value', [1, 2, 3, 10],
    'dist1', [0.25, 0.5, 0.25, 0],
    'dist2', [0.25, 0.5, 0, 0.25]
])

slide


value dist1 dist2

1 0.25 0.25

2 0.5 0.5

3 0.25 0

10 0 0.25

slide.hist(counts='value', bins=np.arange(0.5, 10.6, 1))

The blue histogram represents the original distribution; the gold histogram starts out the same as the blue at the left end, but its rightmost bar has slid over to the value 10. The brown part is where the two histograms overlap.

The median and mean of the blue distribution are both equal to 2. The median of the gold distribution is also equal to 2; the area on either side of 2 is still 50%, though the right half is distributed differently from the left.

But the mean of the gold distribution is not 2: the gold histogram would not balance at 2. The balance point has shifted to the right, to 3.75.

sum(slide['value'] * slide['dist2'])

3.75

In the gold distribution, 3 out of 4 entries (75%) are below average. The student with a below-average score can therefore take heart. He or she might be in the majority of the class.


The mean, the median, and skewed distributions: In general, if the histogram has a tail on one side (the formal term is "skewed"), then the mean is pulled away from the median in the direction of the tail.

Example: Flight Delays. The Bureau of Transportation Statistics provides data on departures and arrivals of flights in the United States. The table ua contains the departure delay times, in minutes, of about 37,500 United Airlines flights over the past few years. A negative delay means that the flight left earlier than scheduled.

ontime = Table.read_table('airline_ontime.csv')

ontime.show(3)

Carrier  Departure Delay  Arrival Delay
AA       -1               -1
AA       -1               -1
AA       -1               -1

... (457010 rows omitted)

ua = ontime.where('Carrier', 'UA').select(1)

ua.show(3)

Departure Delay
-1
-1
-1

... (37360 rows omitted)

ua_bins = np.arange(-15, 121, 2)

ua.hist(bins=ua_bins, unit='minute')


The histogram of the delay times is right-skewed: it has a long right-hand tail.

The mean gets pulled away from the median in the direction of the tail. So we expect the mean delay to be larger than the median, and that is indeed the case:

np.mean(ua['Departure Delay'])

13.885555228434548

np.median(ua['Departure Delay'])

2.0


Spread

Variability

The Rough Size of Deviations from Average

As we have seen, inference based on test statistics must take into account the way in which the statistics vary across samples. It is therefore important to be able to quantify the variability in any list of numbers. One way is to create a measure of the difference between the values in the list and the average of the list.

We will start by defining such a measure in the context of a simple array of just four numbers.

numbers = np.array([1, 2, 2, 10])

numbers

array([ 1,  2,  2, 10])

The goal is to get a measure of roughly how far off the numbers in the list are from the average. To do this, we need the average, and all the deviations from the average. A "deviation from average" is just a value minus the average.

# Step 1. The average.
average = np.mean(numbers)

average

3.75


# Step 2. The deviations from average.
deviations = numbers - average

numbers_and_deviations = Table().with_columns([
    'value', numbers,
    'deviation from average', deviations
])

numbers_and_deviations

value  deviation from average

1 -2.75

2 -1.75

2 -1.75

10 6.25

Some of the deviations are negative; those correspond to values that are below average. Positive deviations correspond to above-average values.

To calculate roughly how big the deviations are, it is natural to compute the mean of the deviations. But something interesting happens when all the deviations are added together:

sum(deviations)

0.0

The positive deviations exactly cancel out the negative ones. This is true of all lists of numbers, no matter what the histogram of the list looks like: the sum of the deviations from average is zero.
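A quick check on a few lists of different shapes (the particular lists are arbitrary examples): the deviations always sum to zero, up to floating-point rounding, since sum(x − mean) = sum(x) − n·mean = 0.

```python
import numpy as np

rng = np.random.default_rng(7)
example_lists = [
    np.array([1, 2, 2, 10]),         # the list from this section
    rng.random(1000),                # an arbitrary list of decimals
    rng.integers(-50, 50, size=25),  # an arbitrary list of integers
]

residuals = []
for values in example_lists:
    deviations = values - np.mean(values)
    residuals.append(abs(deviations.sum()))

# Every residual is zero except for tiny floating-point error.
print(max(residuals))
```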

Since the sum of the deviations is 0, the mean of the deviations will be 0 as well:

np.mean(deviations)

0.0


Because of this, the mean of the deviations is not a useful measure of the size of the deviations. What we really want to know is roughly how big the deviations are, regardless of whether they are positive or negative. So we need a way to eliminate the signs of the deviations.

There are two time-honored ways of losing signs: the absolute value, and the square. It turns out that taking the square constructs a measure with extremely powerful properties, some of which we will study in this course.

So let us eliminate the signs by squaring all the deviations. Then we will take the mean of the squares:

# Step 3. The squared deviations from average
squared_deviations = deviations ** 2

numbers_and_deviations.with_column('squared deviations from average', squared_deviations)

value  deviation from average  squared deviations from average

1 -2.75 7.5625

2 -1.75 3.0625

2 -1.75 3.0625

10 6.25 39.0625

# Variance = the mean squared deviation from average
variance = np.mean(squared_deviations)

variance

13.1875

Variance: The mean squared deviation is called the variance of a sequence of values. While it does give us an idea of spread, it is not on the same scale as the original variable and its units are the square of the original. This makes interpretation very difficult. So we return to the original scale by taking the positive square root of the variance:


# Standard Deviation: root mean squared deviation from average
# Steps of calculation: 5 4 3 2 1
sd = variance ** 0.5

sd

3.6314597615834874

Standard Deviation

The quantity that we have just computed is called the standard deviation of the list, and is abbreviated as SD. It measures roughly how far the numbers on the list are from their average.

Definition. The SD of a list is defined as the root mean square of deviations from average. That's a mouthful. But read it from right to left and you have the sequence of steps in the calculation.

Computation. The NumPy function np.std computes the SD of values in an array:

np.std(numbers)

3.6314597615834874

Working with the SD

Let us now examine the SD in the context of a more interesting dataset. The table nba contains data on the players in the National Basketball Association (NBA) in 2013. For each player, the table records the position at which the player usually played, his height in inches, weight in pounds, and age in years.

nba = Table.read_table("nba2013.csv")

nba


Name             Position  Height  Weight  Age in 2013
DeQuan Jones     Guard     80      221     23
Darius Miller    Guard     80      235     23
Trevor Ariza     Guard     80      210     28
James Jones      Guard     80      215     32
Wesley Johnson   Guard     79      215     26
Klay Thompson    Guard     79      205     23
Thabo Sefolosha  Guard     79      215     29
Chase Budinger   Guard     79      218     25
Kevin Martin     Guard     79      185     30
Evan Fournier    Guard     79      206     20

... (495 rows omitted)

Here is a histogram of the players' heights.

nba.select('Height').hist(bins=np.arange(68, 88, 1))

It is no surprise that NBA players are tall! Their average height is just over 79 inches (6'7"), about 10 inches taller than the average height of men in the United States.

mean_height = np.mean(nba.column('Height'))

mean_height


79.065346534653472

About how far off are the players' heights from the average? This is measured by the SD of the heights, which is about 3.45 inches.

sd_height = np.std(nba.column('Height'))

sd_height

3.4505971830275546

The towering center Hasheem Thabeet of the Oklahoma City Thunder was the tallest player at a height of 87 inches.

nba.sort('Height', descending=True).show(3)

Name             Position  Height  Weight  Age in 2013
Hasheem Thabeet  Center    87      263     26
Roy Hibbert      Center    86      278     26
Tyson Chandler   Center    85      235     30

... (502 rows omitted)

Thabeet was about 8 inches above the average height.

87 - mean_height

7.9346534653465284

That's a deviation from average, and it is about 2.3 times the standard deviation:

(87 - mean_height) / sd_height

2.2995015194397923

In other words, the height of the tallest player was about 2.3 SDs above average.


At 69 inches tall, Isaiah Thomas was one of the two shortest players in the NBA. His height was about 2.9 SDs below average.

nba.sort('Height').show(3)

Name            Position  Height  Weight  Age in 2013
Isaiah Thomas   Guard     69      185     24
Nate Robinson   Guard     69      180     29
John Lucas III  Guard     71      157     30

... (502 rows omitted)

(69 - mean_height) / sd_height

-2.9169868288775844

What we have observed is that the tallest and shortest players were both just a few SDs away from the average height. This is an example of why the SD is a useful measure of spread. No matter what the shape of the histogram, the average and the SD together tell you a lot about the data.

First main reason for measuring spread by the SD

Informal statement. For all lists, the bulk of the entries are within the range "average ± a few SDs".

Try to resist the desire to know exactly what fuzzy words like "bulk" and "few" mean. We will make them precise later in this section. For now, let us examine the statement in the context of some examples.

We have already seen that all of the heights of the NBA players were in the range "average ± 3 SDs".

What about the ages? Here is a histogram of the distribution. Yes, there was a 15-year-old player in the NBA in 2013.

nba.select('Age in 2013').hist(bins=np.arange(15, 45, 1))


Juwan Howard was the oldest player, at 40. His age was about 3.2 SDs above average.

nba.sort('Age in 2013', descending=True).show(3)

Name          Position  Height  Weight  Age in 2013
Juwan Howard  Forward   81      250     40
Marcus Camby  Center    83      235     39
Derek Fisher  Guard     73      210     39

... (502 rows omitted)

ages = nba['Age in 2013']

(40 - np.mean(ages)) / np.std(ages)

3.1958482778922357

The youngest was 15-year-old Jarvis Varnado, who won the NBA Championship that year with the Miami Heat. His age was about 2.6 SDs below average.

nba.sort('Age in 2013').show(3)


Name                   Position  Height  Weight  Age in 2013
Jarvis Varnado         Forward   81      230     15
Giannis Antetokounmpo  Forward   81      205     18
Sergey Karasev         Guard     79      197     19

... (502 rows omitted)

(15 - np.mean(ages)) / np.std(ages)

-2.5895811038670811

What we have observed for the heights and ages is true in great generality. For all lists, the bulk of the entries are no more than 2 or 3 SDs away from the average.

These rough statements about deviations from average are made precise by the following result.

Chebychev's inequality¶

For all lists, and all positive numbers z, the proportion of entries that are in the range "average ± z SDs" is at least 1 - 1/z².

What makes this result powerful is that it is true for all lists – all distributions, no matter how irregular.

Specifically, Chebychev's inequality says that for every list:

the proportion in the range "average ± 2 SDs" is at least 1 - 1/4 = 0.75

the proportion in the range "average ± 3 SDs" is at least 1 - 1/9 ≈ 0.89

the proportion in the range "average ± 4.5 SDs" is at least 1 - 1/4.5² ≈ 0.95

Chebychev's Inequality gives a lower bound, not an exact answer or an approximation. For example, the percent of entries in the range "average ± 2 SDs" might be quite a bit larger than 75%. But it cannot be smaller.
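Chebychev's bound is simple enough to compute and check directly. Here is a minimal sketch; the helper name chebyshev_bound and the example list are ours, not the text's.

```python
import numpy as np

def chebyshev_bound(z):
    # Lower bound on the proportion of entries within
    # "average ± z SDs"; valid for every list.
    return 1 - 1 / z**2

# The bound holds even for a deliberately lopsided list.
values = np.array([1, 2, 2, 3, 10, 50])
su = (values - np.mean(values)) / np.std(values)   # standard units
proportion = np.mean(np.abs(su) < 2)               # within average ± 2 SDs
print(chebyshev_bound(2), proportion)
```

For this list the observed proportion is 5/6, comfortably above the guaranteed 0.75; the bound can be exceeded but never violated.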

Standard units¶

In the calculations above, z measures standard units, the number of standard deviations above average.


Some values of standard units are negative, corresponding to original values that are below average. Other values of standard units are positive. But no matter what the distribution of the list looks like, Chebychev's Inequality says that standard units will typically be in the (-5, 5) range.

To convert a value to standard units, first find how far it is from average, and then compare that deviation with the standard deviation.

As we will see, standard units are frequently used in data analysis. So it is useful to define a function that converts an array of numbers to standard units.

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)

As we saw in an earlier section, the table ua contains a column Departure Delay consisting of the departure delay times, in minutes, of over 37,000 United Airlines flights. We can create a new column called z by applying the function standard_units to the column of delay times. This allows us to see all the delay times in minutes as well as their corresponding values in standard units.

ua.append_column('z', standard_units(ua.column('Departure Delay')))

ua


Departure Delay  z
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942

... (37353 rows omitted)

Something rather alarming happens when we sort the delay times from highest to lowest. The standard units are extremely high!

ua.sort(0,descending=True)

Departure Delay  z
886              22.9631
739              19.0925
678              17.4864
595              15.301
566              14.5374
558              14.3267
543              13.9318
541              13.8791
537              13.7738
513              13.1419

... (37353 rows omitted)

What this shows is that it is possible for data to be many SDs above average (and for flights to be delayed by almost 15 hours). The highest value is more than 22 in standard units.


However, the proportion of these extreme values is tiny, and the bounds in Chebychev's inequality still hold true. For example, let us calculate the percent of delay times that are in the range "average ± 2 SDs". This is the same as the percent of times for which the standard units are in the range (-2, 2). That is just about 96%, as computed below, consistent with Chebychev's bound of "at least 75%".

within_2_sd = np.logical_and(ua.column('z') > -2, ua.column('z') < 2)

ua.where(within_2_sd).num_rows / ua.num_rows

0.960522441988063

The histogram of delay times is shown below, with the horizontal axis in standard units. By the table above, the right hand tail continues all the way out to 22 standard units (886 minutes). The area of the histogram outside the range -2 to 2 is about 4%, put together in tiny little bits that are mostly invisible in a histogram.

ua.hist('z', bins=np.arange(-5, 23, 1), unit='standard unit')


The Normal Distribution

We know that the mean is the balance point of the histogram. Unlike the mean, the SD is usually not easy to identify by looking at the histogram.

However, there is one shape of distribution for which the SD is almost as clearly identifiable as the mean. This section examines that shape, as it appears frequently in probability histograms and also in many histograms of data.

The SD and the bell curve¶

The table births contains data on over 1,000 babies born at a hospital in the Bay Area. The columns contain several variables: the baby's birth weight in ounces; the number of gestational days; the mother's age in years, height in inches, and pregnancy weight in pounds; and a record of whether she smoked or not.

births=Table.read_table('baby.csv')

births

Birth Weight  Gestational Days  Maternal Age  Maternal Height  Maternal Pregnancy Weight  Maternal Smoker
120           284               27            62               100                        False
113           282               33            64               135                        False
128           279               28            64               115                        True
108           282               23            67               125                        True
136           286               25            62               93                         False
138           244               33            62               178                        False
132           245               23            65               140                        False
120           289               25            62               125                        False
143           299               30            66               136                        True
140           351               27            68               120                        False

... (1164 rows omitted)


The mothers' heights have a mean of 64 inches and an SD of 2.5 inches. Unlike the heights of the basketball players, the mothers' heights are distributed fairly symmetrically about the mean in a bell-shaped curve.

heights = births.column('Maternal Height')

mean_height = np.round(np.mean(heights), 1)

mean_height

64.0

sd_height = np.round(np.std(heights), 1)

sd_height

2.5

births.hist('Maternal Height', bins=np.arange(55.5, 72.5, 1), unit='inch')

positions = np.arange(-3, 3.1, 1) * sd_height + mean_height

_ = plots.xticks(positions)

The last two lines of code in the cell above change the labeling of the horizontal axis. Now, the labels correspond to "average ± z SDs" for z = 0, 1, 2, and 3. Because of the shape of the distribution, the "center" has an unambiguous meaning and is clearly visible at 64.


To see how the SD is related to the curve, start at the top of the curve and look towards the right. Notice that there is a place where the curve changes from looking like an "upside-down cup" to a "right-way-up cup"; formally, the curve has a point of inflection. That point is one SD above average: it is the point "average plus 1 SD" = 66.5 inches.

Symmetrically on the left-hand side of the mean, the point of inflection is at "average minus 1 SD" = 61.5 inches.

How to spot the SD on a bell-shaped curve: For bell-shaped distributions, the SD is the distance between the mean and the points of inflection on either side.
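That rule about inflection points can be checked numerically: the second derivative of the normal curve changes sign exactly one SD on either side of the mean. A quick sketch of the check (ours, not from the text), using the standard normal curve, whose mean is 0 and SD is 1:

```python
import numpy as np
from scipy import stats

x = np.arange(-4, 4, 0.001)
pdf = stats.norm.pdf(x)                        # standard normal curve
second = np.gradient(np.gradient(pdf, x), x)   # numerical second derivative

# The curve switches between "upside-down cup" and "right-way-up cup"
# where the second derivative changes sign:
inflections = x[np.where(np.diff(np.sign(second)) != 0)]
print(np.round(inflections, 2))
```

The two sign changes land at -1 and +1: one SD below and above the mean, as the rule says.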

Bell-shaped probability histograms¶

The reason for studying bell-shaped curves is not just that they appear as some distributions of data, but that they appear as approximations to probability histograms of sample averages.

We have seen this phenomenon already, in the example about estimating the total number of aircraft. Recall that we simulated random samples of size 30 drawn with replacement from the serial numbers 1, 2, 3, ..., 300, and computed two different quantities: the largest of the 30 numbers observed, and twice the average of the 30 numbers. Here are the generated histograms.

N = 300
sample_size = 30
repetitions = 5000

serialno = Table().with_column('serial number', np.arange(1, N+1))

new_est = Table(['i', 'max', '2*avg'])
for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, max(sample.column(0)), 2 * np.average(sample.column(0))])

every_ten = np.arange(1, N+100, 10)
new_est.select(['max', '2*avg']).hist(bins=every_ten)


We observed that the distribution of "twice the sample average" is roughly symmetric; now let's also observe that it is roughly bell-shaped.

If the distribution of "twice the average" is bell-shaped, so is the distribution of the average. Here is the distribution of the average of a sample of size 30 drawn from the numbers 1, 2, 3, ..., 300. The number of replications of the sampling procedure is large, so it is safe to assume that the empirical histogram resembles the probability histogram of the sample average.

sample_means = Table(['Sample mean'])
for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    sample_means.append([np.average(sample.column(0))])

every_five = np.arange(100.5, 200.6, 5)
sample_means.hist(bins=every_five)


We can draw this histogram with a red bell-shaped curve superposed. This curve is generated from the mean and standard deviation of the sample means. Don't worry about the code that draws the curve; just focus on the diagram itself.

# Import a module of standard statistical functions
from scipy import stats

sample_means.hist(bins=every_five)

means = sample_means['Sample mean']
m = np.mean(means)
s = np.std(means)

x = np.arange(100.5, 200.6, 1)
plots.plot(x, stats.norm.pdf(x, m, s), color='r', lw=1)

positions = np.arange(-3, 3.1, 1) * 15 + 150
_ = plots.xticks(positions, [int(y) for y in positions])


Since all the numbers drawn are between 1 and 300, we expect the sample mean to be around 150, which is where the histogram is centered.

Can you guess what the SD is? Run your eye along the curve till you feel you are close to the point of inflection to the right of center. If you guessed that the point is at about 165, you're correct – it's between 165 and 166, and symmetrically between 134 and 135 to the left of center.

The probability histogram of a random sample average¶

What is notable about this is not just that we can spot the SD, but that the probability distribution of the statistic is bell-shaped in the first place. There is a remarkable piece of mathematics called the Central Limit Theorem that shows that the distribution of the average of a large random sample will be roughly normal, no matter what the distribution of the population from which the sample is drawn.

In our example about serial numbers, each of the 300 serial numbers is equally likely to be drawn each time, so the distribution of the population is uniform and the probability histogram looks like a brick.

serialno.hist(bins=np.arange(0.5, 300.5, 1))

Let's test out the Central Limit Theorem by drawing samples from quite a different distribution. Here is the histogram of over 37,000 flight delay times in the table ua. It looks nothing like the uniform distribution above.

plotrange = np.arange(-15, 121, 2)
ua.hist(bins=plotrange, unit='minute')


We will now look at the probability distribution of the average of 625 flight times drawn at random from this population. The draws will be made with replacement, as they were in the example about serial numbers.

sample_size = 625

sample_means = Table(['Sample mean'])
for i in np.arange(repetitions):
    sample = ua.sample(sample_size, with_replacement=True)
    sample_means.append([np.average(sample.column(0))])

sample_means.hist(bins=np.arange(9, 19.1, 0.5), unit='minute')

# Draw a red bell-shaped curve from the mean and SD of the sample means
means = sample_means['Sample mean']
m = np.mean(means)
s = np.std(means)
x = np.arange(9, 19.1, 0.1)
_ = plots.plot(x, stats.norm.pdf(x, np.mean(ua[0]), np.std(ua[0])/25), color='r', lw=1)


The probabilities for the sample average are roughly bell-shaped, consistent with the Central Limit Theorem.

The average delay time of all the flights in the population is about 14 minutes, which is where the histogram is centered. To spot the SD, it will help to note that the bars are half a minute wide. By eye, the points of inflection look to be about three bars away from the center. So we would guess that the SD is about 1.5 minutes, and in fact that is not far from the exact value.

What is the exact value? We will answer that question later in the term. Remarkably, it can be calculated easily based on the sample size and the SD of the population, without doing any simulations at all.
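The calculation rests on a fact we can preview: for draws made with replacement, the SD of the sample average equals the population SD divided by the square root of the sample size. Here is a quick check against the serial-number population from earlier (the code is our sketch, not the text's):

```python
import numpy as np

population = np.arange(1, 301)          # serial numbers 1, 2, ..., 300
pop_sd = np.std(population)             # about 86.6

predicted_sd = pop_sd / np.sqrt(30)     # SD of the average of 30 draws
print(predicted_sd)                     # about 15.8

# Simulate 5000 sample means and compare.
np.random.seed(17)
sample_means = [np.mean(np.random.choice(population, 30))
                for i in np.arange(5000)]
print(np.std(sample_means))             # close to the prediction
```

The prediction of about 15.8 matches the inflection points spotted by eye near 134 and 166 in the histogram of serial-number sample means, which is centered at 150.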

For now, here is the main point to note.

Second main reason for using the SD to measure spread¶

For a large random sample, the probability distribution of the sample average will be roughly bell-shaped, with a mean and SD that can be easily identified, no matter what the distribution of the population from which the sample is drawn.

The standard normal curve¶

The bell-shaped curves above look essentially all the same apart from the labels that we have placed on the horizontal axes. Indeed, there is really just one basic curve from which all of these curves can be drawn just by relabeling the axes appropriately.

To draw that basic curve, we will use the units into which we can convert every list: standard units. The resulting curve is therefore called the standard normal curve.


The standard normal curve has an impressive equation. But for now, it is best to think of it as a smoothed version of a histogram of a variable that has been measured in standard units.
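For reference, that impressive equation is

$$\phi(z) = \frac{1}{\sqrt{2\pi}} \, e^{-z^{2}/2}, \qquad -\infty < z < \infty$$

where z is measured in standard units.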

# The standard normal probability density function (pdf)
plot_normal_cdf()

As always when you examine a new histogram, start by looking at the horizontal axis. On the horizontal axis of the standard normal curve, the values are standard units. Here are some properties of the curve. Some are apparent by observation, and others require a considerable amount of mathematics to establish.

The total area under the curve is 1. So you can think of it as a histogram drawn to the density scale.

The curve is symmetric about 0. So if a variable has this distribution, its mean and median are both 0.

The points of inflection of the curve are at -1 and +1.

If a variable has this distribution, its SD is 1. The normal curve is one of the very few distributions that has an SD so clearly identifiable on the histogram.

Since we are thinking of the curve as a smoothed histogram, we will want to represent proportions of the total amount of data by areas under the curve. Areas under smooth curves are often found by calculus, using a method called integration. It is a fact of mathematics, however, that the standard normal curve cannot be integrated in any of the


usual ways of calculus. Therefore, areas under the curve have to be approximated. That is why almost all statistics textbooks carry tables of areas under the normal curve. It is also why all statistical systems, including the stats module of Python, include methods that provide excellent approximations to those areas.

Areas under the normal curve¶

The standard normal cumulative distribution function (cdf)

The fundamental function for finding areas under the normal curve is stats.norm.cdf. It takes a numerical argument and returns all the area under the curve to the left of that number. Formally, it is called the "cumulative distribution function" of the standard normal curve. That rather unwieldy mouthful is abbreviated as cdf.

Let us use this function to find the area to the left of 1 under the standard normal curve.

# Area under the standard normal curve, below 1
plot_normal_cdf(1)

stats.norm.cdf(1)

0.84134474606854293

That's about 84%. We can now use the symmetry of the curve and the fact that the total area under the curve is 1 to find other areas.


For example, the area to the right of 1 is about 100% - 84% = 16%.

# Area under the standard normal curve, above 1
plot_normal_cdf(lbound=1)

1 - stats.norm.cdf(1)

0.15865525393145707

The area between -1 and 1 can be computed in several different ways.

# Area under the standard normal curve, between -1 and 1
plot_normal_cdf(1, lbound=-1)


For example, we can calculate the area as "100% - two equal tails", which works out to roughly 100% - 2 x 16% = 68%.

Or we could note that the area between -1 and 1 is equal to all the area to the left of 1, minus all the area to the left of -1.

stats.norm.cdf(1) - stats.norm.cdf(-1)

0.68268949213708585

By a similar calculation, we see that the area between -2 and 2 is about 95%.

# Area under the standard normal curve, between -2 and 2
plot_normal_cdf(2, lbound=-2)


stats.norm.cdf(2) - stats.norm.cdf(-2)

0.95449973610364158

In other words, if a histogram is roughly bell shaped, the proportion of data in the range "average ± 2 SDs" is about 95%.

That is quite a bit more than Chebychev's lower bound of 75%. Chebychev's bound is weaker because it has to work for all distributions. If we know that a distribution is normal, we have good approximations to the proportions, not just bounds.

The table below compares what we know about all distributions and about normal distributions. Notice that Chebychev's bound is not very illuminating when z = 1.

Percent in Range  All Distributions    Normal Distribution
average ± 1 SD    at least 0%          about 68%
average ± 2 SDs   at least 75%         about 95%
average ± 3 SDs   at least 88.888...%  about 99.73%
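The two right-hand columns of the table can be reproduced directly: Chebychev's bound is 1 - 1/z^2, and the normal proportions come from stats.norm.cdf. A short sketch (ours, not the text's):

```python
import numpy as np
from scipy import stats

for z in [1, 2, 3]:
    chebyshev = max(0, 1 - 1 / z**2)                 # holds for every list
    normal = stats.norm.cdf(z) - stats.norm.cdf(-z)  # bell-shaped data only
    print(z, round(chebyshev, 4), round(normal, 4))
```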


Exploration: Privacy

Section author: David Wagner

License plates¶

We're going to look at some data collected by the Oakland Police Department. They have automated license plate readers on their police cars, and they've built up a database of license plates that they've seen -- and where and when they saw each one.

Data collection¶

First, we'll gather the data. It turns out the data is publicly available on the Oakland public records site. I downloaded it and combined it into a single CSV file by myself before lecture.

lprs = Table.read_table('./all-lprs.csv.gz', compression='gzip', sep=',')

lprs.labels

('red_VRM','red_Timestamp','Location')

Let's start by renaming some columns, and then take a look at it.

lprs.relabel('red_VRM', 'Plate')
lprs.relabel('red_Timestamp', 'Timestamp')
lprs


Plate    Timestamp               Location
1275226  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
27529C   01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
1158423  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
1273718  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
1077682  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)
1214195  01/19/2011 02:06:00 AM  (37.798281000000003, -122.27575299999999)
1062420  01/19/2011 02:06:00 AM  (37.79833, -122.27574300000001)
1319726  01/19/2011 02:05:00 AM  (37.798475000000003, -122.27571500000001)
1214196  01/19/2011 02:05:00 AM  (37.798499999999997, -122.27571)
75227    01/19/2011 02:05:00 AM  (37.798596000000003, -122.27569)

... (2742091 rows omitted)

Phew, that's a lot of data: we can see about 2.7 million license plate reads here.

Let's start by seeing what can be learned about someone, using this data -- assuming you know their license plate.

Stalking Jean Quan¶

As a warmup, we'll take a look at ex-Mayor Jean Quan's car, and where it has been seen. Her license plate number is 6FCH845. (How did I learn that? Turns out she was in the news for getting $1000 of parking tickets, and the news article included a picture of her car, with the license plate visible. You'd be amazed by what's out there on the Internet...)

lprs.where('Plate', '6FCH845')


Plate    Timestamp               Location
6FCH845  11/01/2012 09:04:00 AM  (37.79871, -122.276221)
6FCH845  10/24/2012 11:15:00 AM  (37.799695, -122.274868)
6FCH845  10/24/2012 11:01:00 AM  (37.799693, -122.274806)
6FCH845  10/24/2012 10:20:00 AM  (37.799735, -122.274893)
6FCH845  05/08/2014 07:30:00 PM  (37.797558, -122.26935)
6FCH845  12/31/2013 10:09:00 AM  (37.807556, -122.278485)

OK, so her car shows up 6 times in this data set. However, it's hard to make sense of those coordinates. I don't know about you, but I can't read GPS so well.

So, let's work out a way to show where her car has been seen on a map. We'll need to extract the latitude and longitude, as the data isn't quite in the format that the mapping software expects: the mapping software expects the latitude to be in one column and the longitude in another. Let's write some Python code to do that, by splitting the Location string into two pieces: the stuff before the comma (the latitude) and the stuff after (the longitude).

def get_latitude(s):
    before, after = s.split(',')          # Break it into two parts
    lat_string = before.replace('(', '')  # Get rid of the annoying '('
    return float(lat_string)              # Convert the string to a number

def get_longitude(s):
    before, after = s.split(',')          # Break it into two parts
    long_string = after.replace(')', '').strip()  # Get rid of the ')' and spaces
    return float(long_string)             # Convert the string to a number

Let's test it to make sure it works correctly.

get_latitude('(37.797558,-122.26935)')

37.797558

get_longitude('(37.797558,-122.26935)')


-122.26935

Good, now we're ready to add these as extra columns to the table.

lprs = lprs.with_columns([
    'Latitude', lprs.apply(get_latitude, 'Location'),
    'Longitude', lprs.apply(get_longitude, 'Location')
])
lprs

Plate    Timestamp               Location                                   Latitude  Longitude
1275226  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
27529C   01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
1158423  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
1273718  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
1077682  01/19/2011 02:06:00 AM  (37.798304999999999, -122.27574799999999)  37.7983   -122.276
1214195  01/19/2011 02:06:00 AM  (37.798281000000003, -122.27575299999999)  37.7983   -122.276
1062420  01/19/2011 02:06:00 AM  (37.79833, -122.27574300000001)            37.7983   -122.276
1319726  01/19/2011 02:05:00 AM  (37.798475000000003, -122.27571500000001)  37.7985   -122.276
1214196  01/19/2011 02:05:00 AM  (37.798499999999997, -122.27571)           37.7985   -122.276
75227    01/19/2011 02:05:00 AM  (37.798596000000003, -122.27569)           37.7986   -122.276

... (2742091 rows omitted)

And at last, we can draw a map with a marker everywhere that her car has been seen.

jean_quan = lprs.where('Plate', '6FCH845').select(['Latitude', 'Longitude'])
Marker.map_table(jean_quan)


OK, so it's been seen near the Oakland police department. This should make you suspect we might be getting a bit of a biased sample. Why might the Oakland PD be the most common place where her car is seen? Can you come up with a plausible explanation for this?

Poking around¶

Let's try another. And let's see if we can make the map a little more fancy. It'd be nice to distinguish between license plate reads that are seen during the daytime (on a weekday), vs the evening (on a weekday), vs on a weekend. So we'll color-code the markers. To do this, we'll write some Python code to analyze the Timestamp and choose an appropriate color.


import datetime

def get_color(ts):
    t = datetime.datetime.strptime(ts, '%m/%d/%Y %I:%M:%S %p')
    if t.weekday() >= 6:
        return 'green'  # Weekend
    elif t.hour >= 6 and t.hour <= 17:
        return 'blue'   # Weekday daytime
    else:
        return 'red'    # Weekday evening

lprs.append_column('Color', lprs.apply(get_color, 'Timestamp'))

Now we can check out another license plate, this time with our spiffy color-coding. This one happens to be the car that the city issues to the Fire Chief.

t = lprs.where('Plate', '1328354').select(['Latitude', 'Longitude', 'Timestamp', 'Color'])
Marker.map_table(t)


Hmm. We can see a blue cluster in downtown Oakland, where the Fire Chief's car was seen on weekdays during business hours. I bet we've found her office. In fact, if you happen to know downtown Oakland, those are mostly clustered right near City Hall. Also, her car was seen twice in northern Oakland on weekday evenings. One can only speculate what that indicates. Maybe dinner with a friend? Or running errands? Off to the scene of a fire? Who knows. And then the car has been seen once more, late at night on a weekend, in a residential area in the hills. Her home address, maybe?

Let's look at another. This time, we'll make a function to display the map.

def map_plate(plate):
    sightings = lprs.where('Plate', plate)
    t = sightings.select(['Latitude', 'Longitude', 'Timestamp', 'Color'])
    return Marker.map_table(t)

map_plate('5AJG153')


What can we tell from this? Looks to me like this person lives on International Blvd and 9th, roughly. On weekdays they've been seen in a variety of locations in west Oakland. It's fun to imagine what this might indicate -- delivery person? taxi driver? someone running errands all over the place in west Oakland?

We can look at another:

map_plate('6UZA652')


What can we learn from this map? First, it's pretty easy to guess where this person lives: 16th and International, or pretty near there. And then we can see them spending some nights and a weekend near Laney College. Did they have an apartment there briefly? A relationship with someone who lived there?

Is anyone else getting a little bit creeped out about this? I think I've had enough of looking at individual people's data.

Inference¶

As we can see, this kind of data can potentially reveal a fair bit about people. Someone with access to the data can draw inferences. Take a moment to think about what someone might be able to infer from this kind of data.

As we've seen here, it's not too hard to make a pretty good guess at roughly where someone lives, from this kind of information: their car is probably parked near their home most nights. Also, it will often be possible to guess where someone works: if they commute into work by car, then on weekdays during business hours, their car is probably parked near their office, so we'll see a clear cluster that indicates where they work.


But it doesn't stop there. If we have enough data, it might also be possible to get a sense of what they like to do during their downtime (do they spend time at the park?). And in some cases the data might reveal that someone is in a relationship and spending nights at someone else's house. That's arguably pretty sensitive stuff.

This gets at one of the challenges with privacy. Data that's collected for one purpose (fighting crime, or something like that) can potentially reveal a lot more. It can allow the owner of the data to draw inferences -- sometimes about things that people would prefer to keep private. And that means that, in a world of "big data", if we're not careful, privacy can be collateral damage.

Mitigation¶

If we want to protect people's privacy, what can be done about this? That's a lengthy subject. But at risk of over-simplifying, there are a few simple strategies that data owners can take:

1. Minimize the data they have. Collect only what they need, and delete it after it's not needed.

2. Control who has access to the sensitive data. Perhaps only a handful of trusted insiders need access; if so, then one can lock down the data so only they have access to it. One can also log all access, to deter misuse.

3. Anonymize the data, so it can't be linked back to the individual who it is about. Unfortunately, this is often harder than it sounds.

4. Engage with stakeholders. Provide transparency, to try to avoid people being taken by surprise. Give individuals a way to see what data has been collected about them. Give people a way to opt out and have their data be deleted, if they wish. Engage in a discussion about values, and tell people what steps you are taking to protect them from unwanted consequences.

This only scratches the surface of the subject. My main goal in this lecture was to make you aware of privacy concerns, so that if you are ever a steward of a large data set, you can think about how to protect people's data and use it responsibly.


Chapter 4

Prediction


Correlation

The relation between two variables¶

In the previous sections, we developed several tools that help us describe the distribution of a single variable. Data science also helps us understand how multiple variables are related to each other. This allows us to predict the value of one variable given the values of others, and to get a sense of the amount of error in the prediction.

A good way to start exploring the relation between two variables is by visualization. A graph called a scatter diagram can be used to plot the values of one variable against values of another. Let us start by looking at some scatter diagrams and then move on to quantifying some of the features that we see.

The table hybrid contains data on hybrid passenger cars sold in the United States from 1997 to 2013. The data were obtained from the online data archive of Prof. Larry Winner of the University of Florida. The columns include the model of the car, year of manufacture, the MSRP (manufacturer's suggested retail price) in 2013 dollars, the acceleration rate in km per hour per second, fuel economy in miles per gallon, and the model's class.

hybrid

vehicle          year  msrp     acceleration  mpg    class
Prius (1st Gen)  1997  24509.7  7.46          41.26  Compact
Tino             2000  35355    8.2           54.1   Compact
Prius (2nd Gen)  2000  26832.2  7.97          45.23  Compact
Insight          2000  18936.4  9.52          53     Two Seater
Civic (1st Gen)  2001  25833.4  7.04          47.04  Compact
Insight          2001  19036.7  9.52          53     Two Seater
Insight          2002  19137    9.71          53     Two Seater
Alphard          2003  38084.8  8.33          40.46  Minivan
Insight          2003  19137    9.52          53     Two Seater
Civic            2003  14071.9  8.62          41     Compact


... (143 rows omitted)

The Table method scatter can be used to plot acceleration on the horizontal axis (the first argument) and msrp on the vertical (the second argument). There are 153 points in the scatter, one for each car in the table.

hybrid.scatter('acceleration', 'msrp')

The scatter of points is sloping upwards, indicating that cars with greater acceleration tended to cost more, on average; conversely, the cars that cost more tended to have greater acceleration on average.

This is an example of positive association: above-average values of one variable tend to be associated with above-average values of the other.

The scatter diagram of MSRP (vertical axis) versus mileage (horizontal axis) shows a negative association. The scatter has a clear downward trend. Hybrid cars with higher mileage tended to cost less, on average. This seems surprising till you consider that cars that accelerate fast tend to be less fuel efficient and have lower mileage. As the previous scatter showed, those were also the cars that tended to cost more.

hybrid.scatter('mpg', 'msrp')


Along with the negative association, the scatter diagram of price versus efficiency shows a non-linear relation between the two variables. The points appear to be clustered around a curve.

If we restrict the data just to the SUV class, however, the association between price and efficiency is still negative but the relation appears to be more linear. The relation between the price and acceleration of SUV's also shows a linear trend, but with a positive slope.

suv = hybrid.where('class', 'SUV')
suv.scatter('mpg', 'msrp')

suv.scatter('acceleration', 'msrp')


You will have noticed that we can derive useful information from the general orientation and shape of a scatter diagram even without paying attention to the units in which the variables were measured.

Indeed, we could plot all the variables in standard units and the plots would look the same. This gives us a way to compare the degree of linearity in two scatter diagrams.

Here are the two scatter diagrams for SUVs, with all the variables measured in standard units.

Table().with_columns([
    'mpg (standard units)', standard_units(suv.column('mpg')),
    'msrp (standard units)', standard_units(suv.column('msrp'))
]).scatter(0, 1)
plots.xlim([-3, 3])
plots.ylim([-3, 3])
None


Table().with_columns([
    'acceleration (standard units)', standard_units(suv.column('acceleration')),
    'msrp (standard units)', standard_units(suv.column('msrp'))
]).scatter(0, 1)
plots.xlim([-3, 3])
plots.ylim([-3, 3])
None

The associations that we see in these figures are the same as those we saw before. Also, because the two scatter diagrams are drawn on exactly the same scale, we can see that the linear relation in the second diagram is a little more fuzzy than in the first.

We will now define a measure that uses standard units to quantify the kinds of association that we have seen.


Thecorrelationcoefficient¶Thecorrelationcoefficientmeasuresthestrengthofthelinearrelationshipbetweentwovariables.Graphically,itmeasureshowclusteredthescatterdiagramisaroundastraightline.

Thetermcorrelationcoefficientisn'teasytosay,soitisusuallyshortenedtocorrelationanddenotedby .

Herearesomemathematicalfactsabout thatwewillobservebysimulation.

Thecorrelationcoefficient isanumberbetween and1.measurestheextenttowhichthescatterplotclustersaroundastraightline.

ifthescatterdiagramisaperfectstraightlineslopingupwards,and ifthescatterdiagramisaperfectstraightlineslopingdownwards.

The function r_scatter takes a value of r as its argument and simulates a scatter plot with a correlation very close to r. Because of randomness in the simulation, the correlation is not expected to be exactly equal to r.

Call r_scatter a few times, with different values of r as the argument, and see how the football-shaped cloud changes. Positive r corresponds to positive association: above-average values of one variable are associated with above-average values of the other, and the scatter plot slopes upwards.

When r = 1, the scatter plot is perfectly linear and slopes upward. When r = -1, the scatter plot is perfectly linear and slopes downward. When r = 0, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be uncorrelated.
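The helper r_scatter is provided with the course materials and its source is not shown here. A minimal sketch of how such a function might work — the name r_scatter_sketch, the sample size, and the fixed seed are our assumptions, not the book's:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')          # render off-screen; unnecessary in a notebook
import matplotlib.pyplot as plots

def r_scatter_sketch(r, n=500):
    """Simulate and draw a scatter plot whose correlation is close to r."""
    np.random.seed(17)         # fixed seed so the example is reproducible
    x = np.random.normal(size=n)
    # linear signal in x plus independent noise, scaled so that both
    # variables have SD 1 and the correlation is r in expectation
    y = r * x + np.random.normal(size=n) * (1 - r**2) ** 0.5
    plots.figure()
    plots.scatter(x, y, s=10)
    plots.xlim(-4, 4)
    plots.ylim(-4, 4)
    return x, y

x, y = r_scatter_sketch(0.8)
```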

r_scatter(0.8)


r_scatter(0.25)

r_scatter(0)

r_scatter(-0.7)


Calculating r

The formula for r is not apparent from our observations so far, and it has a mathematical basis that is outside the scope of this class. However, the calculation is straightforward and helps us understand several of the properties of r.

Formula for r:

r is the average of the products of the two variables, when both variables are measured in standard units.
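Written out in symbols, with x_i and y_i for the data and bars and sigmas for the means and standard deviations, the statement above reads:

```latex
r = \frac{1}{n}\sum_{i=1}^{n}
    \left(\frac{x_i - \bar{x}}{\sigma_x}\right)
    \left(\frac{y_i - \bar{y}}{\sigma_y}\right)
```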

Here are the steps in the calculation. We will apply the steps to a simple table of values of x and y.

x = np.arange(1, 7, 1)
y = [2, 3, 1, 5, 2, 7]
t = Table().with_columns([
    'x', x,
    'y', y
])
t


x y

1 2

2 3

3 1

4 5

5 2

6 7

Based on the scatter diagram, we expect that r will be positive but not equal to 1.

t.scatter(0, 1, s=30, color='red')

Step 1. Convert each variable to standard units.

t_su = t.with_columns([
    'x (standard units)', standard_units(x),
    'y (standard units)', standard_units(y)
])
t_su


x    y    x (standard units)    y (standard units)

1 2 -1.46385 -0.648886

2 3 -0.87831 -0.162221

3 1 -0.29277 -1.13555

4 5 0.29277 0.811107

5 2 0.87831 -0.648886

6 7 1.46385 1.78444

Step 2. Multiply each pair of standard units.

t_product = t_su.with_column('product of standard units', t_su.column(2) * t_su.column(3))

t_product

x    y    x (standard units)    y (standard units)    product of standard units

1 2 -1.46385 -0.648886 0.949871

2 3 -0.87831 -0.162221 0.142481

3 1 -0.29277 -1.13555 0.332455

4 5 0.29277 0.811107 0.237468

5 2 0.87831 -0.648886 -0.569923

6 7 1.46385 1.78444 2.61215

Step 3. r is the average of the products computed in Step 2.

# r is the average of the products of standard units
r = np.mean(t_product.column(4))

r

0.61741639718977093

As expected, r is positive but not equal to 1.

Properties of r

The calculation shows that:


r is a pure number. It has no units. This is because r is based on standard units.

r is unaffected by changing the units on either axis. This too is because r is based on standard units.

r is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called x and which y. Geometrically, switching axes reflects the scatter plot about the line y = x, but does not change the amount of clustering nor the sign of the association.

t.scatter('y', 'x', s=30, color='red')

We can define a function correlation to compute r, based on the formula that we used above. The arguments are a table and two labels of columns in the table. The function returns the mean of the products of those column values in standard units, which is r.

def correlation(t, x, y):
    return np.mean(standard_units(t.column(x)) * standard_units(t.column(y)))

Let's call the function on the x and y columns of t. The function returns the same answer for the correlation between x and y as we got by direct application of the formula for r.

correlation(t, 'x', 'y')

0.61741639718977093
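As a cross-check, the mean-of-products definition agrees with NumPy's built-in np.corrcoef, which computes the same quantity from covariances. A self-contained check on the same x and y values used above:

```python
import numpy as np

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)

x = np.arange(1, 7, 1)
y = np.array([2, 3, 1, 5, 2, 7])

# mean of products of standard units, as in the text
r_mean_of_products = np.mean(standard_units(x) * standard_units(y))

# NumPy's built-in correlation matrix; entry [0, 1] is r
r_numpy = np.corrcoef(x, y)[0, 1]
```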


Calling correlation on columns of the table suv gives us the correlation between price and mileage as well as the correlation between price and acceleration.

correlation(suv, 'mpg', 'msrp')

-0.6667143635709919

correlation(suv, 'acceleration', 'msrp')

0.48699799279959155

These values confirm what we had observed:

There is a negative association between price and efficiency, whereas the association between price and acceleration is positive. The linear relation between price and acceleration is a little weaker (correlation about 0.5) than between price and mileage (correlation about -0.67).

Care in Using Correlation

Correlation is a simple and powerful concept, but it is sometimes misused. Before using r, it is important to be aware of what correlation does and does not measure.

Correlation only measures association. Correlation does not imply causation. Though the correlation between the weight and the math ability of children in a school district may be positive, that does not mean that doing math makes children heavier or that putting on weight improves the children's math skills. Age is a confounding variable: older children are both heavier and better at math than younger children, on average.

Correlation measures linear association. Variables that have a strong non-linear association might have very low correlation. Here is an example of variables that have a perfect quadratic relation y = x² but have correlation equal to 0.

nonlinear = Table().with_columns([
    'x', np.arange(-4, 4.1, 0.5),
    'y', np.arange(-4, 4.1, 0.5)**2
])
nonlinear.scatter('x', 'y', s=30, color='r')


correlation(nonlinear, 'x', 'y')

0.0

Outliers can have a big effect on correlation. Here is an example where a scatter plot for which r is equal to 1 is turned into a plot for which r is equal to 0, by the addition of just one outlying point.

line = Table().with_columns([
    'x', [1, 2, 3, 4],
    'y', [1, 2, 3, 4]
])
line.scatter('x', 'y', s=30, color='r')


correlation(line, 'x', 'y')

1.0

outlier = Table().with_columns([
    'x', [1, 2, 3, 4, 5],
    'y', [1, 2, 3, 4, 0]
])
outlier.scatter('x', 'y', s=30, color='r')

correlation(outlier, 'x', 'y')

0.0
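It is worth seeing arithmetically why this particular outlier kills the correlation entirely: r is proportional to the sum of products of deviations from the means, and here that sum is exactly zero.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 2, 3, 4, 0])

# r is the mean of products of standard units, so it is zero exactly
# when the products of deviations from the means sum to zero
dev_products = (x - np.mean(x)) * (y - np.mean(y))
total = dev_products.sum()   # 2 + 0 + 0 + 2 - 4 = 0
```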

Correlations based on aggregated data can be misleading. As an example, here are data on the Critical Reading and Math SAT scores in 2014. There is one point for each of the 50 states and one for Washington, D.C. The column Participation Rate contains the percent of high school seniors who took the test. The next three columns show the average score in the state on each portion of the test, and the final column is the average of the total scores on the test.

sat2014 = Table.read_table('sat2014.csv').sort('State')

sat2014


State    Participation Rate    Critical Reading    Math    Writing    Combined

Alabama 6.7 547 538 532 1617

Alaska 54.2 507 503 475 1485

Arizona 36.4 522 525 500 1547

Arkansas 4.2 573 571 554 1698

California 60.3 498 510 496 1504

Colorado 14.3 582 586 567 1735

Connecticut 88.4 507 510 508 1525

Delaware 100 456 459 444 1359

District of Columbia 100 440 438 431 1309

Florida 72.2 491 485 472 1448

... (41 rows omitted)

The scatter diagram of Math scores versus Critical Reading scores is very tightly clustered around a straight line; the correlation is close to 0.985.

sat2014.scatter('Critical Reading', 'Math')

correlation(sat2014, 'Critical Reading', 'Math')

0.98475584110674341


It is important to note that this does not reflect the strength of the relation between the Math and Critical Reading scores of students. States don't take tests – students do. The data in the table have been created by lumping all the students in each state into a single point at the average values of the two variables in that state. But not all students in the state will be at that point, as students vary in their performance. If you plot a point for each student instead of just one for each state, there will be a cloud of points around each point in the figure above. The overall picture will be fuzzier. The correlation between the Math and Critical Reading scores of the students will be lower than the value calculated based on state averages.

Correlations based on aggregates and averages are called ecological correlations and are frequently reported. As we have just seen, they must be interpreted with care.

Serious or tongue-in-cheek?

In 2012, a paper in the respected New England Journal of Medicine examined the relation between chocolate consumption and Nobel Prizes in a group of countries. The Scientific American responded seriously; others were more relaxed. The paper included the following graph:


Regression

The regression line

The concepts of correlation and the "best" straight line through a scatter plot were developed in the late 1800's and early 1900's. The pioneers in the field were Sir Francis Galton, who was a cousin of Charles Darwin, and Galton's protégé Karl Pearson. Galton was interested in eugenics, and was a meticulous observer of the physical traits of parents and their offspring. Pearson, who had greater expertise than Galton in mathematics, helped turn those observations into the foundations of mathematical statistics.

The scatter plot below is of a famous dataset collected by Pearson and his colleagues in the early 1900's. It consists of the heights, in inches, of 1,078 pairs of fathers and sons. The scatter method of a table takes an optional argument s to set the size of the dots.

heights = Table.read_table('heights.csv')
heights.scatter('father', 'son', s=10)

Notice the familiar football shape with a dense center and a few points on the perimeter. This shape results from two bell-shaped distributions that are correlated. The heights of both fathers and sons have a bell-shaped distribution.


heights.hist(bins=np.arange(55.5, 80, 1), unit='inch')

The average height of sons is about an inch taller than the average height of fathers.

np.mean(heights.column('son')) - np.mean(heights.column('father'))

0.99740259740261195

The difference in height between a son and a father varies as well, with the bulk of the distribution between -5 inches and +7 inches.

diffs = Table().with_column('son - father difference',
                            heights.column('son') - heights.column('father'))
diffs.hist(bins=np.arange(-9.5, 10, 1), unit='inch')


The correlation between the heights of the fathers and sons is about 0.5.

r = correlation(heights, 'father', 'son')
r

0.50116268080759108

The regression effect

For fathers around 72 inches in height, we might expect their sons to be tall as well. The histogram below shows the heights of all sons of 72-inch fathers.

six_foot_fathers = heights.where(np.round(heights.column('father')) == 72)
six_foot_fathers.hist('son', bins=np.arange(55.5, 80, 1))


Most (68%) of the sons of these 72-inch fathers are less than 72 inches tall, even though sons are an inch taller than fathers on average!

np.count_nonzero(six_foot_fathers.column('son') < 72) / six_foot_fathers.num_rows

0.6842105263157895

In fact, the average height of a son of a 72-inch father is less than 71 inches. The sons of tall fathers are simply not as tall in this sample.

np.mean(six_foot_fathers.column('son'))

70.728070175438603

This fact was noticed by Galton, who had been hoping that exceptionally tall fathers would have sons who were just as exceptionally tall. However, the data were clear, and Galton realized that the tall fathers have sons who are not quite as exceptionally tall, on average. Frustrated, Galton called this phenomenon "regression to mediocrity."

Galton also noticed that exceptionally short fathers had sons who were somewhat taller relative to their generation, on average. In general, individuals who are away from average on one variable are expected to be not quite as far away from average on the other. This is called the regression effect.


The figure below shows the scatter plot of the data when both variables are measured in standard units. The red line shows equal standard units and has a slope of 1, but it does not match the angle of the cloud of points. The blue line, which does follow the angle of the cloud, is called the regression line, named for the "regression to mediocrity" it predicts.

heights_standard = Table().with_columns([
    'father (standard units)', standard_units(heights['father']),
    'son (standard units)', standard_units(heights['son'])
])
heights_standard.scatter(0, fit_line=True, s=10)
_ = plots.plot([-3, 3], [-3, 3], color='r', lw=3)

The scatter method of a Table draws this regression line for us when called with fit_line=True. This line passes through the result of multiplying father heights (in standard units) by r.

father_standard = heights_standard.column(0)
heights_standard.with_column('father (standard units) * r', father_standard * r)


Another interpretation of this line is that it passes through average values for slices of the population of sons. To see this relationship, we can round the fathers' heights each to the nearest unit, then average the heights of all sons associated with these rounded values. The green line is the regression line for these data, and it passes close to all of the yellow points, which are mean heights of sons (in standard units).

rounded = heights_standard.with_column('father (standard units)', np.round(heights_standard.column(0)))
rounded.join('father (standard units)', rounded.group(0, np.average)).scatter(0, s=50)
_ = plots.plot([-3, 3], [-3*r, 3*r], color='g', lw=2)


Karl Pearson used the observation of the regression effect in the data above, as well as in other data provided by Galton, to develop the formal calculation of the correlation coefficient. That is why r is sometimes called Pearson's correlation.

The regression line in original units

As we saw in the last section for football-shaped scatter plots, when the variables x and y are measured in standard units, the best straight line for estimating y based on x has slope r and passes through the origin. Thus the equation of the regression line can be written as:

estimate of y, in standard units = r × (x, in standard units)

That is,

(estimate of y − average of y) / SD of y = r × (x − average of x) / SD of x

The equation can be converted into the original units of the data, either by rearranging this equation algebraically, or by labeling some important features of the line both in standard units and in the original units.


Calculation of the slope and intercept

The regression line is also commonly expressed as a slope and an intercept, where an estimate for y is computed from an x value using the equation

estimate of y = slope × x + intercept

The calculations of the slope and intercept of the regression line can be derived from the equation above.
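Rearranging the standard-units form of the regression line gives the usual formulas, with bars denoting means and sigmas denoting standard deviations:

```latex
\text{slope} = r \cdot \frac{\sigma_y}{\sigma_x},
\qquad
\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}
```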


def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y)) / np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))

[slope(heights, 'father', 'son'), intercept(heights, 'father', 'son')]

[0.51400591254559247, 33.892800540661682]

It is worth noting that the intercept of approximately 33.89 inches is not intended as an estimate of the height of a son whose father is 0 inches tall. There is no such son and no such father. The intercept is merely a geometric or algebraic quantity that helps define the line. In general, it is not a good idea to extrapolate, that is, to make estimates outside the range of the available data. It is certainly not a good idea to extrapolate as far away as 0 is from the heights of the fathers in the study.

It is also worth noting that the slope is not r! Instead, it is r multiplied by the ratio of standard deviations.

correlation(heights, 'father', 'son')

0.50116268080759108
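On any dataset, the slope and intercept computed this way agree with the least-squares line. A self-contained check on synthetic data (the heights table itself is not reproduced here, so the numbers below are illustrative only):

```python
import numpy as np

np.random.seed(7)
x = np.random.normal(68, 3, size=200)             # synthetic "father" heights
y = 0.5 * x + np.random.normal(34, 2, size=200)   # synthetic "son" heights

r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)                  # r times the ratio of SDs
intercept = np.mean(y) - slope * np.mean(x)

# np.polyfit(x, y, 1) returns the least-squares slope and intercept
fit_slope, fit_intercept = np.polyfit(x, y, 1)
```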

Fitted values

We can also estimate the height of every son in the data using the slope and intercept. The estimated values of y are called the fitted values. They all lie on a straight line. To calculate them, take a father's height, multiply it by the slope of the regression line, and add the intercept. In other words, calculate the height of the regression line at the given value of x.


def fit(table, x, y):
    """Return the height of the regression line at each x value."""
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

fitted = heights.with_column('son (fitted)', fit(heights, 'father', 'son'))
fitted

father    son    son (fitted)

65 59.8 67.3032

63.3 63.2 66.4294

65 63.3 67.3032

65.8 62.8 67.7144

61.1 64.3 65.2986

63 64.2 66.2752

65.4 64.1 67.5088

64.7 64 67.149

66.1 64.6 67.8686

67 64 68.3312

... (1068 rows omitted)

In original units, we can see that these fitted values fall on a line.

fitted.scatter(0, s=10)


Moreover, this line (in green below) passes near the average heights of sons for slices of the population. In this case, we round the heights of fathers to the nearest inch to construct the slices.

rounded = heights.with_column('father', np.round(heights.column('father')))
rounded.join('father', rounded.group(0, np.average)).scatter(0, s=80)
_ = plots.plot(fitted.column('father'), fitted.column('son (fitted)'), color='g', lw=2)


Prediction

In the last section, we developed the concepts of correlation and regression as ways to describe data. We will now see how these concepts can become powerful tools for prediction, when used appropriately. In particular, given paired observations of two quantities and some additional observations of only the first quantity, we can infer typical values for the second quantity.

Assumptions of randomness: a "regression model"

Prediction is possible if we believe that a scatter plot reflects the underlying relation between the two variables being plotted, but does not specify the relation completely. For example, a scatter plot of the heights of fathers and sons shows us the precise relation between the two variables in one particular sample of men; but we might wonder whether that relation holds true, or almost true, among all fathers and sons in the population from which the sample was drawn, or indeed among fathers and sons in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as models. Sets of assumptions about randomness in roughly linear scatter plots are called regression models.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the signal that we would like to identify. However, we are not able to see the line clearly, because there is additional variability in the data beyond the relation. What we see are points that are scattered around the line. For each of the points, the linear signal that relates the two variables has been combined with some other source of variability that has nothing to do with the relation at all. This random noise obfuscates the linear signal, and it is our inferential goal to identify the signal despite the noise.

In order to make these statements more precise, we will use the np.random.normal function, which generates samples from a normal distribution with mean 0 and standard deviation 1. These samples correspond to a bell-shaped distribution expressed in standard units.

np.random.normal()


-0.6423537870160524

samples = Table('x')
for i in np.arange(10000):
    samples.append([np.random.normal()])
samples.hist(0, bins=np.arange(-4, 4, 0.1))

The regression model specifies that the points in a scatter plot, measured in standard units, are generated at random as follows.

The x values are sampled from the normal distribution.

For each x, its corresponding y is sampled using the signal_and_noise function below, which takes x and the correlation coefficient r.

def signal_and_noise(x, r):
    return r * x + np.random.normal() * (1 - r**2)**0.5
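The noise term is scaled by √(1 − r²) so that y, like x, has variance 1: Var(y) = r²·Var(x) + (1 − r²)·Var(noise) = r² + (1 − r²) = 1, assuming the noise is independent of x. An empirical check of both the variance and the resulting correlation:

```python
import numpy as np

def signal_and_noise(x, r):
    return r * x + np.random.normal() * (1 - r**2)**0.5

np.random.seed(3)   # fixed seed so the check is reproducible
r = 0.5
xs = np.random.normal(size=20000)
ys = np.array([signal_and_noise(x, r) for x in xs])

var_y = np.var(ys)                 # close to 1
est_r = np.corrcoef(xs, ys)[0, 1]  # close to r
```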

For example, if we choose r to be a half, then the following scatter diagram might result.


def regression_model(r, sample_size):
    pairs = Table(['x', 'y'])
    for i in np.arange(sample_size):
        x = np.random.normal()
        y = signal_and_noise(x, r)
        pairs.append([x, y])
    return pairs

regression_model(0.5, 1000).scatter('x', 'y')

Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below draws both the true line used to generate the data (green) and the regression line (blue) found by analyzing the pairs that were generated.


def compare(true_r, sample_size):
    pairs = regression_model(true_r, sample_size)
    estimated_r = correlation(pairs, 'x', 'y')
    pairs.scatter('x', 'y', fit_line=True)
    plt.plot([-3, 3], [-3*true_r, 3*true_r], color='g', lw=2)
    print("The true r is", true_r, "and the estimated r is", estimated_r)

compare(0.5, 1000)

The true r is 0.5 and the estimated r is 0.459672161067

With 1000 samples, we see that the true line and the regression line are quite similar. With fewer samples, the regression estimate of the true line that generated the data may be quite different.

compare(0.5, 50)

The true r is 0.5 and the estimated r is 0.627329170668


With very few samples, the regression line may have a clear positive or negative slope, even when the true relationship has a correlation of 0.

compare(0, 10)

The true r is 0 and the estimated r is 0.501582566452


With real data, we will never observe the true line. In fact, regression is often applied when we are uncertain whether the true relation between two quantities is linear at all. What the simulation shows is that if the regression model looks plausible, and we have a large sample, then the regression line is a good approximation to the true line.

Predictions

Here is an example where the regression model can be used to make predictions.

The data are a subset of the information gathered in a randomized controlled trial about treatments for Hodgkin's disease. Hodgkin's disease is a cancer that typically affects young people. The disease is curable but the treatment is very harsh. The purpose of the trial was to come up with a dosage that would cure the cancer but minimize the adverse effects on the patients.

This table hodgkins contains data on the effect that the treatment had on the lungs of 22 patients. The columns are:

Height in cm
A measure of radiation to the mantle (neck, chest, underarms)
A measure of chemotherapy
A score of the health of the lungs at baseline, that is, at the start of the treatment; higher scores correspond to more healthy lungs
The same score of the health of the lungs, 15 months after treatment


hodgkins = Table.read_table('hodgkins.csv')
hodgkins.show()

height rad chemo base month15

164 679 180 160.57 87.77

168 311 180 98.24 67.62

173 388 239 129.04 133.33

157 370 168 85.41 81.28

160 468 151 67.94 79.26

170 341 96 150.51 80.97

163 453 134 129.88 69.24

175 529 264 87.45 56.48

185 392 240 149.84 106.99

178 479 216 92.24 73.43

179 376 160 117.43 101.61

181 539 196 129.75 90.78

173 217 204 97.59 76.38

166 456 192 81.29 67.66

170 252 150 98.29 55.51

165 622 162 118.98 90.92

173 305 213 103.17 79.74

174 566 198 94.97 93.08

173 322 119 85 41.96

173 270 160 115.02 81.12

183 259 241 125.02 97.18

188 238 252 137.43 113.2

It is evident that the patients' lungs were less healthy 15 months after the treatment than at baseline. At 36 months, they did recover most of their lung function, but those data are not part of this table.

The scatter plot below shows that taller patients had higher scores at 15 months.


hodgkins.scatter('height', 'month15', fit_line=True)

The scatter plot looks roughly linear. Let us assume that the regression model holds. Under that assumption, the regression line can be used to predict the 15-month score of a new patient, based on the patient's height. The prediction will be good provided the assumptions of the regression model are justified for these data, and provided the new patient is similar to those in the study.

The functions slope and intercept, defined earlier, give us the slope and intercept of the regression line. To predict the 15-month score of a new patient whose height is x cm, we use the following calculation:

Predicted 15-month score of a patient who is x cm tall = slope × x + intercept

The predicted 15-month score of a patient who is 173 cm tall is 83.66, and that of a patient who is 163 cm tall is 73.69.

# slope and intercept
a = slope(hodgkins, 'height', 'month15')
b = intercept(hodgkins, 'height', 'month15')


# New patient, 173 cm tall
# Predicted 15-month score:
a * 173 + b

83.65699801460444

# New patient, 163 cm tall
# Predicted 15-month score:
a * 163 + b

73.694360467072741

The logic of regression prediction can be wrapped in a function predict that takes a table, the two columns of interest, and some new value for x. It predicts the average value for y.

def predict(t, x, y, new_x_value):
    a = slope(t, x, y)
    b = intercept(t, x, y)
    return a * new_x_value + b

predict(hodgkins, 'height', 'month15', 163)

73.694360467072741

Variability of the Prediction

As data scientists working under the regression model, we must always remain aware that a sample might have been different. Had the sample been different, the regression line would have been different too, and so would our prediction.


To understand how much a prediction might vary, we can select different samples from the data we have. Below, we select only 15 of the 22 rows of the hodgkins table at random without replacement, then make a prediction.

predict(hodgkins.sample(15), 'height', 'month15', 163)

71.653856125189805

The prediction is different, but how much might we expect it to differ? From a single sample, it is not possible to answer this question. The simulation below draws 1000 samples of 15 rows each, then draws a histogram of the predicted values using each sample.

def sample_predict(new_x_value):
    predictions = Table(['Predictions'])
    for i in np.arange(1000):
        predicted = predict(hodgkins.sample(15), 'height', 'month15', new_x_value)
        predictions.append([predicted])
    predictions.hist(0, bins=np.arange(40, 96, 2))

sample_predict(163)

Using only 15 rows instead of all 22, the prediction could have come out differently from 73.69. This histogram shows how different the prediction might be. Almost all predictions fall between 64 and 82, but a few are outside of this range.


For an x value that is closer to the mean of all heights, the predictions from 15 rows are more tightly clustered.

sample_predict(173)

In this histogram of the predictions for a height of 173, we see that almost all predictions fall between 76 and 90. The narrow range 80 to 86 accounts for nearly 70% of the estimates.

Predictions of values outside the typical observations of height vary wildly. A data scientist is rarely justified in extrapolating a linear trend beyond the range of values observed.

sample_predict(150)


Higher-Order Functions

The functions we have defined so far compute their outputs directly from their inputs. Their return values can be computed from their arguments. However, some functions require additional data to determine their return values.

Let's start with a simple example. Consider the following three functions that all scale their input by a constant factor.

def double(x):
    return 2 * x

def triple(x):
    return 3 * x

def halve(x):
    return 0.5 * x

[double(4), triple(4), halve(4), double(5), triple(5), halve(5)]

[8, 12, 2.0, 10, 15, 2.5]

Rather than defining each of these functions with a separate def statement, it is possible to generate each of them dynamically by passing in a constant to a generic scale_by function.

def scale_by(a):
    def scale(x):
        return a * x
    return scale

double = scale_by(2)
triple = scale_by(3)
halve = scale_by(0.5)

[double(4), triple(4), halve(4), double(5), triple(5), halve(5)]


[8, 12, 2.0, 10, 15, 2.5]

The body of the scale_by function actually defines a new function, called scale, and then returns it. The return value of scale_by is a function that takes a number x and multiplies it by a. Because scale_by is a function that returns a function, it is called a higher-order function.

The purpose of higher-order functions is to create functions that combine data with behavior. Above, the double function is created by combining the data value 2 with the behavior of scaling by a constant.

A def statement within another def statement behaves in the same way as any other def statement except that:

1. The name of the inner function is only accessible within the outer function's body. For example, it is not possible to use the name scale once scale_by has returned.

2. The body of the inner function can refer to the arguments and names within the outer function. So, for example, the body of the inner scale function can refer to a, which is an argument of the outer function scale_by.

The final line of the outer scale_by function is return scale. This line returns the function that was just defined. The purpose of returning a function is to give it a name. The line

double=scale_by(2)

gives the name double to the particular instance of the scale function that has a assigned to the number 2. Notice that it is possible to keep multiple instances of the scale function around at once, all with different names such as double and triple, and Python keeps track of which value for a to use each time one of these functions is called.
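A minimal self-contained demonstration that each call to scale_by captures its own value of a:

```python
def scale_by(a):
    def scale(x):
        return a * x
    return scale

double = scale_by(2)
triple = scale_by(3)

# the two closures are distinct function objects,
# each remembering its own value of a
results = [double(10), triple(10)]   # [20, 30]
```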

Example: Standard Units

We have seen how to convert a list of numbers to standard units by subtracting the mean and dividing by the standard deviation.

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)


This function can convert any list of numbers to standard units.

observations = [3, 4, 2, 4, 3, 5, 1, 6]
np.round(standard_units(observations), 3)

array([-0.333, 0.333, -1., 0.333, -0.333, 1., -1.667, 1.667])

How would we answer the question, "what is the number 8 (in original units) converted to standard units?"

Adding 8 to the list of observations and re-computing standard_units does not answer this question, because adding an observation changes the scale!

observations_with_eight = [3, 4, 2, 4, 3, 5, 1, 6, 8]
standard_units(observations_with_eight)

array([-0.5, 0., -1., 0., -0.5, 0.5, -1.5, 1., 2.])

If instead we want to maintain the original scale, we need a function that remembers the original mean and standard deviation. This scenario, where both data and behavior are required to arrive at the answer, is a natural fit for a higher-order function.

def converter_to_standard_units(any_numbers):
    """Return a function that converts to the standard units of any_numbers."""
    mean = np.mean(any_numbers)
    sd = np.std(any_numbers)
    def convert(a_number_in_original_units):
        return (a_number_in_original_units - mean) / sd
    return convert

Now, we can use the original list of numbers to create the conversion function. The mean and standard deviation are stored within the function that is returned, called to_su below. What's more, these numbers are only computed once, no matter how many times we call the function.


observations = [3, 4, 2, 4, 3, 5, 1, 6]
to_su = converter_to_standard_units(observations)
[to_su(2), to_su(3), to_su(8)]

[-1.0, -0.33333333333333331, 3.0]

A complementary function that returns a converter from standard units to original units involves multiplying by the standard deviation, then adding the mean.

def converter_from_standard_units(any_numbers):
    """Return a function that converts from the standard units of any_numbers."""
    mean = np.mean(any_numbers)
    sd = np.std(any_numbers)
    def convert(in_standard_units):
        return in_standard_units * sd + mean
    return convert

from_su = converter_from_standard_units(observations)
from_su(3)

8.0

Any number converted to standard units and then back, using the same list to create both functions, is returned unchanged.

from_su(to_su(11))

11.0
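This round-trip property holds for any value v, since ((v − mean) / sd) · sd + mean = v. A quick check across several values, using plain functions equivalent to the converters above:

```python
import numpy as np

observations = np.array([3, 4, 2, 4, 3, 5, 1, 6])
mean, sd = np.mean(observations), np.std(observations)

def to_su(v):
    return (v - mean) / sd

def from_su(u):
    return u * sd + mean

values = np.array([-2.5, 0.0, 3.5, 11.0, 100.0])
round_trip = from_su(to_su(values))   # equal to values, up to rounding
```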

Example: Prediction

Previously, in order to compute the fitted values for a regression, we first computed the slope and intercept of the regression line. As an alternate implementation, for each x, we can convert x to standard units from the original units of x values, then multiply by r, then


convert the result to the original units of y values. We can write down this process as a higher-order function.

def regression_estimator(table, x, y):
    """Return an estimator for y as a function of x."""
    to_su = converter_to_standard_units(table.column(x))
    from_su = converter_from_standard_units(table.column(y))
    r = correlation(table, x, y)
    def estimate(any_x):
        return from_su(r * to_su(any_x))
    return estimate

The following dataset of mothers and babies was collected from the Kaiser Hospital in Oakland, California.

baby = Table.read_table('baby.csv')
baby

Birth Weight    Gestational Days    Maternal Age    Maternal Height    Maternal Pregnancy Weight    Maternal Smoker

120 284 27 62 100 False

113 282 33 64 135 False

128 279 28 64 115 True

108 282 23 67 125 True

136 286 25 62 93 False

138 244 33 62 178 False

132 245 23 65 140 False

120 289 25 62 125 False

143 299 30 66 136 True

140 351 27 68 120 False

... (1164 rows omitted)

Gestational days are the number of days that the mother is pregnant. Most pregnancies last between 250 and 310 days.


baby.hist(1, bins=np.arange(200, 330, 10))

For this collection of typical births, gestational days and birth weight appear to be linearly related.

typical = baby.where(np.logical_and(baby.column(1) >= 250, baby.column(1) <= 310))

typical.scatter(1, 0, fit_line=True)


The function average_weight returns the average weight of babies born after a certain number of gestational days, using the regression line to predict this value. This function is defined within regression_estimator and returned. The returned value is bound to the name average_weight using an assignment statement.

average_weight = regression_estimator(typical, 1, 0)

The average_weight function contains all the information needed to make birth weight predictions, but when it is called, it takes only a single argument: the gestational days. We avoid the need to pass the table and column labels into the average_weight function each time that we call it because those values were passed to regression_estimator, which defined the average_weight function in terms of those values.
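This pattern, in which a returned function captures values from the call that created it, is not specific to regression. A minimal sketch with hypothetical names:

```python
def make_adder(n):
    """Return a function that remembers n and adds it to its argument."""
    def add(x):
        return x + n
    return add

add_five = make_adder(5)  # n is captured here, once
add_five(10)              # no need to pass 5 again
```

Just as add_five remembers n, average_weight remembers the table and column labels that were passed to regression_estimator.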

average_weight(290)

126.33785752202166

This function is quite flexible. For instance, we can compute fitted values for all gestational days by applying it to the column of gestational days.

fitted = typical.apply(average_weight, 1)

typical.select([1, 0]).with_column('Birth Weight (fitted)', fitted)


Higher-order functions appear often in the context of prediction because estimating a value often involves both data to inform the prediction and a procedure that makes the prediction itself.


Errors

Error in the regression estimate

Though the average residual is 0, the individual residuals typically are not. Some residuals might be quite far from 0. To get a sense of the amount of error in the regression estimate, we will start with a graphical description of the sense in which the regression line is the "best".

Our example is a dataset that has one point for every chapter of the novel "Little Women." The goal is to estimate the number of characters (that is, letters, punctuation marks, and so on) based on the number of periods. Recall that we attempted to do this in the very first lecture of this course.

little_women = Table.read_table('little_women.csv')

little_women.show(3)

Characters  Periods
21759       189
22148       188
20558       231
...(44 rows omitted)

# One point for each chapter
# Horizontal axis: number of periods
# Vertical axis: number of characters (as in a, b, ", ?, etc; not people in the book)
little_women.scatter('Periods', 'Characters')


correlation(little_women, 'Periods', 'Characters')

0.92295768958548163

The scatter plot is remarkably close to linear, and the correlation is more than 0.92.

The figure below shows the scatter plot and regression line, with four of the errors marked in red.

# Residuals: Deviations from the regression line
lw_slope = slope(little_women, 'Periods', 'Characters')
lw_intercept = intercept(little_women, 'Periods', 'Characters')
print('Slope:', np.round(lw_slope))
print('Intercept:', np.round(lw_intercept))
lw_errors(lw_slope, lw_intercept)

Slope: 87.0
Intercept: 4745.0


Had we used a different line to create our estimates, the errors would have been different. The picture below shows how big the errors would be if we were to use a particularly silly line for estimation.

# Errors: Deviations from a different line
lw_errors(-100, 50000)

Below is a line that we have used before without saying that we were using a line to create estimates. It is the horizontal line at the value "average of $y$." Suppose you were asked to estimate $y$ and were not told the value of $x$; then you would use the average of $y$ as your estimate, regardless of the chapter. In other words, you would use the flat line below.

Each error that you would make would then be a deviation from average. The rough size of these deviations is the SD of $y$.

In summary, if we use the flat line at the average of $y$ to make our estimates, the estimates will be off by the SD of $y$.

# Errors: Deviations from the flat line at the average of y
characters_average = np.mean(little_women.column('Characters'))

lw_errors(0, characters_average)

Least Squares

If you use any arbitrary line as your line of estimates, then some of your errors are likely to be positive and others negative. To avoid cancellation when measuring the rough size of the errors, we take the mean of the squared errors rather than the mean of the errors themselves. This is exactly analogous to our reason for looking at squared deviations from average, when we were learning how to calculate the SD.

The mean squared error of estimation using a straight line is a measure of roughly how big the squared errors are; taking the square root yields the root mean squared error, which is in the same units as $y$.
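In symbols, writing $\hat{y}_i$ for the estimate of the $i$th observed value $y_i$ among $n$ points:

$$\text{mse} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \text{rmse} = \sqrt{\text{mse}}$$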


Here is a remarkable fact of mathematics: the regression line minimizes the mean squared error of estimation (and hence also the root mean squared error) among all straight lines. That is why the regression line is sometimes called the "least squares line."

Computing the "best" line.

To get estimates of $y$ based on $x$, you can use any line you want. Every line has a mean squared error of estimation. "Better" lines have smaller errors. The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.
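Stated as a minimization, the regression line's slope and intercept are the pair $(a, b)$ that solves

$$\min_{a,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\big(y_i - (a\,x_i + b)\big)^2$$

over all slopes $a$ and intercepts $b$.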

We can define a function to compute the root mean squared error of any line through the Little Women scatter diagram.

def lw_rmse(slope, intercept):
    lw_errors(slope, intercept)
    x = little_women.column('Periods')
    y = little_women.column('Characters')
    fitted = slope * x + intercept
    mse = np.average((y - fitted) ** 2)
    print("Root mean squared error:", mse ** 0.5)

lw_rmse(0, characters_average)

Root mean squared error: 7019.17593405


The error is indeed much smaller if we choose a slope and intercept near those of the regression line.

lw_rmse(90, 4000)

Root mean squared error: 2715.53910638

But the minimum is achieved using the regression line itself.

lw_rmse(lw_slope, lw_intercept)

Root mean squared error: 2701.69078531

Numerical Optimization

We can also define mean_squared_error for an arbitrary dataset. We'll use a higher-order function so that we can try many different lines on the same dataset, simply by passing in their slope and intercept.

def mean_squared_error(table, x, y):
    def for_line(slope, intercept):
        fitted = slope * table.column(x) + intercept
        return np.average((table.column(y) - fitted) ** 2)
    return for_line

The Old Faithful geyser in Yellowstone National Park erupts regularly, but the waiting time between eruptions (in minutes) and the duration of the eruptions (in minutes) do vary.

faithful = Table.read_table('faithful.csv')

faithful.scatter(1, 0)


It appears that there are two types of eruptions, short and long. The eruptions above 3 minutes appear correlated with waiting time.

long = faithful.where(faithful.column('eruptions') > 3)

long.scatter(1, 0, fit_line=True)

The mse_long function, created below, takes a slope and an intercept and returns the mean squared error of a linear predictor of eruption duration from the waiting time.

mse_long = mean_squared_error(long, 1, 0)

mse_long(0, 4) ** 0.5

0.50268461001194098

If we experiment with different values, we can find a low-error slope and intercept through trial and error.

mse_long(0.01, 3.5) ** 0.5

0.39143872353883735

mse_long(0.02, 2.7) ** 0.5

0.38168519564089542

The minimize function can be used to find the arguments of a function for which the function returns its minimum value. It uses a similar trial-and-error approach, following the changes that lead to incrementally lower output values.
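To give a flavor of that trial-and-error approach, here is a deliberately crude sketch of a two-argument minimizer. This is not the actual implementation of minimize, which relies on a standard numerical optimizer; it only illustrates the idea of keeping changes that lower the output:

```python
def crude_minimize(f, a=0.0, b=0.0, step=1.0, tol=1e-6):
    """Minimize a two-argument function f by trial and error: nudge each
    argument up or down, keep any change that lowers the output, and
    shrink the step size once no nudge helps."""
    best = f(a, b)
    while step > tol:
        improved = False
        for da, db in [(step, 0), (-step, 0), (0, step), (0, -step)]:
            value = f(a + da, b + db)
            if value < best:
                a, b, best = a + da, b + db, value
                improved = True
        if not improved:
            step = step / 2
    return a, b

# The minimum of (a - 3)^2 + (b + 1)^2 is at a=3, b=-1
a_min, b_min = crude_minimize(lambda a, b: (a - 3) ** 2 + (b + 1) ** 2)
```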

a, b = minimize(mse_long)

The root mean squared error of the minimal slope and intercept is smaller than any of the values we considered before.

mse_long(a, b) ** 0.5

0.38014612988504093

In fact, these values a and b are the same as the values returned by the slope and intercept functions we developed based on the correlation coefficient. We see small deviations due to the inexact nature of minimize, but the values are essentially the same.


print("slope:", slope(long, 1, 0))
print("a:", a)
print("intercept:", intercept(long, 1, 0))
print("b:", b)

slope: 0.0255508940715
a: 0.0255496
intercept: 2.24752334164
b: 2.247629

Therefore, we have found not only that the regression line minimizes mean squared error, but also that minimizing mean squared error gives us the regression line. The regression line is the only line that minimizes mean squared error.

Residuals

The amount of error in each of these regression estimates is the difference between the observed value of $y$ and its estimate. These errors are called residuals. Some residuals are positive. These correspond to points that are above the regression line – points for which the regression line under-estimates $y$. Negative residuals correspond to the line over-estimating values of $y$.

waiting = long.column('waiting')
eruptions = long.column('eruptions')
fitted = a * waiting + b
res = long.with_columns([
    'eruptions (fitted)', fitted,
    'residuals', eruptions - fitted
])

res


eruptions  waiting  eruptions (fitted)  residuals
3.6        79       4.26605             -0.666047
3.333      74       4.1383              -0.805299
4.533      85       4.41934             0.113655
4.7        88       4.49599             0.204006
3.6        85       4.41934             -0.819345
4.35       85       4.41934             -0.069345
3.917      84       4.3938              -0.476795
4.2        78       4.2405              -0.0404978
4.7        83       4.36825             0.331754
4.8        84       4.3938              0.406205
...(165 rows omitted)

As with deviations from average, the positive and negative residuals exactly cancel each other out, so the average (and sum) of the residuals is 0. Because we found a and b by numerically minimizing the mean squared error rather than computing them exactly, the sum of the residuals is slightly different from zero in this case.

sum(res.column('residuals'))

-0.00037579999996140145

A residual plot is a scatter plot of the residuals versus the fitted values. The residual plot of a good regression looks like the one below: a formless cloud with no pattern, centered around the horizontal axis. It shows that there is no discernible non-linear pattern in the original scatter plot.

res.scatter(2, 3, fit_line=True)


By contrast, suppose we had attempted to fit a regression line to the entire set of eruptions of Old Faithful in the original dataset.

faithful.scatter(0, 1, fit_line=True)

This line does not pass through the average values within vertical strips of the scatter. We may be able to see this fact from the scatter alone, but it is dramatically clear by visualizing the residual scatter.


def residual_plot(table, x, y):
    fitted = fit(table, x, y)
    residuals = table.column(y) - fitted
    res_table = Table().with_columns([
        'fitted', fitted,
        'residuals', residuals])
    res_table.scatter(0, 1, fit_line=True)

residual_plot(faithful, 1, 0)

The diagonal stripes show a non-linear pattern in the data. In this case, we would not expect the regression line to provide low-error estimates.

Example: SAT Scores

Residual plots can be useful for spotting non-linearity in the data, or other features that weaken the regression analysis. For example, consider the SAT data of the previous section, and suppose you try to estimate the Combined score based on the Participation Rate.

sat = Table.read_table('sat2014.csv')

sat


State         Participation Rate  Critical Reading  Math  Writing  Combined
North Dakota  2.3                 612               620   584      1816
Illinois      4.6                 599               616   587      1802
Iowa          3.1                 605               611   578      1794
South Dakota  2.9                 604               609   579      1792
Minnesota     5.9                 598               610   578      1786
Michigan      3.8                 593               610   581      1784
Wisconsin     3.9                 596               608   578      1782
Missouri      4.2                 595               597   579      1771
Wyoming       3.3                 590               599   573      1762
Kansas        5.3                 591               596   566      1753
...(41 rows omitted)

sat.scatter('Participation Rate', 'Combined', s=8)

The relation between the variables is clearly non-linear, but you might be tempted to fit a straight line anyway, especially if you never looked at a scatter diagram of the data.


sat.scatter('Participation Rate', 'Combined', s=8, fit_line=True)

The points in the scatter plot start out above the regression line, then are consistently below the line, then above, then below. This pattern of non-linearity is more clearly visible in the residual plot.

residual_plot(sat, 'Participation Rate', 'Combined')
_ = plt.title('Residual plot of a bad regression')


This residual plot is not a formless cloud; it shows a non-linear pattern, and is a signal that linear regression should not have been used for these data.

The Size of Residuals

We will conclude by observing several relationships between quantities we have already identified. We'll illustrate the relationships using the long eruptions of Old Faithful, but they indeed hold for any collection of paired observations.

Fact 1: The root mean squared error of a line that has zero slope and an intercept at the average of $y$ is the standard deviation of $y$.

mse_long(0, np.average(eruptions)) ** 0.5

0.40967604587437784

np.std(eruptions)

0.40967604587437784

By contrast, the standard deviation of the fitted values is smaller.

eruptions_fitted = fit(long, 'waiting', 'eruptions')

np.std(eruptions_fitted)

0.15271994814407569

Fact 2: The ratio of the standard deviation of the fitted values to the standard deviation of the $y$ values is $|r|$.

r = correlation(long, 'waiting', 'eruptions')
print('r:', r)
print('ratio:', np.std(eruptions_fitted) / np.std(eruptions))

r: 0.372782225571
ratio: 0.372782225571


Notice the absolute value of $r$ in the formula above. For the heights of fathers and sons, the correlation is positive and so there is no difference between using $r$ and using its absolute value. However, the result is true for variables that have negative correlation as well, provided we are careful to use the absolute value of $r$ instead of $r$.

The regression line does a better job of estimating eruption durations than a zero-slope line. Thus, the rough size of the errors made using the regression line must be smaller than that using a flat line. In other words, the SD of the residuals must be smaller than the overall SD of $y$.

eruptions_residuals = eruptions - eruptions_fitted

np.std(eruptions_residuals)

0.38014612980028634

Fact 3: The SD of the residuals is $\sqrt{1 - r^2}$ times the SD of $y$.

np.std(eruptions) * (1 - r ** 2) ** 0.5

0.38014612980028639

Fact 4: The variance of the fitted values plus the variance of the residuals gives the variance of $y$.

np.std(eruptions_fitted) ** 2 + np.std(eruptions_residuals) ** 2

0.16783446256326531

np.std(eruptions) ** 2

0.16783446256326534
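Fact 4 can be written as an equation, and combining it with Fact 2 recovers Fact 3:

$$\text{Var}(y) = \text{Var}(\text{fitted}) + \text{Var}(\text{residuals})$$

Since $\text{SD}(\text{fitted}) = |r| \cdot \text{SD}(y)$ by Fact 2, it follows that $\text{Var}(\text{residuals}) = (1 - r^2)\,\text{Var}(y)$, and taking square roots gives Fact 3.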


Multiple Regression

Now that we have explored ways to use multiple features to predict a categorical variable, it is natural to study ways of using multiple predictor variables to predict a quantitative variable. A commonly used method to do this is called multiple regression.

We will start with an example to review some fundamental aspects of simple regression, that is, regression based on one predictor variable.

Example: Simple Regression

Suppose that our goal is to use regression to estimate the height of a basset hound based on its weight, using a sample that looks consistent with the regression model. Suppose the observed correlation $r$ is 0.5, and that the summary statistics for the two variables are as in the table below:

        average    SD
height  14 inches  2 inches
weight  50 pounds  5 pounds

To calculate the equation of the regression line, we need the slope and the intercept:

$$\text{slope} = r \cdot \frac{\text{SD of height}}{\text{SD of weight}} = 0.5 \cdot \frac{2 \text{ inches}}{5 \text{ pounds}} = 0.2 \text{ inches per pound}$$

$$\text{intercept} = \text{average height} - \text{slope} \cdot \text{average weight} = 14 - 0.2 \cdot 50 = 4 \text{ inches}$$

The equation of the regression line allows us to calculate the estimated height, in inches, based on a given weight in pounds:

$$\text{estimated height} = 0.2 \cdot \text{weight} + 4$$
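The arithmetic can be checked directly, with plain numbers standing in for the summary statistics above:

```python
r = 0.5                           # correlation between height and weight
avg_height, sd_height = 14, 2     # inches
avg_weight, sd_weight = 50, 5     # pounds

slope = r * sd_height / sd_weight            # 0.2 inches per pound
intercept = avg_height - slope * avg_weight  # 4 inches

def estimated_height(weight):
    """Regression estimate of height (inches) from weight (pounds)."""
    return slope * weight + intercept

estimated_height(60)  # a 60-pound dog is estimated to be 16 inches tall
```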

The slope of the line measures the increase in the estimated height per unit increase in weight. The slope is positive, and it is important to note that this does not mean that we think basset hounds get taller if they put on weight. The slope reflects the difference in the average heights of two groups of dogs that are 1 pound apart in weight. Specifically, consider a group of dogs whose weight is $w$ pounds, and the group whose weight is $w + 1$ pounds. The second group is estimated to be 0.2 inches taller, on average. This is true for all values of $w$ in the sample.

In general, the slope of the regression line can be interpreted as the average increase in $y$ per unit increase in $x$. Note that if the slope is negative, then for every unit increase in $x$, the average of $y$ decreases.

Multiple Predictors

In multiple regression, more than one predictor variable is used to estimate $y$. For example, a Dartmouth study of undergraduates collected information about their use of an online course forum and their GPA. We might want to predict a student's GPA based on the number of days that they visit the course forum and how many times they answer someone else's question. Then the multiple regression model would involve two slopes, an intercept, and random errors as before:

$$\text{GPA} = \text{slope}_1 \cdot (\text{days online}) + \text{slope}_2 \cdot (\text{answers}) + \text{intercept} + \text{error}$$

Our goal would be to find the estimated GPA using the best estimates of the two slopes and the intercept; as before, the "best" estimates are those that minimize the mean squared error of estimation.

To start off, we will investigate the dataset, which is described online. Each row represents a student. In addition to the student's GPA, the row tallies:

- days online: The number of days on which the student viewed the online forum
- views: The number of posts viewed
- contributions: The number of contributions, including posts and follow-up discussions
- questions: The number of questions posted
- notes: The number of notes posted
- answers: The number of answers posted

grades = Table.read_table('grades_and_piazza.csv')

grades


GPA    days online  views  contributions  questions  notes  answers
2.863  29           299    5              1          1      0
3.505  57           299    0              0          0      0
3.029  27           101    1              1          0      0
3.679  67           301    1              0          0      0
3.474  43           201    12             1          0      0
3.705  67           308    45             22         0      5
3.806  36           171    20             4          3      4
3.667  82           300    26             11         0      3
3.245  44           127    6              1          1      0
3.293  35           259    16             13         1      0
...(20 rows omitted)

Correlation Matrix

Perhaps we wish to predict GPA based on forum usage. A natural first step is to see which variables are correlated with the GPA. Here is the correlation matrix. The to_df method generates a Pandas dataframe containing the same data as the table, and its corr method generates a matrix of correlations for each pair of columns.

All forum usage variables are correlated with GPA, but the notes correlation has a very small magnitude.

grades.to_df().corr()

               GPA        days online  views     contributions  questions
GPA            1.000000   0.684905     0.444175  0.427897       0.409212
days online    0.684905   1.000000     0.654557  0.448319       0.435269
views          0.444175   0.654557     1.000000  0.426406       0.361002
contributions  0.427897   0.448319     0.426406  1.000000       0.857981
questions      0.409212   0.435269     0.361002  0.857981       1.000000
notes          -0.160604  -0.230839    0.065627  0.295661       -0.006365
answers        0.440382   0.502810     0.365010  0.702679       0.515661


To start off, let us perform the simple regression of GPA on just days online, the count with the strongest correlation with an r of 0.68. Here is the scatter diagram and regression line.

grades.scatter('days online', 'GPA', fit_line=True)

We can immediately see that the relation is not entirely linear, and a residual plot highlights this fact. One issue is that the GPA is never above 4.0, so the model assumption that y values are bell shaped does not hold in this case.

residual_plot(grades, 'days online', 'GPA')

Nonetheless, the regression line does capture some of the pattern in the data, and so we will continue to work with it, noting that our predictions may not be entirely justified by the regression model.


The root mean squared error is a bit above 1/4 of a GPA point when predicting GPA from only the days online.

grades_days_mse = mean_squared_error(grades, 'days online', 'GPA')
a, b = minimize(grades_days_mse)
grades_days_mse(a, b) ** 0.5

0.28494512878662448

Two Predictors

Perhaps more information can improve our prediction. The number of answers is also correlated with GPA. When using it alone as a predictor, the root mean squared error is greater because the correlation has lower magnitude than days online.

grades_answers_mse = mean_squared_error(grades, 'answers', 'GPA')
a, b = minimize(grades_answers_mse)
grades_answers_mse(a, b) ** 0.5

0.35110518920562062

However, the two variables can be used together. First, we define the mean squared error in terms of a line that has two slopes, one for each variable, as well as an intercept.

def grades_days_answers_mse(a_days, a_answers, b):
    fitted = a_days * grades.column('days online') + \
             a_answers * grades.column('answers') + \
             b
    y = grades.column('GPA')
    return np.average((y - fitted) ** 2)

minimize(grades_days_answers_mse)

array([ 0.0157683,  0.0301254,  2.6031789])


The minimize function returns three values: the slope for days online, the slope for answers, and the intercept. Together, these three values describe a linear prediction function. For example, someone who spent 50 days online and answered 5 questions is predicted to have a GPA of about 3.54.

np.round(0.0157683 * 50 + 0.0301254 * 5 + 2.6031789, 2)

3.54

The root mean squared error of this predictor is a bit smaller than that of either single-variable predictor we had before!

a_days, a_answers, b = minimize(grades_days_answers_mse)
grades_days_answers_mse(a_days, a_answers, b) ** 0.5

0.28161529539241298

Combining all variables

We can combine all information about the course forum in order to attempt to find an even better predictor.


def fit_grades(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b):
    return a_days * grades.column('days online') + \
           a_views * grades.column('views') + \
           a_contributions * grades.column('contributions') + \
           a_questions * grades.column('questions') + \
           a_notes * grades.column('notes') + \
           a_answers * grades.column('answers') + \
           b

def grades_all_mse(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b):
    fitted = fit_grades(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b)
    y = grades.column('GPA')
    return np.average((y - fitted) ** 2)

minimize(grades_all_mse)

array([  1.43703000e-02,  -8.15000000e-05,   6.50720000e-03,
        -1.72780000e-03,  -4.04014000e-02,   1.59645000e-02,
         2.65375470e+00])

Using all of this information, we have constructed a predictor with an even smaller root mean squared error. The * below passes all of the return values of the minimize function as arguments to grades_all_mse.

grades_all_mse(*minimize(grades_all_mse)) ** 0.5

0.27783237243108722
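Argument unpacking with * is a general Python feature, not specific to minimize. A minimal illustration with hypothetical names:

```python
def line3(a, b, c):
    """A toy three-argument function."""
    return a + 2 * b + 3 * c

args = [1, 10, 100]
# line3(*args) is equivalent to line3(1, 10, 100)
line3(*args)  # -> 321
```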

All of this information did not improve our predictor very much. The root mean squared error decreased from about 0.285 for the best single-variable predictor to about 0.278 when using all of the available information.

Definition of $R^2$, consistent with our old $r$

When we studied simple regression, we had noted that

$$\frac{\text{SD of fitted values}}{\text{SD of observed values of } y} = |r|$$

Let us use our old functions to compute the fitted values and confirm that this is true for our example:

fitted = fit(grades, 'days online', 'GPA')

np.std(fitted) / np.std(grades.column('GPA'))

0.68490480299828194

Because variance is the square of the standard deviation, we can say that

$$\frac{\text{variance of fitted values}}{\text{variance of observed values of } y} = r^2$$

Notice that this way of thinking about $r^2$ involves only the estimated values and the observed values, not the number of predictor variables. Therefore, it motivates the definition of multiple $R^2$:

$$R^2 = \frac{\text{variance of fitted values}}{\text{variance of observed values of } y}$$

It is a fact of mathematics that this quantity is always between 0 and 1. With multiple predictor variables, there is no clear interpretation of a sign attached to the square root of $R^2$. Some of the predictors might be positively associated with $y$, others negatively. An overall measure of the fit is provided by $R^2$.

fitted = fit_grades(*minimize(grades_all_mse))

np.var(fitted) / np.var(grades.column('GPA'))

0.49525855565521787

It is not always wise to maximize $R^2$ by including many variables, but we will address this concern later in the text.

We can also view the residual plot of multiple regression, just as before. The horizontal axis shows fitted values and the vertical axis shows errors. Again, there is some structure to this plot, so perhaps the relation between these variables is not entirely linear. In this case, the structure is minor enough that the regression model is a reasonable choice of predictor.


Table().with_columns([
    'fitted', fitted,
    'errors', grades.column('GPA') - fitted
]).scatter(0, 1, fit_line=True)


Classification

Section author: David Wagner

This section will discuss machine learning. Machine learning is a class of techniques for automatically finding patterns in data and using them to draw inferences or make predictions. We're going to focus on a particular kind of machine learning, namely, classification.

Classification is about learning how to make predictions from past examples: we're given some examples where we have been told what the correct prediction was, and we want to learn from those examples how to make good predictions in the future. Here are a few applications where classification is used in practice:

For each order Amazon receives, Amazon would like to predict: is this order fraudulent? They have some information about each order (e.g., its total value, whether the order is being shipped to an address this customer has used before, whether the shipping address is the same as the credit card holder's billing address). They have lots of data on past orders, and they know which of those past orders were fraudulent and which weren't. They want to learn patterns that will help them predict, as new orders arrive, whether those new orders are fraudulent.

Online dating sites would like to predict: are these two people compatible? Will they hit it off? They have lots of data on which matches they've suggested to their customers in the past, and they have some idea which ones were successful. As new customers sign up, they'd like to make predictions about who might be a good match for them.

Doctors would like to know: does this patient have cancer? Based on the measurements from some lab test, they'd like to be able to predict whether the particular patient has cancer. They have lots of data on past patients, including their lab measurements and whether they ultimately developed cancer, and from that, they'd like to try to infer what measurements tend to be characteristic of cancer (or non-cancer) so they can diagnose future patients accurately.

Politicians would like to predict: are you going to vote for them? This will help them focus fundraising efforts on people who are likely to support them, and focus get-out-the-vote efforts on voters who will vote for them. Public databases and commercial databases have a lot of information about most people: e.g., whether they own a home or rent; whether they live in a rich neighborhood or poor neighborhood; their interests and hobbies; their shopping habits; and so on. And political campaigns have surveyed some voters and found out who they plan to vote for, so they have some examples where the correct answer is known. From this data, the campaigns would like to find patterns that will help them make predictions about all other potential voters.

All of these are classification tasks. Notice that in each of these examples, the prediction is a yes/no question -- we call this binary classification, because there are only two possible predictions. In a classification task, we have a bunch of observations. Each observation represents a single individual or a single situation where we'd like to make a prediction. Each observation has multiple attributes, which are known (e.g., the total value of the order; the voter's annual salary; and so on). Also, each observation has a class, which is the answer to the question we care about (e.g., yes or no; fraudulent or not; etc.).

For instance, with the Amazon example, each order corresponds to a single observation. Each observation has several attributes (e.g., the total value of the order, whether the order is being shipped to an address this customer has used before, and so on). The class of the observation is either 0 or 1, where 0 means that the order is not fraudulent and 1 means that the order is fraudulent. Given the attributes of some new order, we are trying to predict its class.

Classification requires data. It involves looking for patterns, and to find patterns, you need data. That's where the data science comes in. In particular, we're going to assume that we have access to training data: a bunch of observations, where we know the class of each observation. The collection of these pre-classified observations is also called a training set. A classification algorithm is going to analyze the training set, and then come up with a classifier: an algorithm for predicting the class of future observations.

Note that classifiers do not need to be perfect to be useful. They can be useful even if their accuracy is less than 100%. For instance, if the online dating site occasionally makes a bad recommendation, that's OK; their customers already expect to have to meet many people before they'll find someone they hit it off with. Of course, you don't want the classifier to make too many errors -- but it doesn't have to get the right answer every single time.

Chronic kidney disease

Let's work through an example. We're going to work with a dataset that was collected to help doctors diagnose chronic kidney disease (CKD). Each row in the dataset represents a single patient who was treated in the past and whose diagnosis is known. For each patient, we have a bunch of measurements from a blood test. We'd like to find which measurements are most useful for diagnosing CKD, and develop a way to classify future patients as "has CKD" or "doesn't have CKD" based on their blood test results.

Let's load the dataset into a table and look at it.


ckd = Table.read_table('ckd.csv').relabeled('Blood Glucose Random', 'Glucose')

ckd

Age  Blood Pressure  Specific Gravity  Albumin  Sugar  Red Blood Cells  Pus Cell  Pus Cell clumps
48   70              1.005             4        0      normal           abnormal  present
53   90              1.02              2        0      abnormal         abnormal  present
63   70              1.01              3        0      abnormal         abnormal  present
68   80              1.01              3        2      normal           abnormal  present
61   80              1.015             2        0      abnormal         abnormal  not present
48   80              1.025             4        0      normal           abnormal  not present
69   70              1.01              3        4      normal           abnormal  not present
73   70              1.005             0        0      normal           normal    not present
73   80              1.02              2        0      abnormal         abnormal  not present
46   60              1.01              1        0      normal           normal    not present
...(148 rows omitted)

We have data on 158 patients. There are an awful lot of attributes here. The column labelled "Class" indicates whether the patient was diagnosed with CKD: 1 means they have CKD, 0 means they do not have CKD.

Let's look at two columns in particular: the hemoglobin level (in the patient's blood), and the blood glucose level (at a random time in the day; without fasting specially for the blood test). We'll draw a scatter plot, to make it easy to visualize this. Red dots are patients with CKD; blue dots are patients without CKD. What test results seem to indicate CKD?

ckd.scatter('Hemoglobin', 'Glucose', c=ckd.column('Class'))


Suppose Alice is a new patient who is not in the dataset. If I tell you Alice's hemoglobin level and blood glucose level, could you predict whether she has CKD? It sure looks like it! You can see a very clear pattern here: points in the lower-right tend to represent people who don't have CKD, and the rest tend to be folks with CKD. To a human, the pattern is obvious. But how can we program a computer to automatically detect patterns such as this one?

Well, there are lots of kinds of patterns one might look for, and lots of algorithms for classification. But I'm going to tell you about one that turns out to be surprisingly effective. It is called nearest neighbor classification. Here's the idea. If we have Alice's hemoglobin and glucose numbers, we can put her somewhere on this scatter plot; the hemoglobin is her x-coordinate, and the glucose is her y-coordinate. Now, to predict whether she has CKD or not, we find the nearest point in the scatter plot and check whether it is red or blue; we predict that Alice should receive the same diagnosis as that patient.

In other words, to classify Alice as CKD or not, we find the patient in the training set who is "nearest" to Alice, and then use that patient's diagnosis as our prediction for Alice. The intuition is that if two points are near each other in the scatter plot, then the corresponding measurements are pretty similar, so we might expect them to receive the same diagnosis (more likely than not). We don't know Alice's diagnosis, but we do know the diagnosis of all the patients in the training set, so we find the patient in the training set who is most similar to Alice, and use that patient's diagnosis to predict Alice's diagnosis.
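The nearest neighbor rule can be sketched in a few lines of plain Python. The points below are hypothetical (hemoglobin, glucose) pairs, not taken from the dataset; class 1 stands for CKD and 0 for not CKD:

```python
import math

def nearest_neighbor_class(training, new_point):
    """training: a list of ((x, y), class) pairs with known diagnoses.
    Predict the class of new_point as the class of the closest training point."""
    def distance(point):
        return math.sqrt((point[0] - new_point[0]) ** 2 +
                         (point[1] - new_point[1]) ** 2)
    closest = min(training, key=lambda row: distance(row[0]))
    return closest[1]

# Hypothetical training set: 1 = CKD (upper left), 0 = not CKD (lower right)
training = [((11.0, 210.0), 1), ((9.5, 160.0), 1),
            ((15.0, 110.0), 0), ((14.2, 95.0), 0)]
nearest_neighbor_class(training, (14.8, 100.0))  # -> 0: nearest point is non-CKD
```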

The scatter plot suggests that this nearest neighbor classifier should be pretty accurate. Points in the lower-right will tend to receive a "no CKD" diagnosis, as their nearest neighbor will be a blue point. The rest of the points will tend to receive a "CKD" diagnosis, as their nearest neighbor will be a red point. So the nearest neighbor strategy seems to capture our intuition pretty well, for this example.


However,theseparationbetweenthetwoclasseswon'talwaysbequitesoclean.Forinstance,supposethatinsteadofhemoglobinlevelsweweretolookatwhitebloodcellcount.Lookatwhathappens:

ckd.scatter('WhiteBloodCellCount', 'Glucose', c=ckd.column('Class'))

As you can see, non-CKD individuals are all clustered in the lower-left. Most of the patients with CKD are above or to the right of that cluster... but not all. There are some patients with CKD who are in the lower left of the above figure (as indicated by the handful of red dots scattered among the blue cluster). What this means is that you can't tell for certain whether someone has CKD from just these two blood test measurements.

If we are given Alice's glucose level and white blood cell count, can we predict whether she has CKD? Yes, we can make a prediction, but we shouldn't expect it to be 100% accurate. Intuitively, it seems like there's a natural strategy for predicting: plot where Alice lands in the scatterplot; if she is in the lower-left, predict that she doesn't have CKD, otherwise predict she has CKD. This isn't perfect -- our predictions will sometimes be wrong. (Take a minute and think it through: for which patients will it make a mistake?) As the scatterplot above indicates, sometimes people with CKD have glucose and white blood cell levels that look identical to those of someone without CKD, so any classifier is inevitably going to make the wrong prediction for them.

Can we automate this on a computer? Well, the nearest neighbor classifier would be a reasonable choice here too. Take a minute and think it through: how will its predictions compare to those from the intuitive strategy above? When will they differ? Its predictions will be pretty similar to our intuitive strategy, but occasionally it will make a different prediction. In particular, if Alice's blood test results happen to put her right near one of the red dots in the lower-left, the intuitive strategy would predict "not CKD", whereas the nearest neighbor classifier will predict "CKD".

There is a simple generalization of the nearest neighbor classifier that fixes this anomaly. It is called the $k$-nearest neighbor classifier. To predict Alice's diagnosis, rather than looking at just the one neighbor closest to her, we can look at the 3 points that are closest to her, and use the diagnosis for each of those 3 points to predict Alice's diagnosis. In particular, we'll use the majority value among those 3 diagnoses as our prediction for Alice's diagnosis. Of course, there's nothing special about the number 3: we could use 4, or 5, or more. (It's often convenient to pick an odd number, so that we don't have to deal with ties.) In general, we pick a number $k$, and our predicted diagnosis for Alice is based on the $k$ patients in the training set who are closest to Alice. Intuitively, these are the $k$ patients whose blood test results were most similar to Alice's, so it seems reasonable to use their diagnoses to predict Alice's diagnosis.

The $k$-nearest neighbor classifier will now behave just like our intuitive strategy above.
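As a sketch of the $k$-nearest neighbor rule (again on invented points rather than the real data), we can sort the training points by distance and take a majority vote among the $k$ nearest labels:

```python
import numpy as np

# Invented 2-D training data: 1 = CKD, 0 = not CKD.
train_pts = np.array([[11.0, 230.0], [9.5, 180.0], [10.5, 210.0],
                      [14.2, 95.0], [15.1, 105.0], [13.8, 90.0]])
train_labels = np.array([1, 1, 1, 0, 0, 0])

def knn_predict(p, k):
    """Majority vote among the labels of the k training points closest to p."""
    dists = np.sqrt(((train_pts - p) ** 2).sum(axis=1))
    nearest_labels = train_labels[np.argsort(dists)[:k]]
    return 1 if nearest_labels.sum() > k / 2 else 0

print(knn_predict(np.array([14.0, 100.0]), 3))  # → 0
```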

Decision boundary

Sometimes a helpful way to visualize a classifier is to map the region of space where the classifier would predict 'CKD', and the region of space where it would predict 'not CKD'. We end up with some boundary between the two, where points on one side of the boundary will be classified 'CKD' and points on the other side will be classified 'not CKD'. This boundary is called the decision boundary. Each different classifier will have a different decision boundary; the decision boundary is just a way to visualize what criteria the classifier is using to classify points.
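One common way to compute a picture of the decision boundary (a sketch, using a hypothetical two-point training set rather than the CKD data): evaluate the classifier at every point of a dense grid and record each grid point's predicted class. The boundary is wherever the predicted class changes.

```python
import numpy as np

# Hypothetical 1-nearest-neighbor classifier with just two training points.
train_pts = np.array([[0.0, 0.0], [4.0, 4.0]])
train_labels = np.array([0, 1])

def predict(p):
    dists = ((train_pts - p) ** 2).sum(axis=1)
    return train_labels[np.argmin(dists)]

# Classify every point of a coarse grid covering the plane.
xs, ys = np.meshgrid(np.linspace(-1, 5, 60), np.linspace(-1, 5, 60))
grid = np.column_stack([xs.ravel(), ys.ravel()])
classes = np.array([predict(p) for p in grid])
# With matplotlib you could now draw plt.scatter(grid[:, 0], grid[:, 1], c=classes);
# for these two points the boundary is the perpendicular bisector x + y = 4.
```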

Banknote authentication

Let's do another example. This time we'll look at predicting whether a banknote (e.g., a $20 bill) is counterfeit or legitimate. Researchers have put together a data set for us, based on photographs of many individual banknotes: some counterfeit, some legitimate. They computed a few numbers from each image, using techniques that we won't worry about for this course. So, for each banknote, we know a few numbers that were computed from a photograph of it as well as its class (whether it is counterfeit or not). Let's load it into a table and take a look.

banknotes = Table.read_table('banknote.csv')

banknotes


WaveletVar  WaveletSkew  WaveletCurt  Entropy   Class
3.6216      8.6661       -2.8073      -0.44699  0
4.5459      8.1674       -2.4586      -1.4621   0
3.866       -2.6383      1.9242       0.10645   0
3.4566      9.5228       -4.0112      -3.5944   0
0.32924     -4.4552      4.5718       -0.9888   0
4.3684      9.6718       -3.9606      -3.1625   0
3.5912      3.0129       0.72888      0.56421   0
2.0922      -6.81        8.4636       -0.60216  0
3.2032      5.7588       -0.75345     -0.61251  0
1.5356      9.1772       -2.2718      -0.73535  0

... (1362 rows omitted)

Let's look at whether the first two numbers tell us anything about whether the banknote is counterfeit or not. Here's a scatterplot:

banknotes.scatter('WaveletVar', 'WaveletCurt', c=banknotes.column('Class'))

Pretty interesting! Those two measurements do seem helpful for predicting whether the banknote is counterfeit or not. However, in this example you can now see that there is some overlap between the blue cluster and the red cluster. This indicates that there will be some images where it's hard to tell whether the banknote is legitimate based on just these two numbers. Still, you could use a $k$-nearest neighbor classifier to predict the legitimacy of a banknote.

Take a minute and think it through: Suppose we used a $k$-nearest neighbor classifier (for some small $k$, say). What parts of the plot would the classifier get right, and what parts would it make errors on? What would the decision boundary look like?

The patterns that show up in the data can get pretty wild. For instance, here's what we'd get if we used a different pair of measurements from the images:

banknotes.scatter('WaveletSkew', 'Entropy', c=banknotes.column('Class'))

There does seem to be a pattern, but it's a pretty complex one. Nonetheless, the $k$-nearest neighbors classifier can still be used and will effectively "discover" patterns out of this. This illustrates how powerful machine learning can be: it can effectively take advantage of even patterns that we would not have anticipated, or that we would not have thought to "program into" the computer.

Multiple attributes

So far I've been assuming that we have exactly 2 attributes that we can use to help us make our prediction. What if we have more than 2? For instance, what if we have 3 attributes?


Here's the cool part: you can use the same ideas for this case, too. All you have to do is make a 3-dimensional scatterplot, instead of a 2-dimensional plot. You can still use the $k$-nearest neighbors classifier, but now computing distances in 3 dimensions instead of just 2. It just works. Very cool!

In fact, there's nothing special about 2 or 3. If you have 4 attributes, you can use the $k$-nearest neighbors classifier in 4 dimensions. 5 attributes? Work in 5-dimensional space. And no need to stop there! This all works for arbitrarily many attributes; you just work in a very high dimensional space. It gets essentially impossible to visualize, but that's OK. The computer algorithm generalizes very nicely: all you need is the ability to compute the distance, and that's not hard. Mind-blowing stuff!

For instance, let's see what happens if we try to predict whether a banknote is counterfeit or not using 3 of the measurements, instead of just 2. Here's what you get:

ax = plt.figure(figsize=(8, 8)).add_subplot(111, projection='3d')
ax.scatter(banknotes.column('WaveletSkew'),
           banknotes.column('WaveletVar'),
           banknotes.column('WaveletCurt'),
           c=banknotes.column('Class'))

<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x1093bdef0>


Awesome! With just 2 attributes, there was some overlap between the two clusters (which means that the classifier was bound to make some mistakes for points in the overlap). But when we use these 3 attributes, the two clusters have almost no overlap. In other words, a classifier that uses these 3 attributes will be more accurate than one that only uses the 2 attributes.

This is a general phenomenon in classification. Each attribute can potentially give you new information, so more attributes sometimes help you build a better classifier. Of course, the cost is that now we have to gather more information to measure the value of each attribute, but this cost may be well worth it if it significantly improves the accuracy of our classifier.

To sum up: you now know how to use $k$-nearest neighbor classification to predict the answer to a yes/no question, based on the values of some attributes, assuming you have a training set with examples where the correct prediction is known. The general roadmap is this:

1. identify some attributes that you think might help you predict the answer to the question;
2. gather a training set of examples where you know the values of the attributes as well as the correct prediction;
3. to make predictions in the future, measure the value of the attributes and then use $k$-nearest neighbor classification to predict the answer to the question.


Breast cancer diagnosis

Now I want to do a more extended example based on diagnosing breast cancer. I was inspired by Brittany Wenger, who won the Google national science fair three years ago as a 17-year-old high school student. Here's Brittany:

Brittany's science fair project was to build a classification algorithm to diagnose breast cancer. She won grand prize for building an algorithm whose accuracy was almost 99%.

Let's see how well we can do, with the ideas we've learned in this course.

So, let me tell you a little bit about the data set. Basically, if a woman has a lump in her breast, the doctors may want to take a biopsy to see if it is cancerous. There are several different procedures for doing that. Brittany focused on fine needle aspiration (FNA), because it is less invasive than the alternatives. The doctor gets a sample of the mass, puts it under a microscope, takes a picture, and a trained lab tech analyzes the picture to determine whether it is cancer or not. We get a picture like one of the following:


Unfortunately, distinguishing between benign vs malignant can be tricky. So, researchers have studied using machine learning to help with this task. The idea is that we'll ask the lab tech to analyze the image and compute various attributes: things like the typical size of a cell, how much variation there is among the cell sizes, and so on. Then, we'll try to use this information to predict (classify) whether the sample is malignant or not. We have a training set of past samples from women where the correct diagnosis is known, and we'll hope that our machine learning algorithm can use those to learn how to predict the diagnosis for future samples.

We end up with the following data set. For the "Class" column, 1 means malignant (cancer); 0 means benign (not cancer).

patients = Table.read_table('breast-cancer.csv').drop('ID')

patients


ClumpThickness  UniformityofCellSize  UniformityofCellShape  MarginalAdhesion  SingleEpithelialCellSize  BareNuclei  BlandChromatin
5               1                     1                      1                 2                         1           3
5               4                     4                      5                 7                         10          3
3               1                     1                      1                 2                         2           3
6               8                     8                      1                 3                         4           3
4               1                     1                      3                 2                         1           3
8               10                    10                     8                 7                         10          9
1               1                     1                      1                 2                         10          3
2               1                     2                      1                 2                         1           3
2               1                     1                      1                 2                         1           1
4               2                     1                      1                 2                         1           2

... (673 rows omitted)

So we have 9 different attributes. I don't know how to make a 9-dimensional scatterplot of all of them, so I'm going to pick two and plot them:

patients.scatter('BlandChromatin', 'SingleEpithelialCellSize', c=patients.column('Class'))


Oops. That plot is utterly misleading, because there are a bunch of points that have identical values for both the x- and y-coordinates. To make it easier to see all the data points, I'm going to add a little bit of random jitter to the x- and y-values. Here's how that looks:

def randomize_column(a):
    return a + np.random.normal(0.0, 0.09, size=len(a))

Table().with_columns([
    'BlandChromatin(jittered)',
    randomize_column(patients.column('BlandChromatin')),
    'SingleEpithelialCellSize(jittered)',
    randomize_column(patients.column('SingleEpithelialCellSize'))
]).scatter(0, 1, c=patients.column('Class'))

For instance, you can see there are lots of samples with chromatin = 2 and epithelial cell size = 2; all non-cancerous.

Keep in mind that the jittering is just for visualization purposes, to make it easier to get a feeling for the data. When we want to work with the data, we'll use the original (unjittered) data.

Applying the k-nearest neighbor classifier to breast cancer diagnosis

We've got a data set. Let's try out the $k$-nearest neighbor classifier and see how it does. This is going to be great.


We're going to need an implementation of the $k$-nearest neighbor classifier. In practice you would probably use an existing library, but it's simple enough that I'm going to implement it myself.

The first thing we need is a way to compute the distance between two points. How do we do this? In 2-dimensional space, it's pretty easy. If we have a point at coordinates $(x_0, y_0)$ and another at $(x_1, y_1)$, the distance between them is

$$D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2}.$$

(Where did this come from? It comes from the Pythagorean theorem: we have a right triangle with side lengths $x_0 - x_1$ and $y_0 - y_1$, and we want to find the length of the diagonal.)

In 3-dimensional space, the formula is

$$D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2 + (z_0 - z_1)^2}.$$

In $d$-dimensional space, things are a bit harder to visualize, but I think you can see how the formula generalizes: we sum up the squares of the differences between each individual coordinate, and then take the square root of that. Let's implement a function to compute this distance for us:

def distance(pt1, pt2):
    total = 0
    for i in np.arange(len(pt1)):
        total = total + (pt1.item(i) - pt2.item(i)) ** 2
    return math.sqrt(total)

Next, we're going to write some code to implement the classifier. The input is a patient p who we want to diagnose. The classifier works by finding the $k$ nearest neighbors of p from the training set. So, our approach will go like this:

1. Find the closest $k$ neighbors of p, i.e., the $k$ patients from the training set that are most similar to p.

2. Look at the diagnoses of those $k$ neighbors, and take the majority vote to find the most common diagnosis. Use that as our predicted diagnosis for p.

So that will guide the structure of our Python code.


To implement the first step, we will compute the distance from each patient in the training set to p, sort them by distance, and take the $k$ closest patients in the training set. The code will make a copy of the table, compute the distance from each patient to p, add a new column to the table with those distances, and then sort the table by distance and take the first $k$ rows. That leads to the following Python code:

def closest(training, p, k):
    ...

def majority(topkclasses):
    ...

def classify(training, p, k):
    kclosest = closest(training, p, k)
    topkclasses = kclosest.select('Class')
    return majority(topkclasses)


def computetablewithdists(training, p):
    dists = np.zeros(training.num_rows)
    attributes = training.drop('Class')
    for i in np.arange(training.num_rows):
        dists[i] = distance(attributes.row(i), p)
    return training.with_column('Distance', dists)

def closest(training, p, k):
    withdists = computetablewithdists(training, p)
    sortedbydist = withdists.sort('Distance')
    topk = sortedbydist.take(np.arange(k))
    return topk

def majority(topkclasses):
    if topkclasses.where('Class', 1).num_rows > topkclasses.where('Class', 0).num_rows:
        return 1
    else:
        return 0

def classify(training, p, k):
    closestk = closest(training, p, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)

Let's see how this works, with our data set. We'll take patient 12 and imagine we're going to try to diagnose them:

patients.take(12)

ClumpThickness  UniformityofCellSize  UniformityofCellShape  MarginalAdhesion  SingleEpithelialCellSize  BareNuclei  BlandChromatin
5               3                     3                      3                 2                         3           4

We can pull out just their attributes (excluding the class), like this:

patients.drop('Class').row(12)


Row(ClumpThickness=5, UniformityofCellSize=3, UniformityofCellShape=3, MarginalAdhesion=3, SingleEpithelialCellSize=2, BareNuclei=3, BlandChromatin=4, NormalNucleoli=4, Mitoses=1)

Let's take $k = 5$. We can find the 5 nearest neighbors:

closest(patients, patients.drop('Class').row(12), 5)

ClumpThickness  UniformityofCellSize  UniformityofCellShape  MarginalAdhesion  SingleEpithelialCellSize  BareNuclei  BlandChromatin
5               3                     3                      3                 2                         3           4
5               3                     3                      4                 2                         4           3
5               1                     3                      3                 2                         2           2
5               2                     2                      2                 2                         2           3
5               3                     3                      1                 3                         3           3

3 out of the 5 nearest neighbors have class 1, so the majority is 1 (has cancer) -- and that is the output of our classifier for this patient:

classify(patients, patients.drop('Class').row(12), 5)

1

Awesome! We now have a classification algorithm for diagnosing whether a patient has breast cancer or not, based on the measurements from the lab. Are we done? Shall we give this to doctors to use?

Hold on: we're not done yet. There's an obvious question to answer, before we start using this in practice:

How accurate is this method, at diagnosing breast cancer?

And that raises a more fundamental issue. How can we measure the accuracy of a classification algorithm?

Measuring accuracy of a classifier

We've got a classifier, and we'd like to determine how accurate it will be. How can we measure that?


Try it out. One natural idea is to just try it on patients for a year, keep records on it, and see how accurate it is. However, this has some disadvantages: (a) we're trying something on patients without knowing how accurate it is, which might be unethical; (b) we have to wait a year to find out whether our classifier is any good. If it's not good enough and we get an idea for an improvement, we'll have to wait another year to find out whether our improvement was better.

Get some more data. We could try to get some more data from other patients whose diagnosis is known, and measure how accurate our classifier's predictions are on those additional patients. We can compare what the classifier outputs against what we know to be true.

Use the data we already have. Another natural idea is to re-use the data we already have: we have a training set that we used to train our classifier, so we could just run our classifier on every patient in the data set and compare what it outputs to what we know to be true. This is sometimes known as testing the classifier on your training set.

How should we choose among these options? Are they all equally good?

It turns out that the third option, testing the classifier on our training set, is fundamentally flawed. It might sound attractive, but it gives misleading results: it will over-estimate the accuracy of the classifier (it will make us think the classifier is more accurate than it really is). Intuitively, the problem is that what we really want to know is how well the classifier has done at "generalizing" beyond the specific examples in the training set; but if we test it on patients from the training set, then we haven't learned anything about how well it would generalize to other patients.

This is subtle, so it might be helpful to try an example. Let's try a thought experiment. Let's focus on the 1-nearest neighbor classifier ($k = 1$). Suppose you trained the 1-nearest neighbor classifier on data from all 683 patients in the data set, and then you tested it on those same 683 patients. How many would it get right? Think it through and see if you can work out what will happen. That's right! The classifier will get the right answer for all 683 patients. Suppose we apply the classifier to a patient from the training set, say Alice. The classifier will look for the nearest neighbor (the most similar patient from the training set), and the nearest neighbor will turn out to be Alice herself (the distance from any point to itself is zero). Thus, the classifier will produce the right diagnosis for Alice. The same reasoning applies to every other patient in the training set.

So, if we test the 1-nearest neighbor classifier on the training set, the accuracy will always be 100%: absolutely perfect. This is true no matter whether there are actually any patterns in the data. But the 100% is a total lie. When you apply the classifier to other patients who were not in the training set, the accuracy could be far worse.
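This thought experiment is easy to confirm by simulation. In the sketch below the labels are pure coin flips, so no classifier can genuinely predict them, yet 1-nearest neighbor still scores 100% when tested on its own training set:

```python
import numpy as np

# Random points with coin-flip labels: there is no real pattern to learn.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 2))
labels = rng.integers(0, 2, size=50)

def one_nn_predict(p):
    """1-nearest-neighbor prediction against the training set above."""
    dists = ((pts - p) ** 2).sum(axis=1)
    return labels[np.argmin(dists)]

# "Test" on the training set: each point's nearest neighbor is itself
# (distance zero), so every prediction is right and the 100% is meaningless.
accuracy = np.mean([one_nn_predict(pts[i]) == labels[i] for i in range(50)])
print(accuracy)  # → 1.0
```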


In other words, testing on the training set tells you nothing about how accurate the 1-nearest neighbor classifier will be. This illustrates why testing on the training set is so flawed. This flaw is pretty blatant when you use the 1-nearest neighbor classifier, but don't think that with some other classifier you'd be immune to this problem -- the problem is fundamental and applies no matter what classifier you use. Testing on the training set gives you a biased estimate of the classifier's accuracy. For these reasons, you should never test on the training set.

So what should you do, instead? Is there a more principled approach?

It turns out there is. The approach comes down to: get more data. More specifically, the right solution is to use one data set for training, and a different data set for testing, with no overlap between the two data sets. We call these a training set and a test set.

Where do we get these two data sets from? Typically, we'll start out with some data, e.g., the data set on 683 patients, and before we do anything else with it, we'll split it up into a training set and a test set. We might put 50% of the data into the training set and the other 50% into the test set. Basically, we are setting aside some data for later use, so we can use it to measure the accuracy of our classifier. Sometimes people will call the data that you set aside for testing a hold-out set, and they'll call this strategy for estimating accuracy the hold-out method.

Note that this approach requires great discipline. Before you start applying machine learning methods, you have to take some of your data and set it aside for testing. You must avoid using the test set for developing your classifier: you shouldn't use it to help train your classifier or tweak its settings or for brainstorming ways to improve your classifier. Instead, you should use it only once, at the very end, after you've finalized your classifier, when you want an unbiased estimate of its accuracy.

The effectiveness of our classifier, for breast cancer

OK, so let's apply the hold-out method to evaluate the effectiveness of the $k$-nearest neighbor classifier for breast cancer diagnosis. The data set has 683 patients, so we'll randomly permute the data set and put 342 of them in the training set and the remaining 341 in the test set.

patients = patients.sample(683)  # Randomly permute the rows
trainset = patients.take(range(342))
testset = patients.take(range(342, 683))


We'll train the classifier using the 342 patients in the training set, and evaluate how well it performs on the test set. To make our lives easier, we'll write a function to evaluate a classifier on every patient in the test set:

def evaluate_accuracy(training, test, k):
    testattrs = test.drop('Class')
    numcorrect = 0
    for i in range(test.num_rows):
        # Run the classifier on the ith patient in the test set
        c = classify(training, testattrs.rows[i], k)
        # Was the classifier's prediction correct?
        if c == test.column('Class').item(i):
            numcorrect = numcorrect + 1
    return numcorrect / test.num_rows

Now for the grand reveal -- let's see how we did. We'll arbitrarily use $k = 5$.

evaluate_accuracy(trainset, testset, 5)

0.967741935483871

About 97% accuracy. Not bad! Pretty darn good for such a simple technique.

As a footnote, you might have noticed that Brittany Wenger did even better. What techniques did she use? One key innovation is that she incorporated a confidence score into her results: her algorithm had a way to determine when it was not able to make a confident prediction, and for those patients, it didn't even try to predict their diagnosis. Her algorithm was 99% accurate on the patients where it made a prediction -- so that extension seemed to help quite a bit.

Important takeaways

Here are a few lessons we want you to learn from this.

First, machine learning is powerful. If you had to try to write code to make a diagnosis without knowing about machine learning, you might spend a lot of time by trial-and-error trying to come up with some complicated set of rules that seem to work, and the result might not be very accurate. The $k$-nearest neighbors algorithm automates the entire task for you. And machine learning often lets you make predictions far more accurately than anything you'd come up with by trial-and-error.


Second, you can do it. Yes, you. You can use machine learning in your own work to make predictions based on data. You now know enough to start applying these ideas to new data sets and help others make useful predictions. The techniques are very powerful, but you don't have to have a Ph.D. in statistics to use them.

Third, be careful about how to evaluate accuracy. Use a hold-out set.

There's lots more one can say about machine learning: how to choose attributes, how to choose $k$ or other parameters, what other classification methods are available, how to solve more complex prediction tasks, and lots more. In this course, we've barely even scratched the surface. If you enjoyed this material, you might enjoy continuing your studies in statistics and computer science; courses like Stats 132 and 154 and CS 188 and 189 go into a lot more depth.


Features

Section author: David Wagner


Building a model

So far, we have talked about prediction, where the purpose of learning is to be able to predict the class of new instances. I'm now going to switch to model building, where the goal is to learn a model of how the class depends upon the attributes.

One place where model building is useful is for science: e.g., which genes influence whether you become diabetic? This is interesting and useful in its own right (apart from any applications to predicting whether a particular individual will become diabetic), because it can potentially help us understand the workings of our body.

Another place where model building is useful is for control: e.g., what should I change about my advertisement to get more people to click on it? How should I change the profile picture I use on an online dating site, to get more people to "swipe right"? Which attributes make the biggest difference to whether people click/swipe? Our goal is to determine which attributes to change, to have the biggest possible effect on something we care about.

We already know how to build a classifier, given a training set. Let's see how to use that as a building block to help us solve these problems.

How do we figure out which attributes have the biggest influence on the output? Take a moment and see what you can come up with.

Feature selection

Background: attributes are also called features, in the machine learning literature.

Our goal is to find a subset of features that are most relevant to the output. The way we'll formalize this is to identify a subset of features that, when we train a classifier using just those features, gives the highest possible accuracy at prediction.

Intuitively, if we get 90% accuracy using all the features and 88% accuracy using just three of the features (for example), then it stands to reason that those three features are probably the most relevant, and they capture most of the information that affects or determines the output.

With this insight, our problem becomes:

Find the subset of features that gives the best possible accuracy (when we use only those features for prediction).

This is a feature selection problem. There are many possible approaches to feature selection. One simple one is to try all possible ways of choosing a subset of the features, and evaluate the accuracy of each. However, this can be very slow, because there are so many ways to choose a subset of the features.
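To get a feel for how fast "try all subsets" blows up, we can count the subsets with Python's math.comb. The 14 here matches the number of candidate attributes in the tree-cover example later in this section; any other number shows the same combinatorial growth.

```python
import math

# Number of ways to choose a subset of a given size from 14 features.
num_features = 14
for size in range(1, 5):
    print(size, math.comb(num_features, size))
# 1 14
# 2 91
# 3 364
# 4 1001
```

Each of those subsets would require training and evaluating a separate classifier, which is why a cheaper procedure is attractive.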

Therefore, we'll consider a more efficient procedure that often works reasonably well in practice. It is known as greedy feature selection. Here's how it works.

1. Suppose there are $d$ features. Try each on its own, to see how much accuracy we can get using a classifier trained with just that one feature. Keep the best feature.

2. Now we have one feature. Try each of the remaining $d-1$ features, to see which is the best one to add to it (i.e., we are now training a classifier with just 2 features: the best feature picked in step 1, plus one more). Keep the one that best improves accuracy. Now we have 2 features.

3. Repeat. At each stage, we try all possibilities for how to add one more feature to the feature subset we've already picked, and we keep the one that best improves accuracy.

Let's implement it and try it on some examples!

Code for k-NN

First, some code from last time, to implement $k$-nearest neighbors.

def distance(pt1, pt2):
    tot = 0
    for i in range(len(pt1)):
        tot = tot + (pt1[i] - pt2[i]) ** 2
    return math.sqrt(tot)


def computetablewithdists(training, p):
    dists = np.zeros(training.num_rows)
    attributes = training.drop('Class').rows
    for i in range(training.num_rows):
        dists[i] = distance(attributes[i], p)
    withdists = training.copy()
    withdists.append_column('Distance', dists)
    return withdists

def closest(training, p, k):
    withdists = computetablewithdists(training, p)
    sortedbydist = withdists.sort('Distance')
    topk = sortedbydist.take(range(k))
    return topk

def majority(topkclasses):
    if topkclasses.where('Class', 1).num_rows > topkclasses.where('Class', 0).num_rows:
        return 1
    else:
        return 0

def classify(training, p, k):
    closestk = closest(training, p, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)

def evaluate_accuracy(training, valid, k):
    validattrs = valid.drop('Class')
    numcorrect = 0
    for i in range(valid.num_rows):
        # Run the classifier on the ith row of the validation set
        c = classify(training, validattrs.rows[i], k)
        # Was the classifier's prediction correct?
        if c == valid['Class'][i]:
            numcorrect = numcorrect + 1
    return numcorrect / valid.num_rows


Code for feature selection

Now we'll implement the feature selection algorithm. First, a subroutine to evaluate the accuracy when using a particular subset of features:

def evaluate_features(training, valid, features, k):
    tr = training.select(['Class'] + features)
    va = valid.select(['Class'] + features)
    return evaluate_accuracy(tr, va, k)

Next, we'll implement a subroutine that, given a current subset of features, tries all possible ways to add one more feature to the subset, and evaluates the accuracy of each candidate. This returns a table that summarizes the accuracy of each option it examined.

def try_one_more_feature(training, valid, baseattrs, k):
    results = Table(['Attribute', 'Accuracy'])
    for attr in training.drop(['Class'] + baseattrs).labels:
        acc = evaluate_features(training, valid, [attr] + baseattrs, k)
        results.append((attr, acc))
    return results.sort('Accuracy', descending=True)

Finally, we'll implement the greedy feature selection algorithm, using the above subroutines. For our own purposes of understanding what's going on, I'm going to have it print out, at each iteration, all features it considered and the accuracy it got with each.


def select_features(training, valid, k, maxfeatures=3):
    results = Table(['NumAttrs', 'Attributes', 'Accuracy'])
    curattrs = []
    iters = min(maxfeatures, len(training.labels) - 1)
    while len(curattrs) < iters:
        print('== Computing best feature to add to ' + str(curattrs))
        # Try all ways of adding just one more feature to curattrs
        r = try_one_more_feature(training, valid, curattrs, k)
        r.show()
        print()
        # Take the single best feature and add it to curattrs
        attr = r['Attribute'][0]
        acc = r['Accuracy'][0]
        curattrs.append(attr)
        results.append((len(curattrs), ','.join(curattrs), acc))
    return results

Example: Tree Cover

Now let's try it out on an example. I'm working with a data set gathered by the US Forestry Service. They visited thousands of wilderness locations and recorded various characteristics of the soil and land. They also recorded what kind of tree was growing predominantly on that land. Focusing only on areas where the tree cover was either Spruce or Lodgepole Pine, let's see if we can figure out which characteristics have the greatest effect on whether the predominant tree cover is Spruce or Lodgepole Pine.

There are 500,000 records in this data set -- more than I can analyze with the software we're using. So, I'll pick a random sample of just a fraction of these records, to let us do some experiments that will complete in a reasonable amount of time.

all_trees = Table.read_table('treecover2.csv.gz', sep=',')
training = all_trees.take(range(0, 1000))
validation = all_trees.take(range(1000, 1500))
test = all_trees.take(range(1500, 2000))
training.show(2)


Elevation  Aspect  Slope  HorizDistToWater  VertDistToWater  HorizDistToRoad
2804       139     9      268               65               3180
2785       155     18     242               118              3090

... (998 rows omitted)

Let's start by figuring out how accurate a classifier will be, if trained using this data. I'm going to arbitrarily use $k = 15$ for the $k$-nearest neighbor classifier.

evaluate_accuracy(training, validation, 15)

0.572

Now we'll apply feature selection. I wonder which characteristics have the biggest influence on whether Spruce vs Lodgepole Pine grows? We'll look for the best three features.

best_features = select_features(training, validation, 15)

== Computing best feature to add to []


Attribute         Accuracy
Elevation         0.762
HorizDistToRoad   0.562
HorizDistToFire   0.534
Hillshade9am      0.518
Hillshade3pm      0.516
Slope             0.516
Area4             0.514
Area3             0.514
Area2             0.514
Area1             0.514
VertDistToWater   0.512
HillshadeNoon     0.51
Aspect            0.51
HorizDistToWater  0.502

== Computing best feature to add to ['Elevation']

Attribute         Accuracy
HorizDistToWater  0.782
Hillshade9am      0.776
Slope             0.77
Hillshade3pm      0.768
Area4             0.762
Area3             0.762
Area2             0.762
Area1             0.762
VertDistToWater   0.762
Aspect            0.75
HillshadeNoon     0.748
HorizDistToRoad   0.73
HorizDistToFire   0.664


== Computing best feature to add to ['Elevation', 'HorizDistToWater']

Attribute         Accuracy
Hillshade3pm      0.792
HillshadeNoon     0.792
Slope             0.784
Aspect            0.784
Area4             0.782
Area3             0.782
Area2             0.782
Area1             0.782
Hillshade9am      0.78
VertDistToWater   0.776
HorizDistToRoad   0.708
HorizDistToFire   0.64

best_features

NumAttrs  Attributes                               Accuracy
1         Elevation                                0.762
2         Elevation,HorizDistToWater               0.782
3         Elevation,HorizDistToWater,Hillshade3pm  0.792

As we can see, Elevation looks like far and away the most discriminative feature. This suggests that this characteristic might play a large role in the biology of which tree grows best, and thus might tell us something about the science.

Whataboutthehorizontaldistancetowaterandtheamountofshadeat3pm?Itlooksliketheymightalsobepredictive,andthusmightplayaroleinthebiologyaswell--assumingtheapparentimprovementinaccuracyisreal,andwe'renotjustfoolingourselves.
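The select_features helper isn't shown in this excerpt, but the output above suggests greedy forward selection: repeatedly add whichever remaining feature most improves validation accuracy. A minimal sketch, where the score function is a made-up stand-in for "train a classifier on these features and measure validation accuracy":

```python
def forward_select(all_features, score, max_features=4):
    """Greedy forward selection: grow the feature set one feature at a
    time, always adding the candidate with the highest score."""
    chosen, results = [], []
    for _ in range(max_features):
        candidates = [f for f in all_features if f not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda f: score(chosen + [f]))
        chosen.append(best)
        results.append((list(chosen), score(chosen)))
    return results

# Toy stand-in for validation accuracy: each feature adds a fixed gain.
gains = {'Elevation': 0.25, 'HorizDistToWater': 0.03, 'Slope': 0.01}
score = lambda features: 0.5 + sum(gains[f] for f in features)

forward_select(list(gains), score, max_features=2)
```

With a real scorer plugged in, this reproduces the pattern in the tables above: Elevation is chosen first, then HorizDistToWater, and so on.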

Hold-out sets: Training, Validation, Testing


Suppose we built a predictor using just the best two features, Elevation and HorizDistToWater. How accurate would we expect it to be on the entire population of locations? 78.2% accurate? More? Less? Why?

Well, at this point, it's hard to tell. It's the same issue we mentioned earlier about fooling ourselves. We've tried multiple different approaches and taken the best; if we then evaluate it on the same dataset we used to select which is best, we will get a biased number -- it might not be an accurate estimate of the true accuracy.

Why? Our dataset is noisy. We've looked for correlations, and kept the association that had the highest correlation in our training set. But was that a real relationship, or was it just noise? If we pick a single attribute and measure its accuracy on a sample, we'd expect this to be a reasonable approximation to its accuracy on the entire population, but with some random error. If we look at 100 possible combinations and choose the best, it's possible we found one whose accuracy on the entire population is indeed large -- or it's possible we were just selecting the one whose error term happened to be the largest of the 100.

For these reasons, we can't expect the accuracy numbers we've computed above to necessarily be a good, unbiased measure of the accuracy on the entire population.

The way to get an unbiased estimate of accuracy is the same as last lecture: get some more data, or set some aside in the beginning so we have more when we need it. In this case, I set aside two extra chunks of data, a validation dataset and a test dataset. I used the validation set to select a few best features. Now we're going to measure the performance on the test set, just to see what happens.
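The mechanics of setting data aside can be as simple as shuffling the row indices once and slicing. The split sizes below are invented for illustration; the text doesn't say how large each chunk was.

```python
import numpy as np

# Shuffle 1000 hypothetical row indices once, then carve off three
# disjoint chunks: train for fitting, validation for choosing among
# features, and test for a final unbiased measurement.
rng = np.random.default_rng(0)
indices = rng.permutation(1000)

train_idx = indices[:600]
validation_idx = indices[600:800]
test_idx = indices[800:]
```

Because the test indices are never consulted while choosing features, the accuracy measured on them is not biased by the selection process.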

evaluate_features(training,validation,['Elevation'],15)

0.762

evaluate_features(training,test,['Elevation'],15)

0.712

evaluate_features(training, validation, ['Elevation', 'HorizDistToWater'], 15)

0.782


evaluate_features(training, test, ['Elevation', 'HorizDistToWater'], 15)

0.742

Why do you think we see this difference?

To get a better feeling for this, we can look at each of our different features and compare its accuracy on the validation set vs. its accuracy on another sample, to see how much random error there is in these accuracy figures. If there was no error, then both accuracy numbers would always be the same, but as we'll see, in practice there is some error, because computing the accuracy on a sample is only an approximation to the accuracy on the entire population.

accuracies = Table(['Validation Accuracy', 'Accuracy on another sample'])
another_sample = all_trees.take(range(2000, 2500))

for feature in training.drop('Class').labels:
    x = evaluate_features(training, validation, [feature], 15)
    y = evaluate_features(training, another_sample, [feature], 15)
    accuracies.append((x, y))

accuracies.sort('Validation Accuracy')

Validation Accuracy    Accuracy on another sample

0.502 0.346

0.51 0.384

0.51 0.35

0.512 0.358

0.514 0.342

0.514 0.342

0.514 0.342

0.514 0.342

0.516 0.366

0.516 0.352

... (4 rows omitted)


accuracies.scatter('Validation Accuracy')


Thought Questions

Suppose that the top two attributes had been Elevation and HorizDistToRoad. Interpret this for me. What might this mean for the biology of trees? One possible explanation is that the distance to the nearest road affects what kind of tree grows; can you give any other possible explanations?

Once we know the top two attributes are Elevation and HorizDistToWater, suppose we next wanted to know how they affect what kind of tree grows: e.g., does high elevation tend to favor spruce, or does it favor lodgepole pine? How would you go about answering these kinds of questions?

The scientists also gathered some more data that I left out, for simplicity: for each location, they also gathered what kind of soil it has, out of 40 different types. The original dataset had a column for soil type, with numbers from 1-40 indicating which of the 40 types of soil was present. Suppose I wanted to include this among the other characteristics. What would go wrong, and how could I fix it up?


For this example we picked k = 15 arbitrarily. Suppose we wanted to pick the best value of k -- the one that gives the best accuracy. How could we go about doing that? What are the pitfalls, and how could they be addressed?

Suppose I wanted to use feature selection to help me adjust my online dating profile picture to get the most responses. There are some characteristics I can't change (such as how handsome I am), and some I can (such as whether I smile or not). How would I adjust the feature selection algorithm above to account for this?


Chapter 5

Inference


Confidence Intervals

Whenever we analyze a random sample of a population, our goal is to draw robust conclusions about the whole population, despite the fact that we can measure only part of it. We must guard against being misled by the inevitable random variations in a sample. It is possible that a sample has very different characteristics than the population itself, but such samples are unlikely. It is much more typical that a random sample, especially a large simple random sample, is representative of the population. More importantly, the chance that a sample resembles the population is something that we know in advance, based on how we selected our random sample. Statistical inference is the practice of using this information to make precise statements about a population based on a random sample.

Statistics and Parameters

The Bay Area Bike Share service published a dataset describing every bicycle rental from September 2014 to August 2015 in their system. This set of all trips is a population, and fortunately we have the data for each and every trip rather than just a sample. We'll focus only on the free trips, which are trips that last less than 1800 seconds (half an hour).

free_trips = Table.read_table("trip.csv").where('Duration', are.below(1800))

free_trips


Start Station | End Station | Duration
Harry Bridges Plaza (Ferry Building) | San Francisco Caltrain (Townsend at 4th) | 765
San Antonio Shopping Center | Mountain View City Hall | 1036
Post at Kearny | 2nd at South Park | 307
San Jose City Hall | San Salvador at 1st | 409
Embarcadero at Folsom | Embarcadero at Sansome | 789
Yerba Buena Center of the Arts (3rd @ Howard) | San Francisco Caltrain (Townsend at 4th) | 293
Embarcadero at Folsom | Embarcadero at Sansome | 896
Embarcadero at Sansome | Steuart at Market | 255
Beale at Market | Temporary Transbay Terminal (Howard at Beale) | 126
Post at Kearny | South Van Ness at Market | 932
... (338333 rows omitted)

A quantity measured for a whole population is called a parameter. For example, the average duration of a free trip between September 2014 and August 2015 can be computed exactly from these data: 550.00.

np.average(free_trips.column('Duration'))

550.00143345658103

Although we have information for the full population, we will investigate what we can learn from a simple random sample. Below, we sample 100 trips without replacement from free_trips not just once, but 40 different times. All 4000 trips are stored in a table called samples.

n = 100
samples = Table(['Sample #', 'Duration'])

for i in np.arange(40):
    for trip in free_trips.sample(n).rows:
        samples.append([i, trip.item('Duration')])

samples


Sample #  Duration

0 557

0 491

0 877

0 279

0 339

0 542

0 715

0 255

0 1543

0 194

... (3990 rows omitted)

A quantity measured for a sample is called a statistic. For example, the average duration of a free trip in a sample is a statistic, and we find that there is some variation across samples.

for i in np.arange(3):
    sample_average = np.average(samples.where('Sample #', i).column('Duration'))
    print("Average duration in sample", i, "is", sample_average)

Average duration in sample 0 is 547.29

Average duration in sample 1 is 500.6

Average duration in sample 2 is 660.16

Often, the goal of analysis is to estimate parameters from statistics. Proper statistical inference involves more than just picking an estimate: we should also say something about how confident we should be in that estimate. A single guess is rarely sufficient to make a robust conclusion about a population. We also need to quantify in some way how precise we believe that guess to be.

Already we can see from the example above that the average duration of a sample is around 550, the parameter. Our goal is to quantify this notion of around, and one way to do so is to state a range of values.

Error in Estimates


We can learn a great deal about the behavior of a statistic by computing it for many samples. In the scatter diagram below, the order in which the statistic was computed appears on the horizontal axis, and the value of the average duration for a sample of n trips appears on the vertical axis.

samples.group(0,np.average).scatter(0,1,s=50)

The sample number doesn't tell us anything beyond when the sample was selected, and we can see from this unstructured cloud of points that one sample does not affect the next in any consistent way. The sample averages are indeed all around 550. Some are above, and some are below.

samples.group(0,np.average).hist(1,bins=np.arange(480,641,10))


There are 40 averages for 40 samples. If we find an interval that contains the middle 38 out of 40, it will contain 95% of the samples. One such interval is bounded by the second smallest and the second largest sample averages among the 40 samples.

averages = np.sort(samples.group(0, np.average).column(1))
lower = averages.item(1)
upper = averages.item(38)

print('Empirically, 95% of sample averages fell within the interval', lower, 'to', upper)

Empirically, 95% of sample averages fell within the interval 502.38 to 620.04

This statement gives us our first notion of a margin of error. If this statement were to hold up not just for our 40 samples, but for all samples, then a reasonable strategy for estimating the true population average would be to draw a 100-trip sample, compute its sample average, and guess that the true average was within max_deviation of this quantity. We would be right at least 95% of the time.

The problem with this approach is that finding the max_deviation required drawing many different samples, and we would like to be able to say something similar after having collected only one 100-trip sample.

Variability Within Samples

By contrast, the individual durations in the samples themselves are not closely clustered around 550. Instead, they span the whole range from 0 to 1800. The durations for the first three 100-item samples are visualized below. All have somewhat similar shapes; they were drawn from the same population.

for i in np.arange(3):
    samples.where('Sample #', i).hist(1, bins=np.arange(0, 1801, 200))


We have now observed two different kinds of variation in a random sample: the statistics computed from those samples vary in magnitude, and the sampled quantities themselves vary in distribution. That is, there is variability across samples, which we see by comparing sample averages, and there is variability within each sample, which we see in the duration histograms above.

Percentiles

The 80th percentile of a collection of values is the smallest value in the collection that is at least as large as 80% of the values in the collection. Likewise, the 40th percentile is at least as large as 40% of the values.

For example, let's consider the sizes of the five largest continents: Africa, Antarctica, Asia, North America, and South America, rounded to the nearest million square miles.

sizes = [12, 17, 6, 9, 7]

The 20th percentile is the smallest value that is at least as large as 20% (one fifth) of the elements.

percentile(20,sizes)

6

The 60th percentile is at least as large as 60% (three fifths) of the elements.


percentile(60,sizes)

9

To find the element corresponding to a percentile p, sort the elements and take the kth element, where k = (p/100) × n and n is the size of the collection. In general, any percentile from 0 to 100 can be computed for a collection of values, and it is always an element of the collection. If k is not an integer, round it up to find the percentile element. For example, the 90th percentile of a five-element set is the fifth element, because (90/100) × 5 is 4.5, which rounds up to 5.

percentile(90,sizes)

17
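The rule above can be written out directly. This sketch reproduces the behavior of percentile on the examples in this section (the real percentile function comes from the course library; percentile_value is a hypothetical name):

```python
import math

def percentile_value(p, values):
    """The p-th percentile of a collection: sort the values, then take
    the k-th element, where k = (p/100) * n, rounded up."""
    ordered = sorted(values)
    k = math.ceil(p * len(ordered) / 100)
    return ordered[max(k, 1) - 1]

sizes = [12, 17, 6, 9, 7]
percentile_value(20, sizes)  # → 6
percentile_value(60, sizes)  # → 9
percentile_value(90, sizes)  # → 17
```

Because k is rounded up to a whole position, the result is always an element of the collection, never an interpolated value.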

When the percentile function is called with a single argument, it returns a function that computes percentiles of collections.

tenth=percentile(10)

tenth(sizes)

6

tenth(np.arange(50))

4

tenth(np.arange(5000))

499


Confidence Intervals

Inference begins with a population: a collection that you can describe but often cannot measure directly. One thing we can often do with a population is to draw a simple random sample. By measuring a statistic of the sample, we may hope to estimate a parameter of the population. In this process, the population is not random, nor are its parameters. The sample is random, and so is its statistic. The whole process of estimation therefore gives a random result. It starts from a population (fixed), draws a sample (random), computes a statistic (random), and then estimates the parameter from the statistic (random). By calling the process random, we mean that the final estimate could have been different because the sample could have been different.

A confidence interval is one notion of a margin of error around an estimate. The margin of error acknowledges that the estimation process could have been misleading because of the variation in the sample. It gives a range that might contain the original parameter, along with a percentage of confidence. For example, a 95% confidence interval is one in which we would expect the interval to contain the true parameter for 95% of random samples. For any particular sample, any particular estimate, and any particular interval, the parameter is either contained or not; we never know for sure. The purpose of the confidence interval is to quantify how often our estimation process gives us a range containing the truth.

There are many different ways to compute a confidence interval. The challenge in computing a confidence interval from a single sample is that we don't know the population, and so we don't know how its samples will vary. One way to make progress despite this limited information is to assume that the variability within the sample is similar to the variability within the population.

Resampling from a Sample

Even though we happen to have the whole population of free_trips, let's imagine for a moment that we have only a single 100-trip sample from this population. That sample has 100 durations. Our goal now is to learn what we can about the population using just this sample.

sample=free_trips.sample(n)

sample


Start Station | End Station | Duration
Grant Avenue at Columbus Avenue | San Francisco Caltrain 2 (330 Townsend) | 877
Embarcadero at Folsom | San Francisco Caltrain (Townsend at 4th) | 597
Beale at Market | Beale at Market | 67
2nd at Folsom | Washington at Kearny | 714
Clay at Battery | Embarcadero at Sansome | 510
South Van Ness at Market | Powell Street BART | 494
San Jose Diridon Caltrain Station | MLK Library | 530
Mechanics Plaza (Market at Battery) | Clay at Battery | 286
South Van Ness at Market | Civic Center BART (7th at Market) | 172
Broadway St at Battery St | 5th at Howard | 551
... (90 rows omitted)

average=np.average(sample.column('Duration'))

average

518.55999999999995

A technique called bootstrap resampling uses the variability of the sample as a proxy for the variability of the population. Instead of drawing many samples from the population, we will draw many samples from the sample. Since the original sample and the re-sampled sample have the same size, we must draw with replacement in order to see any variability at all.

resamples = Table(['Resample #', 'Duration'])

for i in np.arange(40):
    resample = sample.sample(n, with_replacement=True)
    for trip in resample.rows:
        resamples.append([i, trip.item('Duration')])

resamples


Resample #  Duration

0 329

0 252

0 252

0 685

0 434

0 396

0 434

0 585

0 394

0 202

... (3990 rows omitted)

Using the resamples, we can investigate the variability of the corresponding sample averages.

resamples.group(0,np.average).scatter(0,1,s=50)

These sample averages are not necessarily clustered around the true average 550. Instead, they will be clustered around the average of the sample from which they were re-sampled.


resamples.group(0,np.average).hist(1,bins=np.arange(480,641,10))

Although the center of this distribution is not the same as the true population average, its variability is similar.

Bootstrap Confidence Interval

A technique called bootstrap resampling estimates a confidence interval using resampling. To find a confidence interval, we must estimate how much the sample statistic deviates from the population parameter. The bootstrap assumes that a resampled statistic will deviate from the sample statistic in approximately the same way that the sample statistic deviates from the parameter.

To carry out the technique, we first compute the deviations from the sample statistic (in this case, the sample average) for many resampled samples. The number is arbitrary, and it doesn't require any collecting of new data; 1000 or more is typical.

deviations = Table(['Resample #', 'Deviation'])

for i in np.arange(1000):
    resample = sample.sample(n, with_replacement=True)
    dev = average - np.average(resample.column(2))
    deviations.append([i, dev])

deviations


Resample #  Deviation

0 30.95

1 41.16

2 3.37

3 2.54

4 -22.87

5 27.63

6 49.93

7 11.91

8 26.32

9 -5.97

... (990 rows omitted)

A deviation of zero indicates that the resampled statistic is the same as the sampled statistic. That's evidence that the sample statistic is perhaps the same as the population parameter. A positive deviation is evidence that the sample statistic may be greater than the parameter. As we can see, for averages, the distribution of deviations is roughly symmetric.

deviations.hist(1,bins=np.arange(-100,101,10))

A 95% confidence interval is computed based on the middle 95% of this distribution of deviations. The max and min deviations are given by the 97.5th percentile and 2.5th percentile respectively, because the span between these contains 95% of the distribution.

ComputationalandInferentialThinking

305ConfidenceIntervals

Page 306: Computational and inferential thinking

lower=average-percentile(97.5,deviations.column(1))

lower

461.93000000000001

upper=average-percentile(2.5,deviations.column(1))

upper

576.29999999999995

Now, we have not only an estimate, but an interval around it that expresses our uncertainty about the value of the parameter.

print('A 95% confidence interval around', average, 'is', lower, 'to', upper)

A 95% confidence interval around 518.56 is 461.93 to 576.3

And behold, the interval happens to contain the true parameter 550. Normally, after generating a confidence interval, there is no way to check that it is correct, because to do so requires access to the entire population. In this case, we started with the whole population in a single table, and so we are able to check.

The following function encapsulates the entire process of producing a confidence interval for the population average using a single sample. Each time it is run, even for the same sample, the interval will be slightly different because of the random variation of resampling. However, the upper and lower bounds generated by 1000 resamples for this problem are typically quite similar for multiple calls.


def average_ci(sample, label):
    deviations = Table(['Resample #', 'Deviation'])
    n = sample.num_rows
    average = np.average(sample.column(label))
    for i in np.arange(1000):
        resample = sample.sample(n, with_replacement=True)
        dev = np.average(resample.column(label)) - average
        deviations.append([i, dev])
    return (average - percentile(97.5, deviations.column(1)),
            average - percentile(2.5, deviations.column(1)))

average_ci(sample,'Duration')

(457.43999999999994,572.63999999999987)

average_ci(sample,'Duration')

(458.7399999999999,575.50999999999988)

However, if we draw an entirely new sample, then the confidence interval will change.

average_ci(free_trips.sample(n),'Duration')

(530.18000000000006,661.21000000000004)

The interval is about the same width, and that's typical of multiple samples of the same size drawn from the same population. However, if a much larger sample is drawn, then the confidence interval that results is typically going to be much smaller. We're not losing confidence in this case; we're just using more evidence to perform inference, and so we have less uncertainty about the full population.

average_ci(free_trips.sample(16*n),'Duration')

(527.18937499999993,555.11437499999988)


Evaluating Confidence Intervals

When given only a sample, it is not possible to check whether the parameter was estimated correctly. However, in the case of the Bay Area Bike Share, we have not only the sample, but many samples and the whole population. Therefore, we can check how often the confidence intervals we compute in this way do in fact contain the true parameter.

Before measuring the accuracy of our technique, we can visualize the result of resampling each sample. The following scatter diagram has 6 vertical lines for 6 samples, and each line consists of 100 points that represent the statistics from resamples for that sample.

samples = Table(['Sample #', 'Resample average', 'Sample Average'])

for i in np.arange(6):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    for j in np.arange(100):
        resample = sample.sample(n, with_replacement=True)
        resample_average = np.average(resample.column('Duration'))
        samples.append([i, resample_average, average])

samples.scatter(0, s=80)

We can see from the plot above that for almost every sample, there are some resampled averages above 550 and some below. When we use these resampled averages to compute confidence intervals, many of those intervals contain 550 as a result.


Below, we draw 100 more samples and show the upper and lower bounds of the confidence intervals computed from each one.

intervals = Table(['Sample #', 'Lower', 'Estimate', 'Upper'])

for i in np.arange(100):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    lower, upper = average_ci(sample, 'Duration')
    intervals.append([i, lower, average, upper])

We can compute precisely which intervals contain the truth, and we should find that about 95 of them do.

correct = np.logical_and(intervals.column('Lower') <= 550, intervals.column('Upper') >= 550)

np.count_nonzero(correct)

97

Each confidence interval is generated from only a single sample of 100 trips. As a result, the width of the interval varies quite a bit as well. Compared to the interval width we computed originally (which required multiple samples), we see that some of these intervals are much more narrow and some are much wider.

Table().with_column('Width', intervals.column('Upper') - intervals.column('Lower'))


Given a single sample, it is impossible to know whether you have correctly estimated the parameter or not. However, you can infer a precise notion of confidence by following a process based on random sampling.


Distance Between Distributions

By now we have seen many examples of situations in which one distribution appears to be quite close to another, such as a population and its sample. The goal of this section is to quantify the difference between two distributions. This will allow our analyses to be based on more than the assessments that we are able to make by eye.

For this, we need a measure of the distance between two distributions. We will develop such a measure in the context of a study conducted in 2010 by the American Civil Liberties Union (ACLU) of Northern California.

The focus of the study was the racial composition of jury panels in Alameda County. A jury panel is a group of people chosen to be prospective jurors; the final trial jury is selected from among them. Jury panels can consist of a few dozen people or several thousand, depending on the trial. By law, a jury panel is supposed to be representative of the community in which the trial is taking place. Section 197 of California's Code of Civil Procedure says, "All persons selected for jury service shall be selected at random, from a source or sources inclusive of a representative cross section of the population of the area served by the court."

The final jury is selected from the panel by deliberate inclusion or exclusion: the law allows potential jurors to be excused for medical reasons; lawyers on both sides may strike a certain number of potential jurors from the list in what are called "peremptory challenges"; the trial judge might make a selection based on questionnaires filled out by the panel; and so on. But the initial panel is supposed to resemble a random sample of the population of eligible jurors.

The ACLU of Northern California compiled data on the racial composition of the jury panels in 11 felony trials in Alameda County in the years 2009 and 2010. In those panels, the total number of people who reported for jury service was 1453. The ACLU gathered demographic data on all of these prospective jurors, and compared those data with the composition of all eligible jurors in the county.

The data are tabulated below in a table called jury. For each race, the first value is the percentage of all eligible juror candidates of that race. The second value is the percentage of people of that race among those who appeared for the process of selection into the jury.


jury = Table(["Race", "Eligible", "Panel"]).with_rows([
    ["Asian", 0.15, 0.26],
    ["Black", 0.18, 0.08],
    ["Latino", 0.12, 0.08],
    ["White", 0.54, 0.54],
    ["Other", 0.01, 0.04],
])

jury.set_format([1, 2], PercentFormatter(0))

Race Eligible Panel

Asian 15% 26%

Black 18% 8%

Latino 12% 8%

White 54% 54%

Other 1% 4%

Some races are overrepresented and some are underrepresented on the jury panels in the study. A bar graph is helpful for visualizing the differences.

jury.barh('Race')

Total Variation Distance


To measure the difference between the two distributions, we will compute a quantity called the total variation distance between them. To compute the total variation distance, take the difference between the two proportions in each category, add up the absolute values of all the differences, and then divide the sum by 2.

Here are the differences and the absolute differences:

jury.append_column("Difference", jury.column("Panel") - jury.column("Eligible"))
jury.append_column("Abs. Difference", np.abs(jury.column("Difference")))

jury.set_format([3, 4], PercentFormatter(0))

Race Eligible Panel Difference Abs. Difference

Asian 15% 26% 11% 11%

Black 18% 8% -10% 10%

Latino 12% 8% -4% 4%

White 54% 54% 0% 0%

Other 1% 4% 3% 3%

And here is the sum of the absolute differences, divided by 2:

sum(jury.column(4)) / 2

0.14000000000000001

The total variation distance between the distribution of eligible jurors and the distribution of the panels is 0.14. Before we examine the numerical value, let us examine the reasons behind the steps in the calculation.

It is quite natural to compute the difference between the proportions in each category. Now take a look at the Difference column and notice that the sum of its entries is 0: the positive entries add up to 0.14, exactly canceling the total of the negative entries, which is -0.14. The proportions in each of the two columns Panel and Eligible add up to 1, and so the give-and-take between their entries must add up to 0.

To avoid the cancellation, we drop the negative signs and then add all the entries. But this gives us two times the total of the positive entries (equivalently, two times the total of the negative entries, with the sign removed). So we divide the sum by 2.


We would have obtained the same result by just adding the positive differences. But our method of including all the absolute differences eliminates the need to keep track of which differences are positive and which are not.
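A quick check with the differences from the table above confirms that summing only the positive entries gives the same 0.14 as halving the sum of absolute values:

```python
import numpy as np

diffs = np.array([0.11, -0.10, -0.04, 0.00, 0.03])  # Panel minus Eligible

half_abs_sum = np.abs(diffs).sum() / 2   # the total variation distance
positive_sum = diffs[diffs > 0].sum()    # just the positive differences
half_abs_sum, positive_sum               # both ≈ 0.14
```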

The following function computes the total variation distance between two columns of a table.

def total_variation_distance(column, other):
    return sum(np.abs(column - other)) / 2

def table_tvd(table, label, other):
    return total_variation_distance(table.column(label), table.column(other))

table_tvd(jury, 'Eligible', 'Panel')

0.14000000000000001

Are the panels representative of the population?

We will now turn to the numerical value of the total variation distance between the eligible jurors and the panel. How can we interpret the distance of 0.14? To answer this, let us recall that the panels are supposed to be selected at random. It will therefore be informative to compare the value of 0.14 with the total variation distance between the eligible jurors and a randomly selected panel.

To do this, we will employ our skills at simulation. There were 1453 prospective jurors in the panels in the study. So let us take a random sample of size 1453 from the population of eligible jurors.

[Technical note. Random samples of prospective jurors would be selected without replacement. However, when the size of a sample is small relative to the size of the population, sampling without replacement resembles sampling with replacement; the proportions in the population don't change much between draws. The population of eligible jurors in Alameda County is over a million, and compared to that, a sample size of about 1500 is quite small. We will therefore sample with replacement.]

The np.random.multinomial function draws a random sample uniformly with replacement from a population whose distribution is categorical. Specifically, np.random.multinomial takes as its input a sample size and an array consisting of the probabilities of choosing different categories. It returns the count of the number of times each category was chosen.

sample_size = 1453

random_panel = np.random.multinomial(sample_size, jury["Eligible"])

random_panel

array([208,265,166,805,9])

sum(random_panel)

1453

We can compare this distribution with the distribution of eligible jurors, by converting the counts to proportions. To do this, we will divide the counts by the sample size. It is clear from the results that the two distributions are quite similar.

sampled = jury.select(['Race', 'Eligible', 'Panel']).with_column(
    'Random Panel', random_panel / sample_size)

sampled.set_format(3, PercentFormatter(0))

Race Eligible Panel Random Panel

Asian 15% 26% 14%

Black 18% 8% 18%

Latino 12% 8% 11%

White 54% 54% 55%

Other 1% 4% 1%

As always, it helps to visualize. The population and sample are quite similar.

sampled.barh('Race',[1,3])


When we also compare to the panel, the difference is stark.

sampled.barh('Race',[1,3,2])

The total variation distance between the population distribution and the sample distribution is quite small.

table_tvd(sampled, 'Eligible', 'Random Panel')

0.016407432897453545

Comparing this to the distance of 0.14 for the panel quantifies our observation that the random sample is close to the distribution of eligible jurors, and the panel is relatively far from the distribution of eligible jurors.

However, the distance between the random sample and the eligible jurors depends on the sample; sampling again might give a different result.


How do random samples differ from the population?

The total variation distance between the distribution of the random sample and the distribution of the eligible jurors is the statistic that we are using to measure the distance between the two distributions. By repeating the process of sampling, we can measure how much the statistic varies across different random samples. The code below computes the empirical distribution of the statistic based on a large number of replications of the sampling process.

# Compute empirical distribution of TVDs
sample_size = 1453
repetitions = 1000

eligible = jury.column('Eligible')
tvds = Table(["TVD from a random sample"])

for i in np.arange(repetitions):
    sample = np.random.multinomial(sample_size, eligible) / sample_size
    tvd = sum(abs(sample - eligible)) / 2
    tvds.append([tvd])

tvds

TVD from a random sample

0.0207571

0.0286098

0.0201721

0.0308672

0.0185134

0.0102546

0.0119133

0.0346731

0.0271576

0.0216862

... (990 rows omitted)


Each row of the column above contains the total variation distance between a random sample and the population of eligible jurors.

The histogram of this column shows that drawing 1453 jurors at random from the pool of eligible candidates results in a distribution that rarely deviates from the eligible jurors' race distribution by more than 0.05.

tvds.hist(bins=np.arange(0,0.2,0.005))

Howdothepanelscomparetorandomsamples?¶

Thepanelsinthestudy,however,werenotquitesosimilartotheeligiblepopulation.Thetotalvariationdistancebetweenthepanelsandthepopulationwas0.14,whichisfaroutinthetailofthehistogramabove.Itdoesnotlooklikeatypicaldistancebetweenarandomsampleandtheeligiblepopulation.

OuranalysissupportstheACLU'sconclusionthatthepanelswerenotrepresentativeofthepopulation.TheACLUreportdiscussesseveralreasonsforthediscrepancies.Forexample,someminoritygroupswereunderrepresentedontherecordsofvoterregistrationandoftheDepartmentofMotorVehicles,thetwomainsourcesfromwhichjurorsareselected;atthetimeofthestudy,thecountydidnothaveaneffectiveprocessforfollowinguponprospectivejurorswhohadbeencalledbuthadfailedtoappear;andsoon.Whateverthereasons,itseemsclearthatthecompositionofthejurypanelswasdifferentfromwhatwewouldhaveexpectedinarandomsample.

AClassicalInterlude:theChi-SquaredStatistic¶

"Dothedatalooklikearandomsample?"isaquestionthathasbeenaskedinmanycontextsformanyyears.Inclassicaldataanalysis,thestatisticmostcommonlyusedinansweringthisquestioniscalledthe ("chisquared")statistic,andiscalculatedasfollows:

Step1.Foreachcategory,definethe"expectedcount"asfollows:

Thisisthecountthatyouwouldexpectinthatcategory,inarandomlychosensample.

Step2.Foreachcategory,compute

Step3.AddupallthenumberscomputedinStep2.

Alittlealgebrashowsthatthisisequivalentto:

AlternativeMethod,Step1.Foreachcategory,compute

AlternativeMethod,Step2.Addupallthenumbersyoucomputedinthestepabove,andmultiplythesumbythesamplesize.

Itmakessensethatthestatisticshouldbebasedonthedifferencebetweenproportionsineachcategory,justasthetotalvariationdistanceis.Theremainingstepsofthemethodarenotveryeasytoexplainwithoutgettingdeeperintomathematics.
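The claimed equivalence of the two methods can be checked numerically; here is a minimal sketch with a made-up population distribution and made-up sample counts:

```python
import numpy as np

# Hypothetical population distribution and sample counts, just to
# check that the two ways of computing chi-squared agree.
population = np.array([0.2, 0.3, 0.5])
counts = np.array([25, 35, 40])
n = counts.sum()

# Steps 1-3: sum of (observed count - expected count)^2 / expected count
expected = n * population
chisq_counts = sum((counts - expected) ** 2 / expected)

# Alternative method: sample size times the sum of
# (sample proportion - population proportion)^2 / population proportion
proportions = counts / n
chisq_props = n * sum((proportions - population) ** 2 / population)

chisq_counts, chisq_props  # the two values agree
```

The algebra behind the agreement: if a category has observed count n·p̂ and expected count n·p, then (n·p̂ − n·p)²/(n·p) = n·(p̂ − p)²/p.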

The reason for choosing this statistic over any other was that statisticians were able to come up with a formula for its approximate probability distribution when the sample size is large. The distribution has an impressive name: it is called the χ² distribution with degrees of freedom equal to 4 (one fewer than the number of categories). Data analysts would compute the χ² statistic for their sample, and then compare it with tabulated numerical values of the χ² distributions.

Simulating the χ² statistic.

The χ² statistic is just a statistic like any other. We can use the computer to simulate its behavior even if we have not studied all the underlying mathematics.

For the panels in the ACLU study, the χ² statistic is about 348:

sum(((jury["Panel"] - jury["Eligible"]) ** 2) / jury["Eligible"]) * sample_size

348.07422222222226

To generate the empirical distribution of the χ² statistic, all we need is a small modification of the code we used for simulating total variation distance:

# Compute the empirical distribution of the chi-squared statistic

classical_from_sample = Table(["'Chi-squared' statistic, from a random sample"])
for i in np.arange(repetitions):
    sample = np.random.multinomial(sample_size, eligible) / sample_size
    chisq = sum(((sample - eligible) ** 2) / eligible) * sample_size
    classical_from_sample.append([chisq])
classical_from_sample

'Chi-squared' statistic, from a random sample
5.29816
1.83597
5.03969
4.3795
10.738
4.74311
4.15047
5.83918
9.6604
0.127756

... (990 rows omitted)

Here is a histogram of the empirical distribution of the χ² statistic. The simulated values of χ² based on random samples are considerably smaller than the value of 348 that we got from the jury panels.
classical_from_sample.hist(bins=np.arange(0, 18, 0.5), normed=True)

In fact, the long-run average value of χ² statistics is equal to the degrees of freedom. And indeed, the average of the simulated values of χ² is close to 4, the degrees of freedom in this case.

np.mean(classical_from_sample.column(0))

3.978631566873136

In this situation, random samples are expected to produce a χ² statistic of about 4. The panels, on the other hand, produced a χ² statistic of 348. This is yet more confirmation of the conclusion that the panels do not resemble a random sample from the population of eligible jurors.

Hypothesis Testing

Our investigation of jury panels is an example of hypothesis testing. We wished to know the answer to a yes-or-no question about the world, and we reasoned about random samples in order to answer the question. In this section, we formalize the process.

Sampling from a Distribution

The rows of a table describe individual elements of a collection of data. In the previous section, we worked with a table that described a distribution over categories. Tables that describe the proportions or counts of different categories in a sample or population arise often in data analysis as a summary of a data collection process. The sample_from_distribution method draws a sample of counts for the categories. Its implementation uses the same np.random.multinomial method from the previous section.

The table below describes the proportion of the eligible potential jurors in Alameda County (estimated from 2000 census data and other sources) for broad categories of race and ethnic background. This table was compiled by Professor Weeks, a demographer at San Diego State University, for the Alameda County trial People v. Stuart Alexander in 2002.

population = Table(["Race", "Eligible"]).with_rows([
    ["Asian", 0.15],
    ["Black", 0.18],
    ["Latino", 0.12],
    ["White", 0.54],
    ["Other", 0.01],
])
population

Race Eligible
Asian 0.15
Black 0.18
Latino 0.12
White 0.54
Other 0.01

The sample_from_distribution method takes a number of samples and a column index or label. It returns a table in which the distribution column has been augmented with a column of category counts for the sample.

sample_size = 1453

population.sample_from_distribution("Eligible", sample_size)

Race Eligible Eligible sample
Asian 0.15 233
Black 0.18 245
Latino 0.12 177
White 0.54 787
Other 0.01 11

Each count is selected randomly in such a way that the chance of each category is the corresponding proportion in the original "Eligible" column. Sampling again will give different counts, but again the "White" category is much more common than "Other" because of its much higher proportion.
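Each such sample of counts can be drawn with a single call to np.random.multinomial, as in the previous section. A minimal sketch, using the proportions from the table above:

```python
import numpy as np

# Proportions from the population table above, and the sample size
# used throughout this section.
eligible = [0.15, 0.18, 0.12, 0.54, 0.01]
sample_size = 1453

# One random sample of category counts: each of the 1453 draws lands
# in a category with probability equal to that category's proportion.
counts = np.random.multinomial(sample_size, eligible)
counts        # five counts that vary from run to run
counts.sum()  # always equals sample_size
```

Because every draw lands in exactly one category, the counts always sum to the sample size, whatever the random outcome.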

population.sample_from_distribution("Eligible", sample_size)

Race Eligible Eligible sample
Asian 0.15 227
Black 0.18 247
Latino 0.12 155
White 0.54 819
Other 0.01 5

The option proportions=True draws sample counts and then divides each count by the total number of samples. The result is another distribution, one based on a sample.

sample = population.sample_from_distribution("Eligible", sample_size, proportions=True)

sample

Race Eligible Eligible sample
Asian 0.15 0.140399
Black 0.18 0.189264
Latino 0.12 0.125258
White 0.54 0.532691
Other 0.01 0.0123882

This output sample can also be used to generate a sample. The sample of a sample is called a resampled sample, or a bootstrap sample. It is not a random sample of the original distribution, but it shares many characteristics with such a sample.

sample.sample_from_distribution('Eligible sample', sample_size)

Race Eligible Eligible sample Eligible sample sample
Asian 0.15 0.140399 204
Black 0.18 0.189264 258
Latino 0.12 0.125258 163
White 0.54 0.532691 810
Other 0.01 0.0123882 18

Empirical Distributions

An empirical distribution of a statistic is a table generated by computing the statistic for many samples. The previous section used empirical distributions to reason about whether or not a sample had a statistic that was typical of a random sample. The following function computes the empirical distribution of a statistic, which is expressed as a function f that takes a sample.

def empirical_distribution(table, label, sample_size, k, f):
    stats = Table(['Sample #', 'Statistic'])
    for i in np.arange(k):
        sample = table.sample_from_distribution(label, sample_size)
        statistic = f(sample)
        stats.append([i, statistic])
    return stats

This function can be used to compute an empirical distribution of the total variation distance from the population distribution.

def tvd_from_eligible(sample):
    counts = sample.column('Eligible sample')
    distribution = counts / sum(counts)
    return total_variation_distance(population.column('Eligible'), distribution)

empirical_distribution(population, 'Eligible', sample_size, 1000, tvd_from_eligible)

Sample # Statistic
0 0.0248796
1 0.0195664
2 0.00401927
3 0.0165864
4 0.0207502
5 0.013042
6 0.0297041
7 0.0153544
8 0.00830695
9 0.014384

... (990 rows omitted)

The empirical distribution can be visualized using a histogram.

empirical_distribution(population, 'Eligible', sample_size, 1000, tvd_from_eligible).hist(1)

U.S. Supreme Court, 1965: Swain vs. Alabama

In the early 1960's, in Talladega County in Alabama, a black man called Robert Swain was convicted of raping a white woman and was sentenced to death. He appealed his sentence, citing among other factors the all-white jury. At the time, only men aged 21 or older were allowed to serve on juries in Talladega County. In the county, 26% of the eligible jurors were black, but there were only 8 black men among the 100 selected for the jury panel in Swain's trial. No black man was selected for the trial jury.

In 1965, the Supreme Court of the United States denied Swain's appeal. In its ruling, the Court wrote "... the overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of [black men]."

Let us use the methods we have developed to examine the disparity between 8 out of 100 and 26 out of 100 black men in a panel of 100 drawn at random from among the eligible jurors.

alabama = Table(['Race', 'Eligible']).with_rows([
    ["Black", 0.26],
    ["Other", 0.74]
])
alabama.set_format(1, PercentFormatter(0))

Race Eligible
Black 26%
Other 74%

As our test statistic, we will use the number of black men in a random sample of size 100. Here is an empirical distribution of this statistic.

def value_for(category, category_label, value_label):
    def in_table(sample):
        return sample.where(category_label, category).column(value_label)[0]
    return in_table

num_black = value_for('Black', 'Race', 'Eligible sample')

black_in_sample = empirical_distribution(alabama, 'Eligible', 100, 10000, num_black)

black_in_sample

Sample # Statistic
0 27
1 31
2 28
3 27
4 24
5 23
6 29
7 28
8 16
9 25

... (9990 rows omitted)

The numbers of black men in the first 10 repetitions were quite a bit larger than 8. The empirical histogram below shows the distribution of the test statistic over all the repetitions.

black_in_sample.hist(1, bins=np.arange(0, 50, 1))

If the 100 men in Swain's jury panel had been chosen at random, it would have been extremely unlikely for the number of black men on the panel to be as small as 8. We must conclude that the percentage disparity was larger than the disparity expected due to chance variation.

Method and Terminology of Statistical Tests of Hypotheses

We have developed some of the fundamental concepts of statistical tests of hypotheses, in the context of examples about jury selection. Using statistical tests as a way of making decisions is standard in many fields and has a standard terminology. Here is the sequence of the steps in most statistical tests, along with some terminology and examples.

STEP 1: THE HYPOTHESES

All statistical tests attempt to choose between two views of how the data were generated. These two views are called hypotheses.

The null hypothesis. This says that the data were generated at random under clearly specified assumptions that make it possible to compute chances. The word "null" reinforces the idea that if the data look different from what the null hypothesis predicts, the difference is due to nothing but chance.

In both of our examples about jury selection, the null hypothesis is that the panels were selected at random from the population of eligible jurors. Though the racial composition of the panels was different from that of the populations of eligible jurors, there was no reason for the difference other than chance variation.

The alternative hypothesis. This says that some reason other than chance made the data differ from what was predicted by the null hypothesis. Informally, the alternative hypothesis says that the observed difference is "real."

In both of our examples about jury selection, the alternative hypothesis is that the panels were not selected at random. Something other than chance led to the differences between the racial composition of the panels and the racial composition of the populations of eligible jurors.

STEP 2: THE TEST STATISTIC

In order to decide between the two hypotheses, we must choose a statistic upon which we will base our decision. This is called the test statistic.

In the example about jury panels in Alameda County, the test statistic we used was the total variation distance between the racial distributions in the panels and in the population of eligible jurors. In the example about Swain versus the State of Alabama, the test statistic was the number of black men on the jury panel.

Calculating the observed value of the test statistic is often the first computational step in a statistical test. In the case of jury panels in Alameda County, the observed value of the total variation distance between the distributions in the panels and the population was 0.14. In the example about Swain, the number of black men on his jury panel was 8.

STEP 3: THE PROBABILITY DISTRIBUTION OF THE TEST STATISTIC, UNDER THE NULL HYPOTHESIS

This step sets aside the observed value of the test statistic, and instead focuses on what the value might be if the null hypothesis were true. Under the null hypothesis, the sample could have come out differently due to chance. So the test statistic could have come out differently. This step consists of figuring out all possible values of the test statistic and all their probabilities, under the null hypothesis of randomness.

In other words, in this step we calculate the probability distribution of the test statistic, pretending that the null hypothesis is true. For many test statistics, this can be a daunting task both mathematically and computationally. Therefore, we approximate the probability distribution of the test statistic by the empirical distribution of the statistic based on a large number of repetitions of the sampling procedure.

This was the empirical distribution of the test statistic – the number of black men on the jury panel – in the example about Swain and the Supreme Court:

black_in_sample.hist(1, bins=np.arange(0, 50, 1))

STEP 4: THE CONCLUSION OF THE TEST

The choice between the null and alternative hypotheses depends on the comparison between the results of Steps 2 and 3: the observed test statistic and its distribution as predicted by the null hypothesis.

If the two are consistent with each other, then the observed test statistic is in line with what the null hypothesis predicts. In other words, the test does not point towards the alternative hypothesis; the null hypothesis is better supported by the data.

But if the two are not consistent with each other, as is the case in both of our examples about jury panels, then the data do not support the null hypothesis. In both our examples we concluded that the jury panels were not selected at random. Something other than chance affected their composition.

How is "consistent" defined? Whether the observed test statistic is consistent with its predicted distribution under the null hypothesis is a matter of judgment. We recommend that you provide your judgment along with the value of the test statistic and a graph of its predicted distribution. That will allow your reader to make his or her own judgment about whether the two are consistent.

If you do not want to make your own judgment, there are conventions that you can follow. These conventions are based on what is called the observed significance level, or P-value for short. The P-value is a chance computed using the probability distribution of the test statistic, and can be approximated by using the empirical distribution in Step 3.

Practical note on P-values and conventions. Place the observed test statistic on the horizontal axis of the histogram, and find the proportion in the tail starting at that point. That's the P-value.

If the P-value is small, the data support the alternative hypothesis. The conventions for what is "small":

If the P-value is less than 5%, the result is called "statistically significant."

If the P-value is even smaller – less than 1% – the result is called "highly statistically significant."

More formal definition of P-value. The P-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed or is even further in the direction of the alternative.

The P-value is based on comparing the observed test statistic and what the null hypothesis predicts. A small P-value implies that under the null hypothesis, the value of the test statistic is unlikely to be near the one we observed. When a hypothesis and the data are not in accord, the hypothesis has to go. That is why we reject the null hypothesis if the P-value is small.
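In code, "further in the direction of the alternative" determines which tail to count. A minimal sketch with made-up simulated values of a statistic:

```python
import numpy as np

# Made-up simulated values of a test statistic under the null hypothesis.
simulated = np.array([3, 5, 8, 2, 9, 4, 7, 6, 1, 10])
observed = 8

# If the alternative says the statistic should be large, count the right tail:
p_right = np.count_nonzero(simulated >= observed) / len(simulated)

# If the alternative says it should be small (as in the Swain example),
# count the left tail:
p_left = np.count_nonzero(simulated <= observed) / len(simulated)

p_right, p_left  # 0.3 and 0.8 for these made-up values
```

The same pattern, with the real simulated statistics, is used for the jury examples below.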

The P-value for the null hypothesis that Swain's jury panel was selected at random from the population of Talladega is approximated using the empirical distribution as follows:

np.count_nonzero(black_in_sample.column(1) <= 8) / black_in_sample.num_rows

0.0

HISTORICAL NOTE ON THE CONVENTIONS

The determination of statistical significance, as defined above, has become standard in statistical analyses in all fields of application. When a convention is so universally followed, it is interesting to examine how it arose.

The method of statistical testing – choosing between hypotheses based on data in random samples – was developed by Sir Ronald Fisher in the early 20th century. Sir Ronald might have set the convention for statistical significance somewhat unwittingly, in the following statement in his 1925 book Statistical Methods for Research Workers. About the 5% level, he wrote, "It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not."

What was "convenient" for Sir Ronald became a cutoff that has acquired the status of a universal constant. No matter that Sir Ronald himself made the point that the value was his personal choice from among many: in an article in 1926, he wrote, "If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the author prefers to set a low standard of significance at the 5 per cent point ..."

Fisher knew that "low" is a matter of judgment and has no unique definition. We suggest that you follow his excellent example. Provide your data, make your judgment, and explain why you made it.

Permutation

Comparing Two Groups

In the examples above, we investigated whether a sample appears to be chosen randomly from an underlying population. We did this by comparing the distribution of the sample with the distribution of the population. A similar line of reasoning can be used to compare the distributions of two samples. In particular, we can investigate whether or not two samples appear to be drawn from the same underlying distribution.

Example: Married Couples and Unmarried Partners

Our next example is based on a study conducted in 2010 under the auspices of the National Center for Family and Marriage Research.

In the United States, the proportion of couples who live together but are not married has been rising in recent decades. The study involved a national random sample of over 1,000 heterosexual couples who were either married or "cohabiting partners" – living together but unmarried. One of the goals of the study was to compare the attitudes and experiences of the married and unmarried couples.

The table below shows a subset of the data collected in the study. Each row corresponds to one person. The variables that we will examine in this section are:

Marital Status: married or unmarried
Employment Status: one of several categories described below
Gender
Age: age in years

columns = ['Marital Status', 'Employment Status', 'Gender', 'Age']

couples = Table.read_table('couples.csv').select(columns)

couples

Marital Status Employment Status Gender Age
married working as paid employee male 51
married working as paid employee female 53
married working as paid employee male 57
married working as paid employee female 57
married working as paid employee male 60
married working as paid employee female 57
married working, self-employed male 62
married working as paid employee female 59
married not working - other male 53
married not working - retired female 61

... (2056 rows omitted)

Let us consider just the males first. There are 742 married couples and 292 unmarried couples, and all couples in this study had one male and one female, making 1,034 males in all.

# Separate tables for married and cohabiting unmarried couples:

married_men = couples.where('Gender', 'male').where('Marital Status', 'married')

partnered_men = couples.where('Gender', 'male').where('Marital Status', 'partner')

# Let's see how many married and unmarried people there are:

couples.where('Gender', 'male').group('Marital Status').barh("Marital Status")

Societal norms have changed over the decades, and there has been a gradual acceptance of couples living together without being married. Thus it is natural to expect that unmarried couples will in general consist of younger people than married couples.

The histograms of the ages of the married and unmarried men show that this is indeed the case. We will draw these histograms and compare them. In order to compare two histograms, both should be drawn to the same scale. Let us write a function that does this for us.

def plot_age(table, subject):
    """
    Draws a histogram of the Age column in the given table.

    table should be a Table with a column of people's ages called Age.
    subject should be a string -- the name of the group we're displaying,
    like "married men".
    """
    # Draw a histogram of ages running from 15 years to 70 years
    table.hist('Age', bins=np.arange(15, 71, 5), unit='year')
    # Set the lower and upper bounds of the vertical axis so that
    # the plots we make are all comparable.
    plt.ylim(0, 0.045)
    plt.title("Ages of " + subject)

# Ages of men:

plot_age(married_men, "married men")

plot_age(partnered_men, "cohabiting unmarried men")

The difference is even more marked when we compare the married and unmarried women.

married_women = couples.where('Gender', 'female').where('Marital Status', 'married')

partnered_women = couples.where('Gender', 'female').where('Marital Status', 'partner')

# Ages of women:

plot_age(married_women, "married women")

plot_age(partnered_women, "cohabiting unmarried women")

The histograms show that the married men in the sample are in general older than the unmarried cohabiting men. Married women are in general older than unmarried women. These observations are consistent with what we had predicted based on changing social norms.

If married couples are in general older, they might differ from unmarried couples in other ways as well. Let us compare the employment status of the married and unmarried men in the sample.

The table below shows the marital status and employment status of each man in the sample.

males = couples.where('Gender', 'male').select(['Marital Status', 'Employment Status'])

males.show(5)

Marital Status Employment Status
married working as paid employee
married working as paid employee
married working as paid employee
married working, self-employed
married not working - other

... (1028 rows omitted)

Contingency Tables

To investigate the association between employment and marriage, we would like to be able to ask questions like, "How many married men are retired?"

Recall that the method pivot lets us do exactly that. It cross-classifies each man according to the two variables – marital status and employment status. Its output is a contingency table that contains the counts in each pair of categories.

employment = males.pivot('Marital Status', 'Employment Status')

employment

Employment Status married partner
not working - disabled 44 20
not working - looking for work 28 33
not working - on a temporary layoff from a job 15 8
not working - other 16 9
not working - retired 44 4
working as paid employee 513 170
working, self-employed 82 47

The arguments of pivot are the labels of the two columns corresponding to the variables we are studying. Categories of the first argument appear as columns; categories of the second argument are the rows. Each cell of the table contains the number of men in a pair of categories – a particular employment status and a particular marital status.

The table shows that regardless of marital status, the men in the sample are most likely to be working as paid employees. But it is quite hard to compare the entire distributions based on this table, because the numbers of married and unmarried men in the sample are not the same. There are 742 married men but only 291 unmarried ones.

employment.drop(0).sum()

married partner
742 291

To adjust for this difference in total numbers, we will convert the counts into proportions, by dividing all the married counts by 742 and all the partner counts by 291.

proportions = Table().with_columns([
    "Employment Status", employment.column("Employment Status"),
    "married", employment.column('married') / sum(employment.column('married')),
    "partner", employment.column('partner') / sum(employment.column('partner')),
])
proportions.show()

Employment Status married partner
not working - disabled 0.0592992 0.0687285
not working - looking for work 0.0377358 0.113402
not working - on a temporary layoff from a job 0.0202156 0.0274914
not working - other 0.0215633 0.0309278
not working - retired 0.0592992 0.0137457
working as paid employee 0.691375 0.584192
working, self-employed 0.110512 0.161512

The married column of this table shows the distribution of employment status of the married men in the sample. For example, among married men, the proportion who are retired is about 0.059. The partner column shows the distribution of the employment status of the unmarried men in the sample. Among unmarried men, the proportion who are retired is about 0.014.

The two distributions look different from each other in other ways too, as can be seen more clearly in the bar graphs below. It appears that a larger proportion of the married men in the sample work as paid employees, whereas a larger proportion of the unmarried men are not working but are looking for work.

proportions.barh('Employment Status')

The distributions of employment status of the men in the two groups – married and unmarried – are clearly different in the sample.

Are the two distributions different in the population?

This raises the question of whether the difference is due to randomness in the sampling, or whether the distributions of employment status are indeed different for married and unmarried cohabiting men in the U.S. Remember that the data that we have are from a sample of just 1,033 couples; we do not know the distribution of employment status of married or unmarried cohabiting men in the entire country.

We can answer the question by performing a statistical test of hypotheses. Let us use the terminology that we developed for this in the previous section.

Null hypothesis. In the United States, the distribution of employment status among married men is the same as among unmarried men who live with their partners.

Another way of saying this is that employment status and marital status are independent, or not associated.

If the null hypothesis were true, then the difference that we have observed in the sample would be just due to chance.

Alternative hypothesis. In the United States, the distributions of the employment status of the two groups of men are different. In other words, employment status and marital status are associated in some way.

As our test statistic, we will use the total variation distance between the two distributions.

The observed value of the test statistic is about 0.15:

# TVD between the two distributions in the sample

married = proportions.column('married')
partner = proportions.column('partner')
observed_tvd = 0.5 * sum(abs(married - partner))
observed_tvd

0.15273571011754242

Random Permutations

In order to compare this observed value of the total variation distance with what is predicted by the null hypothesis, we need to know how the total variation distance would vary across all possible random samples if employment status and marital status were not related.

This is quite daunting to derive by mathematics, but let us see if we can get a good approximation by simulation.

With just one sample at hand, and no further knowledge of the distribution of employment status among men in the United States, how can we go about replicating the sampling procedure? The key is to note that if marital status and employment status were not connected in any way, then we could replicate the sampling process by replacing each man's employment status by a randomly picked employment status from among all the men, married and unmarried.

Doing this for all the men is equivalent to randomly rearranging the entire column containing employment status, while leaving the marital status column unchanged. Such a rearrangement is called a random permutation.

Thus, under the null hypothesis, we can replicate the sampling process by assigning to each man an employment status chosen at random without replacement from the entries in the column Employment Status. We can do the replication by simply permuting the entire Employment Status column and leaving everything else unchanged.
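Permuting one column while leaving the other fixed can also be sketched directly with np.random.permutation; the four rows here are made up purely for illustration:

```python
import numpy as np

# Made-up rows: marital status stays fixed, employment status gets shuffled.
marital = np.array(['married', 'married', 'partner', 'partner'])
employment = np.array(['employee', 'retired', 'employee', 'self-employed'])

# np.random.permutation returns a shuffled copy, leaving the original intact.
shuffled = np.random.permutation(employment)

# Pairing the fixed column with the shuffled one breaks any association
# between marital status and employment status, which is exactly what
# the null hypothesis of "no association" describes.
list(zip(marital, shuffled))
```

The shuffled column contains exactly the same values as the original, just in a random order; only the pairing with marital status changes.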

Let's implement this plan. First, we will shuffle the Employment Status column using the sample method, which simply shuffles all the rows when called with no arguments.

# Randomly permute the employment status of all men

shuffled = males.select('Employment Status').sample()

shuffled

Employment Status
working, self-employed
working, self-employed
working as paid employee
working as paid employee
working, self-employed
working as paid employee
not working - disabled
working as paid employee
working as paid employee
working as paid employee

... (1023 rows omitted)

The first two columns of the table below are taken from the original sample. The third has been created by randomly permuting the original Employment Status column.

# Construct a table in which employment status has been shuffled

males_with_shuffled_empl = Table().with_columns([
    "Marital Status", males.column('Marital Status'),
    "Employment Status", males.column('Employment Status'),
    "Employment Status (shuffled)", shuffled.column('Employment Status'),
])
males_with_shuffled_empl

Marital Status Employment Status Employment Status (shuffled)
married working as paid employee working, self-employed
married working as paid employee working, self-employed
married working as paid employee working as paid employee
married working, self-employed working as paid employee
married not working - other working, self-employed
married not working - on a temporary layoff from a job working as paid employee
married not working - disabled not working - disabled
married working as paid employee working as paid employee
married working as paid employee working as paid employee
married not working - retired working as paid employee

... (1023 rows omitted)

Once again, the pivot method computes the contingency table, which allows us to calculate the total variation distance between the distributions of the two groups of men after their employment status has been shuffled.

employment_shuffled = males_with_shuffled_empl.pivot('Marital Status', 'Employment Status (shuffled)')

employment_shuffled

Employment Status (shuffled) married partner
not working - disabled 48 16
not working - looking for work 44 17
not working - on a temporary layoff from a job 16 7
not working - other 18 7
not working - retired 39 9
working as paid employee 489 194
working, self-employed 88 41

# TVD between the two distributions in the contingency table above

e_s = employment_shuffled
married = e_s.column('married') / sum(e_s.column('married'))
partner = e_s.column('partner') / sum(e_s.column('partner'))
0.5 * sum(abs(married - partner))

0.032423745611841297

This total variation distance was computed based on the null hypothesis that the distributions of employment status for the two groups of men are the same. You can see that it is noticeably smaller than the observed value of the total variation distance (0.15) between the two groups in our original sample.

A Permutation Test

Could this just be due to chance variation? We will only know if we run many more replications, by randomly permuting the Employment Status column repeatedly. This method of testing is known as a permutation test.

# Put it all together in a for loop to perform a permutation test

repetitions = 500

tvds = Table().with_column("TVD between married and partnered men", [])

for i in np.arange(repetitions):

    # Construct a permuted table
    shuffled = males.select('Employment Status').sample()
    combined = Table().with_columns([
        "Marital Status", males.column('Marital Status'),
        "Employment Status", shuffled.column('Employment Status')
    ])
    employment_shuffled = combined.pivot('Marital Status', 'Employment Status')

    # Compute TVD
    e_s = employment_shuffled
    married = e_s.column('married') / sum(e_s.column('married'))
    partner = e_s.column('partner') / sum(e_s.column('partner'))
    permutation_tvd = 0.5 * sum(abs(married - partner))
    tvds.append([permutation_tvd])

tvds.hist(bins=np.arange(0, 0.2, 0.01))

The figure above is the empirical distribution of the total variation distance between the distributions of the employment status of married and unmarried men, under the null hypothesis.


The observed test statistic of 0.15 is quite far in the tail, and so the chance of observing such an extreme value under the null hypothesis is close to 0.

As before, this chance is called an empirical P-value. The P-value is the chance that our test statistic (the TVD) would come out at least as extreme as the observed value (in this case 0.15 or greater) under the null hypothesis.
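As a minimal illustration of this definition, here is how an empirical P-value can be computed with plain NumPy. The array of simulated statistics below is made up for the example; it is not the survey data:

```python
import numpy as np

# Hypothetical simulated test statistics under the null hypothesis
simulated = np.array([0.02, 0.05, 0.01, 0.16, 0.03, 0.04])

# Hypothetical observed value of the test statistic
observed = 0.15

# Empirical P-value: the proportion of simulated statistics that are
# at least as extreme as the observed value
p_value = np.count_nonzero(simulated >= observed) / len(simulated)
```

Here exactly one of the six simulated values is 0.15 or greater, so the empirical P-value is 1/6.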

Conclusion of the test

Our empirical estimate based on repeated sampling gives us all the information we need for drawing conclusions from the data: the observed statistic is very unlikely under the null hypothesis.

The low P-value constitutes evidence in favor of the alternative hypothesis. The data support the hypothesis that in the United States, the distribution of the employment status of married men is not the same as that of unmarried men who live with their partners.

We have just completed our first permutation test. Permutation tests are quite common in practice because they make very few assumptions about the underlying population and are straightforward to perform and interpret.

Note about the approximate P-value

Our simulation gives us an approximate empirical P-value, because it is based on just 500 random permutations instead of all the possible random permutations. We can compute this empirical P-value directly, without drawing the histogram:

empirical_p_value = np.count_nonzero(tvds.column(0) >= observed_tvd) / repetitions

empirical_p_value

0.0

Computing the exact P-value would require us to consider all possible outcomes of shuffling (a very large number) instead of 500 random shuffles. If we had performed all the possible shuffles, there would have been a few with more extreme TVDs. The true P-value is greater than zero, but not by much.

Generalizing Our Hypothesis Test


The example above includes a substantial amount of code in order to investigate the relationship between two characteristics (marital status and employment status) for a particular subset of the surveyed population (males). Suppose we would like to investigate different characteristics or a different population. How can we reuse the code we have written so far in order to explore more relationships?

When you are about to copy your code, you should think, "Maybe I should write some functions."

What functions should we write? A good way to make this decision is to think about what you have to compute repeatedly.

In our example, the total variation distance is computed over and over again. So we will begin with a generalized computation of the total variation distance between the distributions of any column of values (such as employment status) when separated into any two conditions (such as marital status) for a collection of data described by any table. Our implementation includes the same statements as we used above, but uses generic names that are specified by the final function call.

# TVD between the distributions of values under any two conditions

def tvd(t, conditions, values):
    """Compute the total variation distance
    between proportions of values under two conditions.

    t (Table) -- a table
    conditions (str) -- a column label in t; should have only two categories
    values (str) -- a column label in t
    """
    e = t.pivot(conditions, values)
    a = e.column(1)
    b = e.column(2)
    return 0.5 * sum(abs(a / sum(a) - b / sum(b)))

tvd(males, 'Marital Status', 'Employment Status')

0.15273571011754242
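The same statistic is easy to compute by hand on a toy example. The counts below are invented for illustration and are not the survey data; the point is only the mechanics of the formula:

```python
import numpy as np

# Hypothetical counts of a categorical variable under two conditions
group_a = np.array([30, 50, 20])
group_b = np.array([10, 60, 30])

# Convert the counts to proportions
dist_a = group_a / group_a.sum()
dist_b = group_b / group_b.sum()

# Total variation distance: half the sum of the absolute differences
tvd_value = 0.5 * np.abs(dist_a - dist_b).sum()
# For these counts the proportions differ by 0.2, 0.1, and 0.1,
# so the TVD is 0.5 * 0.4 = 0.2
```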


Next, we can write a function that performs a permutation test, using this tvd function to compute the same statistic on shuffled variants of any table. It's worth reading through this implementation to understand its details.

def permutation_tvd(original, conditions, values):
    """
    Perform a permutation test of whether
    the distribution of values for two conditions
    is the same in the population,
    using the total variation distance between two distributions
    as the test statistic.

    original is a Table with two columns. The value of the argument
    conditions is the name of one column, and the value of the argument
    values is the name of the other column. The conditions column should
    have only 2 possible values, corresponding to 2 categories in the
    data.

    The values column is shuffled many times, and the data are grouped
    according to the conditions column. The total variation distance
    between the proportions of values in the 2 categories is computed.

    Then we draw a histogram of all those TV distances. This shows us
    what the TVD between the values of the two distributions would
    typically look like if the values were independent of the conditions.
    """
    repetitions = 500
    stats = []
    for i in np.arange(repetitions):
        shuffled = original.sample()
        combined = Table().with_columns([
            conditions, original.column(conditions),
            values, shuffled.column(values)
        ])
        stats.append(tvd(combined, conditions, values))

    observation = tvd(original, conditions, values)
    p_value = np.count_nonzero(np.array(stats) >= observation) / repetitions


    print("Observation:", observation)
    print("Empirical P-value:", p_value)
    Table([stats], ['Empirical distribution']).hist()

permutation_tvd(males, 'Marital Status', 'Employment Status')

Observation: 0.152735710118

Empirical P-value: 0.0

Now that we have generalized our permutation test, we can apply it to other hypotheses. For example, we can compare the distribution over the employment status of women, grouping them by their marital status. In the case of men we found a difference, but what about with women? First, we can visualize the two distributions.

def compare_bar(t, conditions, values):
    """Bar graphs of distributions of values for each of two conditions."""
    e = t.pivot(conditions, values)
    for label in e.drop(0).labels:
        # Convert each column of counts into proportions
        e.append_column(label, e.column(label) / sum(e.column(label)))
    e.barh(values)

compare_bar(couples.where('Gender', 'female'), 'Marital Status', 'Employment Status')


A glance at the figure shows that the two distributions are different in the sample. The difference in the category "Not working - other" is particularly striking: about 22% of the married women are in this category, compared to only about 8% of the unmarried women. There are several reasons for this difference. For example, the percent of homemakers is greater among married women than among unmarried women, possibly because married women are more likely to be "stay-at-home" mothers of young children. The difference could also be generational: as we saw earlier, the married couples are older than the unmarried partners, and older women are less likely to be in the workforce than younger women.

While we can see that the distributions are different in the sample, we are not really interested in the sample for its own sake. We are examining the sample because it is likely to reflect the population. So, as before, we will use the sample to try to answer a question about something unknown: the distributions of employment status of married and unmarried cohabiting women in the United States. That is the population from which the sample was drawn.

We have to consider the possibility that the observed difference in the sample could simply be the result of chance variation. Remember that our data are only from a random sample of couples. We do not have data for all the couples in the United States.

Null hypothesis: In the U.S., the distribution of employment status is the same for married women as for unmarried women living with their partners. The difference in the sample is due to chance.

Alternative hypothesis: In the U.S., the distributions of employment status among married and unmarried cohabiting women are different.

Another permutation test to compare distributions

We can test these hypotheses just as we did for men, by using the function permutation_tvd that we defined for this purpose.

permutation_tvd(couples.where('Gender', 'female'), 'Marital Status', 'Employment Status')


Observation: 0.194755513565

Empirical P-value: 0.0

As for the males, the empirical P-value is 0 based on a large number of repetitions. So the exact P-value is close to 0, which is evidence in favor of the alternative hypothesis. The data support the hypothesis that for women in the United States, employment status is associated with whether they are married or unmarried and living with their partners.

Another example

Are gender and employment status independent in the population? We are now in a position to test this quite swiftly:

Null hypothesis. Among married and unmarried cohabiting individuals in the United States, gender is independent of employment status.

Alternative hypothesis. Among married and unmarried cohabiting people in the United States, gender and employment status are related.

permutation_tvd(couples, 'Gender', 'Employment Status')

Observation: 0.185866408519

Empirical P-value: 0.0


The conclusion of the test is that gender and employment status are not independent in the population. This is no surprise; for example, because of societal norms, older women were less likely to have gone into the workforce than men.

Deflategate: Permutation Tests and Quantitative Variables

On January 18, 2015, the Indianapolis Colts and the New England Patriots played the American Football Conference (AFC) championship game to determine which of those teams would play in the Super Bowl. After the game, there were allegations that the Patriots' footballs had not been inflated as much as the regulations required; they were softer. This could be an advantage, as softer balls might be easier to catch.

For several weeks, the world of American football was consumed by accusations, denials, theories, and suspicions: the press labeled the topic Deflategate, after the Watergate political scandal of the 1970s. The National Football League (NFL) commissioned an independent analysis. In this example, we will perform our own analysis of the data.

Pressure is often measured in pounds per square inch (psi). NFL rules stipulate that game balls must be inflated to have pressures in the range of 12.5 psi to 13.5 psi. Each team plays with 12 balls. Teams have the responsibility of maintaining the pressure in their own footballs, but game officials inspect the balls. Before the start of the AFC game, all the Patriots' balls were at about 12.5 psi. Most of the Colts' balls were at about 13.0 psi. However, these pre-game data were not recorded.

During the second quarter, the Colts intercepted a Patriots ball. On the sidelines, they measured the pressure of the ball and determined that it was below the 12.5 psi threshold. Promptly, they informed officials.


At half-time, all the game balls were collected for inspection. Two officials, Clete Blakeman and Dyrol Prioleau, measured the pressure in each of the balls. Here are the data; pressure is measured in psi. The Patriots ball that had been intercepted by the Colts was not inspected at half-time. Nor were most of the Colts' balls; the officials simply ran out of time and had to relinquish the balls for the start of second-half play.

football = Table.read_table('football.csv')

football.show()

Team    Ball           Blakeman    Prioleau

0       Patriots 1     11.5        11.8
0       Patriots 2     10.85       11.2
0       Patriots 3     11.15       11.5
0       Patriots 4     10.7        11
0       Patriots 5     11.1        11.45
0       Patriots 6     11.6        11.95
0       Patriots 7     11.85       12.3
0       Patriots 8     11.1        11.55
0       Patriots 9     10.95       11.35
0       Patriots 10    10.5        10.9
0       Patriots 11    10.9        11.35
1       Colts 1        12.7        12.35
1       Colts 2        12.75       12.3
1       Colts 3        12.5        12.95
1       Colts 4        12.55       12.15

For each of the 15 balls that were inspected, the two officials got different results. It is not uncommon that repeated measurements on the same object yield different results, especially when the measurements are performed by different people. So we will assign to each ball the average of the two measurements made on that ball.


football = football.with_column(
    'Combined', (football.column('Blakeman') + football.column('Prioleau')) / 2
)

football.show()

Team    Ball           Blakeman    Prioleau    Combined

0       Patriots 1     11.5        11.8        11.65
0       Patriots 2     10.85       11.2        11.025
0       Patriots 3     11.15       11.5        11.325
0       Patriots 4     10.7        11          10.85
0       Patriots 5     11.1        11.45       11.275
0       Patriots 6     11.6        11.95       11.775
0       Patriots 7     11.85       12.3        12.075
0       Patriots 8     11.1        11.55       11.325
0       Patriots 9     10.95       11.35       11.15
0       Patriots 10    10.5        10.9        10.7
0       Patriots 11    10.9        11.35       11.125
1       Colts 1        12.7        12.35       12.525
1       Colts 2        12.75       12.3        12.525
1       Colts 3        12.5        12.95       12.725
1       Colts 4        12.55       12.15       12.35

At a glance, it seems apparent that the Patriots' footballs were at a lower pressure than the Colts' balls. Because some deflation is normal during the course of a game, the independent analysts decided to calculate the drop in pressure from the start of the game. Recall that the Patriots' balls had all started out at about 12.5 psi, and the Colts' balls at about 13.0 psi. Therefore the drop in pressure for the Patriots' balls was computed as 12.5 minus the pressure at half-time, and the drop in pressure for the Colts' balls was 13.0 minus the pressure at half-time.


football = football.with_column(
    'Drop', np.array([12.5] * 11 + [13.0] * 4) - football.column('Combined')
)

football.show()

Team    Ball           Blakeman    Prioleau    Combined    Drop

0       Patriots 1     11.5        11.8        11.65       0.85
0       Patriots 2     10.85       11.2        11.025      1.475
0       Patriots 3     11.15       11.5        11.325      1.175
0       Patriots 4     10.7        11          10.85       1.65
0       Patriots 5     11.1        11.45       11.275      1.225
0       Patriots 6     11.6        11.95       11.775      0.725
0       Patriots 7     11.85       12.3        12.075      0.425
0       Patriots 8     11.1        11.55       11.325      1.175
0       Patriots 9     10.95       11.35       11.15       1.35
0       Patriots 10    10.5        10.9        10.7        1.8
0       Patriots 11    10.9        11.35       11.125      1.375
1       Colts 1        12.7        12.35       12.525      0.475
1       Colts 2        12.75       12.3        12.525      0.475
1       Colts 3        12.5        12.95       12.725      0.275
1       Colts 4        12.55       12.15       12.35       0.65

It is apparent that the drop was larger, on average, for the Patriots' footballs. But could the difference be just due to chance?

To answer this, we must first examine how chance might enter the analysis. This is not a situation in which there is a random sample of data from a large population. It is also not clear how to create a justifiable abstract chance model, as the balls were all different, inflated by different people, and maintained under different conditions.

One way to introduce chance is to ask whether the drops in pressures of the 11 Patriots balls and the 4 Colts balls resemble a random permutation of the 15 drops. Then the 4 Colts drops would be a simple random sample of all 15 drops. This gives us a null hypothesis that we can test using random permutations.


Null hypothesis. The drops in the pressures of the 4 Colts balls are like a random sample (without replacement) from all 15 drops.

A new test statistic

The data are quantitative, so we cannot compare the two distributions category by category using the total variation distance. If we try to bin the data in order to use the TVD, the choice of bins can have a noticeable effect on the statistic. So instead, we will work with a simple statistic based on means. We will just compare the average drops in the two groups.

The observed difference between the average drops in the two groups was about 0.7335 psi.

patriots = football.where('Team', 0).column('Drop')

colts = football.where('Team', 1).column('Drop')

observed_difference = np.mean(patriots) - np.mean(colts)

observed_difference

0.73352272727272805

Now the question becomes: If we took a random permutation of the 15 drops, how likely is it that the difference in the means of the first 11 and the last 4 would be at least as large as the difference observed by the officials?

To answer this, we will randomly permute the 15 drops, assign the first 11 permuted values to the Patriots and the last 4 to the Colts. Then we will find the difference in the means of the two permuted groups.

drops_shuffled = football.sample().column('Drop')

patriots_shuffled = drops_shuffled[:11]

colts_shuffled = drops_shuffled[11:]

shuffled_difference = np.mean(patriots_shuffled) - np.mean(colts_shuffled)

shuffled_difference

0.010000000000000675


This is different from the observed value we calculated earlier. But to get a better sense of the variability under random sampling we must repeat the process many times. Let us try making 5000 repetitions and drawing a histogram of the 5000 differences between means.

repetitions = 5000

test_stats = []

for i in range(repetitions):
    drops_shuffled = football.sample().column('Drop')
    patriots_shuffled = drops_shuffled[:11]
    colts_shuffled = drops_shuffled[11:]
    shuffled_difference = np.mean(patriots_shuffled) - np.mean(colts_shuffled)
    test_stats.append(shuffled_difference)

observation = np.mean(patriots) - np.mean(colts)

emp_p_value = np.count_nonzero(np.array(test_stats) >= observation) / repetitions

differences = Table().with_column('Difference in Means', test_stats)

differences.hist(bins=20)

print("Observation:", observation)

print("Empirical P-value:", emp_p_value)

Observation: 0.733522727273

Empirical P-value: 0.0028


The observed difference was roughly 0.7335 psi. According to the empirical distribution above, there is a very small chance that a random permutation would yield a difference that large. So the data support the conclusion that the two groups of pressures were not like a random permutation of all 15 pressures.

The independent investigative team analyzed the data in several different ways, taking into account the laws of physics. The final report said,

"[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls."

-- Investigative report commissioned by the NFL regarding the AFC Championship game on January 18, 2015

Our analysis shows an average pressure drop of about 0.73 psi, which is consistent with the official analysis.

The all-important question in the football world was whether the excess drop of pressure in the Patriots' footballs was deliberate. To that question, the data have no answer. If you are curious about the answer given by the investigators, here is the full report.
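As a cross-check, the permutation analysis above can be reproduced without the datascience library, using only NumPy and the 15 half-time drop values from the table. The random seed here is an arbitrary choice of ours:

```python
import numpy as np

# Half-time pressure drops (psi) from the table above
patriots_drops = np.array([0.85, 1.475, 1.175, 1.65, 1.225, 0.725,
                           0.425, 1.175, 1.35, 1.8, 1.375])
colts_drops = np.array([0.475, 0.475, 0.275, 0.65])

drops = np.append(patriots_drops, colts_drops)
observed = patriots_drops.mean() - colts_drops.mean()  # about 0.7335 psi

rng = np.random.default_rng(0)  # arbitrary seed
repetitions = 10000
count = 0
for _ in range(repetitions):
    shuffled = rng.permutation(drops)
    # The first 11 shuffled values play the role of the Patriots' balls,
    # the remaining 4 the Colts' balls
    diff = shuffled[:11].mean() - shuffled[11:].mean()
    if diff >= observed:
        count += 1
p_value = count / repetitions
```

The observed difference matches the 0.7335 psi computed earlier, and the empirical P-value lands in the same small range as the Table-based simulation's.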


A/B Testing

Using the Bootstrap Method to Test Hypotheses

We have used random permutations to see whether two samples are drawn from the same underlying distribution. The bootstrap method can also be used in such statistical tests of hypotheses.

The examples in this section are based on data on a random sample of 1,174 pairs of mothers and their newborn infants. The table baby contains the data. Each row represents a mother-baby pair. The variables are:

- the baby's birth weight in ounces
- the number of gestational days
- the mother's age in completed years
- the mother's height in inches
- the mother's pregnancy weight in pounds
- whether or not the mother was a smoker

baby = Table.read_table('baby.csv')

baby


Birth Weight    Gestational Days    Maternal Age    Maternal Height    Maternal Pregnancy Weight    Maternal Smoker

120    284    27    62    100    False
113    282    33    64    135    False
128    279    28    64    115    True
108    282    23    67    125    True
136    286    25    62    93     False
138    244    33    62    178    False
132    245    23    65    140    False
120    289    25    62    125    False
143    299    30    66    136    True
140    351    27    68    120    False

... (1164 rows omitted)

Suppose we want to compare the mothers who smoke and the mothers who are non-smokers. Do they differ in any way other than smoking? A key variable of interest is the birth weight of their babies. To study this variable, we begin by noting that there are 715 non-smokers among the women in the sample, and 459 smokers.

baby.group('Maternal Smoker')

Maternal Smoker    count

False              715

True               459

The first histogram below displays the distribution of birth weights of the babies of the non-smokers in the sample. The second displays the birth weights of the babies of the smokers.

baby.where('Maternal Smoker', False).hist('Birth Weight', bins=np.arange(40, 181, 5))  # bin endpoints truncated in the source; these are plausible values


baby.where('Maternal Smoker', True).hist('Birth Weight', bins=np.arange(40, 181, 5))  # bin endpoints truncated in the source; these are plausible values

Both distributions are approximately bell shaped and centered near 120 ounces. The distributions are not identical, of course, which raises the question of whether the difference reflects just chance variation or a difference in the distributions in the population.

This question can be answered by a test of the null and alternative hypotheses below. Because we are testing for the equality of two distributions, one for Category A (mothers who don't smoke) and the other for Category B (mothers who smoke), the method is known rather unimaginatively as A/B testing.

Null hypothesis. In the population, the distribution of birth weights of babies is the same for mothers who don't smoke as for mothers who do. The difference in the sample is due to chance.

Alternative hypothesis. The two distributions are different in the population.


Test statistic. Birth weight is a quantitative variable, so it is reasonable to use the difference between means as the test statistic. There are other reasonable test statistics as well; the difference between means is just one simple choice.

The observed difference between the means of the two groups in the sample is about 9.27 ounces.

nonsmokers_mean = np.mean(baby.where('Maternal Smoker', False).column('Birth Weight'))

smokers_mean = np.mean(baby.where('Maternal Smoker', True).column('Birth Weight'))

nonsmokers_mean - smokers_mean

9.2661425720249184

Implementing the bootstrap method

To see whether such a difference could have arisen due to chance under the null hypothesis, we could use a permutation test just as we did in the previous section. An alternative is to use the bootstrap, which we will do here.

Under the null hypothesis, the distributions of birth weights are the same for babies of women who smoke and for babies of those who don't. To see how much the distributions could vary due to chance alone, we will use the bootstrap method and draw new samples at random with replacement from the entire set of birth weights. The only difference between this and the permutation test of the previous section is that random permutations are based on sampling without replacement.
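That distinction is easy to see on a toy array with plain NumPy (the array and seed below are ours, purely for illustration): a permutation rearranges exactly the original values, while a bootstrap resample draws with replacement and may repeat or omit values.

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50])
rng = np.random.default_rng(42)  # arbitrary seed

# Permutation: sampling without replacement; same multiset of values, new order
permuted = rng.permutation(values)

# Bootstrap resample: sampling with replacement; same length,
# but individual values may repeat or be left out
resampled = rng.choice(values, size=len(values), replace=True)
```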

We will perform 10,000 repetitions of the bootstrap process. There are 1,174 babies, so in each repetition of the bootstrap process we draw 1,174 times at random with replacement. Of these 1,174 draws, 715 are assigned to the non-smoking mothers and the remaining 459 to the mothers who smoke.

The code starts by sorting the rows so that the rows corresponding to the 715 non-smokers appear first. This eliminates the need to use conditional expressions to identify the rows corresponding to smokers and non-smokers in each replication of the sampling process.


"""Bootstrap test for the difference in mean birth weight

Category A: non-smoker
Category B: smoker"""

sorted_by_smoking = baby.select(['Maternal Smoker', 'Birth Weight']).sort('Maternal Smoker')

# calculate the observed difference in means
meanA = np.mean(sorted_by_smoking.column('Birth Weight')[:715])
meanB = np.mean(sorted_by_smoking.column('Birth Weight')[715:])
observed_difference = meanA - meanB

repetitions = 10000
diffs = []

for i in range(repetitions):
    # sample WITH replacement, same number as original sample size
    resample = sorted_by_smoking.sample(with_replacement=True)
    # Compute the difference of the means of the resampled values,
    # between Categories A and B
    diff_means = np.mean(resample.column('Birth Weight')[:715]) - np.mean(resample.column('Birth Weight')[715:])
    diffs.append(diff_means)

# Display results
differences = Table().with_column('diff_in_means', diffs)
differences.hist(bins=20)

plots.xlabel('Approx null distribution of difference in means')
plots.title('Bootstrap A/B Test')
plots.xlim(-4, 4)
plots.ylim(0, 0.4)

print('Observed difference in means:', observed_difference)

Observed difference in means: 9.26614257202


The figure shows that under the null hypothesis of equal distributions in the population, the bootstrap empirical distribution of the difference between the sample means of the two groups is roughly bell shaped, centered at 0, and contained within the range of about -4 ounces to 4 ounces shown in the figure. The observed difference in the original sample is about 9.27 ounces, which is inconsistent with this distribution. So the conclusion of the test is that in the population, the distributions of birth weights of the babies of non-smokers and smokers are different.

Bootstrap A/B testing

Let us define a function bootstrap_AB_test that performs an A/B test using the bootstrap method and the difference in sample means as the test statistic. The null hypothesis is that the two underlying distributions in the population are equal; the alternative is that they are not.

The arguments of the function are:

- the name of the table that contains the data in the original sample
- the label of the column containing the code 0 for Category A and 1 for Category B
- the label of the column containing the values of the response variable (that is, the variable whose distribution is of interest)
- the number of repetitions of the resampling procedure

The function returns the observed difference in means, the bootstrap empirical distribution of the difference in means, and the bootstrap empirical P-value. Because the alternative simply says that the two underlying distributions are different, the P-value is computed as the proportion of sampled differences that are at least as large in absolute value as the absolute value of the observed difference.
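The absolute-value comparison in that last step can be sketched on made-up numbers (the array below is illustrative, not the birth-weight resamples):

```python
import numpy as np

# Hypothetical bootstrap differences in means under the null hypothesis
diffs = np.array([-2.0, 0.5, 3.1, -3.5, 1.2, -0.3, 2.9, 0.1])
observed = 3.0  # hypothetical observed difference

# Two-sided empirical P-value: the proportion of resampled differences
# at least as large in absolute value as the observed difference
p_two_sided = np.count_nonzero(np.abs(diffs) >= abs(observed)) / len(diffs)
```

For these numbers, two of the eight differences (3.1 and -3.5) are at least 3.0 in absolute value, so the two-sided P-value is 0.25.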


"""Bootstrap A/B test for the difference in the mean response

Assumes A = 0, B = 1"""

def bootstrap_AB_test(table, categories, values, repetitions):

    # Sort the sample table according to the A/B column;
    # then select only the column of responses.
    response = table.sort(categories).select(values)

    # Find the number of entries in Category A.
    n_A = table.where(categories, 0).num_rows

    # Calculate the observed value of the test statistic.
    meanA = np.mean(response.column(values)[:n_A])
    meanB = np.mean(response.column(values)[n_A:])
    observed_difference = meanA - meanB

    # Run the bootstrap procedure and get a list of resampled differences in means
    diffs = []
    for i in range(repetitions):
        resample = response.sample(with_replacement=True)
        d = np.mean(resample.column(values)[:n_A]) - np.mean(resample.column(values)[n_A:])
        diffs.append(d)

    # Compute the bootstrap empirical P-value
    diff_array = np.array(diffs)
    empirical_p = np.count_nonzero(abs(diff_array) >= abs(observed_difference)) / repetitions

    # Display results
    differences = Table().with_column('diff_in_means', diffs)
    differences.hist(bins=20, normed=True)
    plots.xlabel('Approx null distribution of difference in means')
    plots.title('Bootstrap A/B Test')
    print("Observed difference in means:", observed_difference)
    print("Bootstrap empirical P-value:", empirical_p)

We can now use the function bootstrap_AB_test to compare the smokers and non-smokers with respect to several different response variables. The tests show a statistically significant difference between the two groups in birth weight (as shown earlier), gestational days,


maternal age, and maternal pregnancy weight. It comes as no surprise that the two groups do not differ significantly in their mean heights.

bootstrap_AB_test(baby, 'Maternal Smoker', 'Birth Weight', 10000)

Observed difference in means: 9.26614257202

Bootstrap empirical P-value: 0.0

bootstrap_AB_test(baby, 'Maternal Smoker', 'Gestational Days', 10000)

Observed difference in means: 1.97652238829

Bootstrap empirical P-value: 0.0367


bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Age', 10000)

Observed difference in means: 0.80767250179

Bootstrap empirical P-value: 0.0192

bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Pregnancy Weight', 10000)

Observed difference in means: 2.56033030151

Bootstrap empirical P-value: 0.0374


bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Height', 10000)

Observed difference in means: -0.0905891494127

Bootstrap empirical P-value: 0.5411


Regression Inference

Regression: a Review

Earlier in the course, we developed the concepts of correlation and regression as ways to describe relations between quantitative variables. We will now see how these concepts can become powerful tools for inference, when used appropriately.

First, let us briefly review the main ideas of correlation and regression, along with code for calculations.

It is often convenient to measure variables in standard units. When you convert a value to standard units, you are measuring how many standard deviations above average the value is.

The function standard_units takes an array as its argument and returns the array converted to standard units.

def standard_units(x):
    return (x - np.mean(x)) / np.std(x)

The correlation between two variables measures the degree to which a scatter plot of the two variables is clustered around a straight line. It has no units and is a number between -1 and 1. It is calculated as the average of the product of the two variables, when both variables are measured in standard units.

The function correlation takes three arguments - the name of a table and the labels of two columns of the table - and returns the correlation between the two columns.

def correlation(table, x, y):
    x_in_standard_units = standard_units(table.column(x))
    y_in_standard_units = standard_units(table.column(y))
    return np.mean(x_in_standard_units * y_in_standard_units)

Here is the scatter plot of birth weight (the y-variable) versus gestational days (the x-variable) of the babies in the table baby. The plot shows a positive association, and correlation returns a value of just over 0.4.


correlation(baby, 'Gestational Days', 'Birth Weight')

0.40754279338885108

baby.scatter('Gestational Days', 'Birth Weight')

When scatter is called with the option fit_line=True, the regression line is drawn on the scatter plot.

baby.scatter('Gestational Days', 'Birth Weight', fit_line=True)


The regression line is the best among all straight lines that could be used to estimate y based on x, in the sense that it minimizes the mean squared error of estimation.

The slope of the regression line is given by:

slope = r * (SD of y) / (SD of x)

where r is the correlation.

The intercept of the regression line is given by:

intercept = (mean of y) - slope * (mean of x)

The functions slope and intercept calculate these two quantities. Each takes the same three arguments as correlation: the name of the table and the labels of columns x and y, in that order.

def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y)) / np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))
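As a sanity check on these formulas, here is a standalone NumPy sketch on synthetic data (the line parameters, noise level, and seed are invented for the example): the correlation-based slope and intercept agree with an ordinary least-squares line fit.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
x = np.arange(50, dtype=float)
y = 0.5 * x + 3.0 + rng.normal(0, 1.0, size=50)  # synthetic data

# Correlation, slope, and intercept from the formulas above
r = np.mean(((x - x.mean()) / x.std()) * ((y - y.mean()) / y.std()))
slope_value = r * y.std() / x.std()
intercept_value = y.mean() - slope_value * x.mean()

# A least-squares fit recovers the same line
ls_slope, ls_intercept = np.polyfit(x, y, 1)
```

The agreement is not a coincidence: r * SD(y) / SD(x) simplifies algebraically to the least-squares slope.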


We can now calculate the slope and the intercept of the regression line drawn above. The slope is about 0.47 ounces per gestational day.

slope(baby, 'Gestational Days', 'Birth Weight')

0.46655687694921522

The intercept is about -10.75 ounces.

intercept(baby, 'Gestational Days', 'Birth Weight')

-10.754138914450252

Thus the equation of the regression line for estimating birth weight based on gestational days is:

estimated birth weight = 0.47 * (gestational days) - 10.75 ounces

The fitted value at a given value of x is the estimate of y based on that value of x. In other words, the fitted value at a given value of x is the height of the regression line at that x.

The function fitted_value computes this height. Like the functions correlation, slope, and intercept, its arguments include the name of the table and the labels of the x and y columns. But it also requires a fourth argument, which is the value of x at which the estimate will be made.

def fitted_value(table, x, y, given_x):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * given_x + b

The fitted value at 300 gestational days is about 129.2 ounces. For a pregnancy that has a length of 300 gestational days, our estimate for the baby's weight is about 129.2 ounces.

fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)


129.2129241703143

The function fit returns an array consisting of the fitted values at all of the values of x in the table.

def fit(table, x, y):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

As we know, the Table method scatter can be used to draw a scatter plot with a regression line through it. Here is another way, using fit. The function scatter_fit takes the same three arguments as fit and returns a scatter diagram along with the straight line of fitted values.

def scatter_fit(table, x, y):
    plots.scatter(table.column(x), table.column(y), s=15)
    plots.plot(table.column(x), fit(table, x, y), lw=2, color='darkblue')
    plots.xlabel(x)
    plots.ylabel(y)

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

# The next two lines just make the plot easier to read
plots.ylim([0, 200])
None


Assumptions of randomness: a "regression model"

Thus far in this section, our analysis has been purely descriptive. But recall that the data are a random sample of all the births in a system of hospitals. What do the data say about the whole population of births? What could they say about a new birth at one of the hospitals?

Questions of inference may arise if we believe that a scatter plot reflects the underlying relation between the two variables being plotted but does not specify the relation completely. For example, a scatter plot of birth weight versus gestational days shows us the precise relation between the two variables in our sample; but we might wonder whether that relation holds true, or almost true, for all babies in the population from which the sample was drawn, or indeed among babies in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as models. Sets of assumptions about randomness in roughly linear scatter plots are called regression models.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the signal that we would like to identify. However, we are not able to see the line clearly. What we see are points that are scattered around the line. In each of the points, the signal has been contaminated by random noise. Our inferential goal, therefore, is to separate the signal from the noise.

In greater detail, the regression model specifies that the points in the scatter plot are generated at random as follows.

Therelationbetween and isperfectlylinear.Wecannotseethis"trueline"but


Tyche can. She is the Goddess of Chance. Tyche creates the scatter plot by taking points on the line and pushing them off the line vertically, either above or below, as follows:

For each x, Tyche finds the corresponding point on the true line, and then adds an error. The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0. Tyche creates a point whose horizontal coordinate is x and whose vertical coordinate is "the height of the true line at x, plus the error".

Finally, Tyche erases the true line from the scatter, and shows us just the scatter plot of her points.
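The generating process just described can be sketched in a few lines of numpy. This is a minimal illustration, not the book's code; true_slope, true_int, and noise_sd below are invented values:

```python
import numpy as np

np.random.seed(0)  # illustrative; fixes the random draws

# Hypothetical true line and noise level
true_slope, true_int, noise_sd = 0.5, -10.0, 6.0

# Tyche's process: take points on the true line, then push each one
# off vertically by an error drawn from a normal distribution with mean 0
x = np.random.uniform(200, 350, 100)
errors = np.random.normal(0, noise_sd, 100)
y = true_slope * x + true_int + errors   # the scatter plot we get to see

# The errors average out to roughly 0, so the points straddle the true line
print(np.mean(errors))
```

Because the errors have mean 0, about half of the simulated points land above the invisible true line and half below it.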

Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below shows how close the regression line is to the true line. The first panel shows how Tyche generates the scatter plot from the true line; the second shows the scatter plot that we see; the third shows the regression line through the plot; and the fourth shows both the regression line and the true line.

To run the simulation, call the function draw_and_compare with three arguments: the slope of the true line, the intercept of the true line, and the sample size.

Run the simulation a few times, with different values for the slope and intercept of the true line, and varying sample sizes. Because all the points are generated according to the model, you will see that the regression line is a good estimate of the true line if the sample size is moderately large.


def draw_and_compare(true_slope, true_int, sample_size):
    x = np.random.normal(50, 5, sample_size)
    xlims = np.array([np.min(x), np.max(x)])
    eps = np.random.normal(0, 6, sample_size)
    y = (true_slope * x + true_int) + eps
    tyche = Table([x, y], ['x', 'y'])

    plots.figure(figsize=(6, 16))
    plots.subplot(4, 1, 1)
    plots.scatter(tyche['x'], tyche['y'], s=15)
    plots.plot(xlims, true_slope * xlims + true_int, lw=2, color='green')
    plots.title('What Tyche draws')

    plots.subplot(4, 1, 2)
    plots.scatter(tyche['x'], tyche['y'], s=15)
    plots.title('What we get to see')

    plots.subplot(4, 1, 3)
    scatter_fit(tyche, 'x', 'y')
    plots.xlabel("")
    plots.ylabel("")
    plots.title('Regression line: our estimate of true line')

    plots.subplot(4, 1, 4)
    scatter_fit(tyche, 'x', 'y')
    xlims = np.array([np.min(tyche['x']), np.max(tyche['x'])])
    plots.plot(xlims, true_slope * xlims + true_int, lw=2, color='green')
    plots.title("Regression line and true line")

# Tyche's true line,
# the points she creates,
# and our estimate of the true line.
# Arguments: true slope, true intercept, number of points
draw_and_compare(3, -5, 25)


Prediction using Regression ¶

In reality, of course, we are not Tyche, and we will never see the true line. What the simulation shows is that if the regression model looks plausible, and if we have a large sample, then the regression line is a good approximation to the true line.

The scatter diagram of birth weights versus gestational days looks roughly linear. Let us assume that the regression model holds. Suppose now that at the hospital there is a new baby who has 300 gestational days. We can use the regression line to predict the birth weight of this baby.

As we saw earlier, the fitted value at 300 gestational days was about 129.2 ounces. That is our prediction for the birth weight of the new baby.

fit_300 = fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)

fit_300

129.2129241703143

The figure below shows where the prediction lies on the regression line. The red line is at x = 300.

scatter_fit(baby, 'Gestational Days', 'Birth Weight')
plots.scatter(300, fit_300, color='red', s=20)
plots.plot([300, 300], [0, fit_300], color='red', lw=1)
plots.ylim([0, 200])

None


The Variability of the Prediction ¶

We have developed a method for predicting a new baby's birth weight based on the number of gestational days. But as data scientists, we know that the sample might have been different. Had the sample been different, the regression line would have been different too, and so would our prediction. To see how good our prediction is, we must get a sense of how variable the prediction can be.

One way to do this would be by generating new random samples of points and making a prediction based on each new sample. To generate new samples, we can bootstrap the scatter plot.

Specifically, we can simulate new samples by random sampling with replacement from the original scatter plot, as many times as there are points in the scatter.

Here is the original scatter diagram from the sample, and four replications of the bootstrap resampling procedure. Notice how the resampled scatter plots are in general a little more sparse than the original. That is because some of the original points do not get selected in the samples.
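Resampling the scatter plot amounts to drawing row indices at random with replacement. A minimal sketch with numpy; the two short arrays below are illustrative stand-ins for the table's columns:

```python
import numpy as np

np.random.seed(1)  # illustrative; fixes the random draws

# Stand-ins for the first ten rows of the scatter plot
gestational_days = np.array([284, 282, 279, 282, 286, 244, 245, 289, 299, 351])
birth_weight = np.array([120, 113, 128, 108, 136, 138, 132, 120, 143, 140])

# One bootstrap sample: draw as many row indices as there are points,
# with replacement, and keep the corresponding (x, y) pairs together
n = len(gestational_days)
rows = np.random.randint(0, n, n)
boot_x, boot_y = gestational_days[rows], birth_weight[rows]

# Some rows appear more than once, so others are left out entirely;
# that is why the resampled scatter plots look sparser
print(len(np.unique(rows)) < n)
```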


plots.figure(figsize=(8, 18))
plots.subplot(5, 1, 1)
plots.scatter(baby[1], baby[0], s=10)
plots.xlim([150, 400])
plots.title('Original sample')

for i in np.arange(1, 5, 1):
    plots.subplot(5, 1, i + 1)
    rep = baby.sample(with_replacement=True)
    plots.scatter(rep[1], rep[0], s=10)
    plots.xlim([150, 400])
    plots.title('Bootstrap sample ' + str(i))


The next step is to fit the regression line to the scatter plot in each replication, and make a prediction based on each line. The figure below shows 10 such lines, and the corresponding predicted birth weight at 300 gestational days.

x = 300

lines = Table(['slope', 'intercept'])
for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

lines['prediction at x=' + str(x)] = lines.column('slope') * x + lines.column('intercept')

xlims = np.array([291, 309])
left = xlims[0] * lines[0] + lines[1]
right = xlims[1] * lines[0] + lines[1]
fit_x = x * lines['slope'] + lines['intercept']

for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x, fit_x[i], s=20)

The predictions vary from one line to the next. The table below shows the slope and intercept of each of the 10 lines, along with the prediction.


lines

slope       intercept   prediction at x=300
0.446731    -5.08136    128.938
0.417925    3.65604     129.034
0.395636    8.95328     127.644
0.512511    -22.9981    130.755
0.477034    -13.4303    129.68
0.515773    -24.5881    130.144
0.48213     -15.7025    128.936
0.492106    -18.7225    128.909
0.434989    -1.72409    128.773
0.486076    -16.5506    129.272

Bootstrap Prediction Interval ¶

If we increase the number of repetitions of the resampling process, we can generate an empirical histogram of the predictions. This will allow us to create a prediction interval, using methods like those we used earlier to create bootstrap confidence intervals for numerical parameters.

Let us define a function called bootstrap_prediction to do this. The function takes five arguments:

- the name of the table
- the column labels of the predictor and response variables, in that order
- the value of x at which to make the prediction
- the desired number of bootstrap repetitions

In each repetition, the function bootstraps the original scatter plot and calculates the equation of the regression line through the bootstrapped plot. It then uses this line to find the predicted value of y based on the specified value of x. Specifically, it calls the function fitted_value that we defined earlier in this section, to calculate the slope and intercept of the regression line and then calculate the fitted value at the specified x.


Finally, it collects all of the predicted values into a column of a table, draws the empirical histogram of these values, and prints the interval consisting of the "middle 95%" of the predicted values. It also prints the predicted value based on the regression line through the original scatter plot.

# Bootstrap prediction of variable y at new_x
# Data contained in table; prediction by regression of y based on x
# repetitions = number of bootstrap replications of the original scatter plot

def bootstrap_prediction(table, x, y, new_x, repetitions):

    # For each repetition:
    # Bootstrap the scatter;
    # get the regression prediction at new_x;
    # augment the predictions list
    predictions = []
    for i in range(repetitions):
        bootstrap_sample = table.sample(with_replacement=True)
        predicted_value = fitted_value(bootstrap_sample, x, y, new_x)
        predictions.append(predicted_value)

    # Prediction based on original sample
    original = fitted_value(table, x, y, new_x)

    # Display results
    pred = Table().with_column('Prediction', predictions)
    pred.hist(bins=20)
    plots.xlabel('predictions at x=' + str(new_x))
    print('Height of regression line at x=' + str(new_x) + ':', original)
    print('Approximate 95%-confidence interval:')
    print((pred.percentile(2.5).rows[0][0], pred.percentile(97.5).rows[0][0]))

bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 300, 5000)

Height of regression line at x=300: 129.21292417

Approximate 95%-confidence interval:

(127.25683835787581, 131.33042912205815)


The figure above shows a bootstrap empirical histogram of the predicted birth weight of a baby at 300 gestational days, based on 5,000 repetitions of the bootstrap process. The empirical distribution is roughly normal.

An approximate 95% prediction interval has been constructed by taking the "middle 95%" of the predictions, that is, the interval from the 2.5th percentile to the 97.5th percentile of the predictions. The interval ranges from about 127 to about 131. The prediction based on the original sample was about 129, which is close to the center of the interval.
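The "middle 95%" is nothing more than a pair of percentiles of the bootstrap predictions. A sketch using np.percentile; the predictions array here is simulated for illustration rather than produced by actual resampling:

```python
import numpy as np

np.random.seed(2)  # illustrative; fixes the random draws

# Stand-in for 5,000 bootstrap predictions, roughly normal around 129.2
predictions = np.random.normal(129.2, 1.0, 5000)

# Middle 95%: from the 2.5th percentile to the 97.5th percentile
left, right = np.percentile(predictions, [2.5, 97.5])
print(round(left, 1), round(right, 1))   # an interval around 129.2
```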

The figure below shows the histogram of 5,000 bootstrap predictions at 285 gestational days. The prediction based on the original sample is about 122 ounces, and the interval ranges from about 121 ounces to about 123 ounces.

bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 285, 5000)

Height of regression line at x=285: 122.214571016

Approximate 95%-confidence interval:

(121.17047827189757, 123.29506281889765)


Notice that this interval is narrower than the prediction interval at 300 gestational days. Let us investigate the reason for this.

The mean number of gestational days is about 279 days:

np.mean(baby['Gestational Days'])

279.10136286201021

So 285 is nearer to the center of the distribution than 300 is. Typically, the regression lines based on the bootstrap samples are closer to each other near the center of the distribution of the predictor variable. Therefore all of the predicted values are closer together as well. This explains the narrower width of the prediction interval.
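A quick numerical way to see this uses the fact that every regression line passes through the point of averages, so the bootstrap lines roughly pivot around that point. The sketch below is illustrative only: the slope SD of 0.045 and the center (279 days, 120 ounces) are invented round numbers, not computed from the data:

```python
import numpy as np

np.random.seed(3)  # illustrative; fixes the random draws

# Invented bootstrap variation in the slope, with every line forced
# to pass through an assumed point of averages (279 days, 120 ounces)
slopes = np.random.normal(0.47, 0.045, 1000)
intercepts = 120 - slopes * 279

pred_285 = slopes * 285 + intercepts   # predictions 6 days from center
pred_300 = slopes * 300 + intercepts   # predictions 21 days from center

# The farther from the center, the more the lines have diverged,
# so the predictions at 300 are more spread out than those at 285
print(np.std(pred_285) < np.std(pred_300))   # True
```

Since each prediction is the center height plus the slope times the distance from the center, the spread at 300 days is exactly 21/6 times the spread at 285 days in this sketch.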

You can see this in the figure below, which shows predictions at x = 300 and x = 285 for each of ten bootstrap replications. Typically, the lines are farther apart at x = 300 than at x = 285, and therefore the predictions at x = 300 are more variable.


x1 = 300
x2 = 285

lines = Table(['slope', 'intercept'])
for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

xlims = np.array([260, 310])
left = xlims[0] * lines[0] + lines[1]
right = xlims[1] * lines[0] + lines[1]
fit_x1 = x1 * lines['slope'] + lines['intercept']
fit_x2 = x2 * lines['slope'] + lines['intercept']

plots.xlim(xlims)
for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x1, fit_x1[i], s=20)
    plots.scatter(x2, fit_x2[i], s=20)

Is the observed slope real? ¶


We have developed a method of predicting the birth weight of a new baby in the hospital system, based on the baby's number of gestational days. Our prediction makes use of the positive association that we observed between birth weight and gestational days.

But what if that observation is spurious? In other words, what if Tyche's true line was flat, and the positive association that we observed was just due to randomness in the points that she generated?

Here is a simulation that illustrates why this question arises. We will once again call the function draw_and_compare, this time requiring Tyche's true line to have slope 0. Our goal is to see whether our regression line shows a slope that is not 0.

Remember that the arguments to the function draw_and_compare are the slope and the intercept of Tyche's true line, and the number of points that she generates.

draw_and_compare(0, 10, 25)


Run the simulation a few times, keeping the slope of Tyche's line 0 each time. You will notice that while the slope of the true line is 0, the slope of the regression line typically is not 0. The regression line sometimes slopes upwards, and sometimes downwards, each time giving us a false impression that the two variables are correlated.

Testing whether the slope of the true line is 0 ¶

These simulations show that before we put too much faith in our regression predictions, it is worth checking whether the true line does indeed have a slope that is not 0.

We are in a good position to do this, because we know how to bootstrap the scatter diagram and draw a regression line through each bootstrapped plot. Each of those lines has a slope, so we can simply collect all the slopes and draw their empirical histogram. We can then construct an approximate 95% confidence interval for the slope of the true line, using the bootstrap percentile method that is now so familiar.

If the confidence interval for the true slope does not contain 0, then we can be about 95% confident that the true slope is not 0. If the interval does contain 0, then we cannot conclude that there is a genuine linear association between the variable that we are trying to predict and the variable that we have chosen to use as a predictor. We might therefore want to look for a different predictor variable on which to base our predictions.

The function bootstrap_slope executes our plan. Its arguments are the name of the table and the labels of the predictor and response variables, and the desired number of bootstrap replications. In each replication, the function bootstraps the original scatter plot and calculates the slope of the resulting regression line. It then draws the histogram of all the generated slopes and prints the interval consisting of the "middle 95%" of the slopes.

Notice that the code for bootstrap_slope is almost identical to that for bootstrap_prediction, except that now all we need from each replication is the slope, not a predicted value.


def bootstrap_slope(table, x, y, repetitions):

    # For each repetition:
    # Bootstrap the scatter, get the slope of the regression line,
    # augment the list of generated slopes
    slopes = []
    for i in range(repetitions):
        bootstrap_scatter = table.sample(with_replacement=True)
        slopes.append(slope(bootstrap_scatter, x, y))

    # Slope of the regression line from the original sample
    observed_slope = slope(table, x, y)

    # Display results
    slopes = Table().with_column('slopes', slopes)
    slopes.hist(bins=20)
    print('Slope of regression line:', observed_slope)
    print('Approximate 95%-confidence interval for the true slope:')
    print((slopes.percentile(2.5).rows[0][0], slopes.percentile(97.5).rows[0][0]))

Let us call bootstrap_slope to see whether the true linear relation between birth weight and gestational days has slope 0; in other words, whether Tyche's true line is flat.

bootstrap_slope(baby, 'Gestational Days', 'Birth Weight', 5000)

Slope of regression line: 0.466556876949

Approximate 95%-confidence interval for the true slope:

(0.37981209488027268, 0.55753912103512426)


The approximate 95% confidence interval for the true slope runs from about 0.38 ounces per day to about 0.56 ounces per day. This interval does not contain 0. Therefore we can conclude, with about 95% confidence, that the slope of the true line is not 0.

This conclusion supports our choice of using gestational days to predict birth weight.

Suppose now we try to estimate the birth weight of the baby based on the mother's age. Based on the sample, the slope of the regression line for estimating birth weight based on maternal age is positive, about 0.08 ounces per year:

slope(baby, 'Maternal Age', 'Birth Weight')

0.085007669415825132

But an approximate 95% bootstrap confidence interval for the true slope has a negative left endpoint and a positive right endpoint – in other words, the interval contains 0. Therefore we cannot reject the hypothesis that the slope of the true linear relation between maternal age and the baby's birth weight is 0. Based on this observation, it would be unwise to predict birth weight based on the regression model with maternal age as the predictor.

bootstrap_slope(baby, 'Maternal Age', 'Birth Weight', 5000)

Slope of regression line: 0.0850076694158

Approximate 95%-confidence interval for the true slope:

(-0.10015674488514391, 0.26974979944992733)


Using confidence intervals to test hypotheses ¶

Notice how we have used confidence intervals for the true slope to test the hypothesis that the true slope is 0. Constructing a confidence interval for a parameter is one way to test the null hypothesis that the parameter has a specified value. If that value is not in the confidence interval, we can reject the null hypothesis that the parameter has that value. If the value is in the confidence interval, we do not have enough evidence to reject the null hypothesis.

If the sample size is large and if the empirical distribution of the estimate of the parameter is roughly normal, then this method of testing via confidence intervals is justifiable. Using a 95% confidence interval corresponds to using a 5% cutoff for the P-value. The justifications of these statements are rather technical, and we will leave them to a more advanced course. For now, it is enough to note that confidence intervals can be used to test hypotheses in this way.

Indeed, the method is more constructive than other methods of testing hypotheses, because not only does it result in a decision about whether or not to reject the null hypothesis, it also provides an estimate of the parameter. For example, in the example of using gestational days to estimate birth weights, we didn't just conclude that the true slope was not 0 – we also estimated that the true slope is between 0.38 and 0.56 ounces per day.
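The decision rule can be written down directly: reject the null hypothesis that the true slope is 0 exactly when 0 lies outside the interval. A tiny sketch, using the two interval endpoints reported above:

```python
def contains_zero(left, right):
    """True if the confidence interval [left, right] contains 0."""
    return left <= 0 <= right

# Gestational days as the predictor: the interval does not contain 0
print(contains_zero(0.3798, 0.5575))    # False -> reject slope 0

# Maternal age as the predictor: the interval contains 0
print(contains_zero(-0.1002, 0.2697))   # True -> cannot reject slope 0
```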

Words of caution ¶

All of the predictions and tests that we have performed in this section assume that the regression model holds. Specifically, the methods assume that the scatter plot resembles points generated by starting with points that are on a straight line and then pushing them off the line by adding random noise.


If the scatter plot does not look like that, then perhaps the model does not hold for the data. If the model does not hold, then calculations that assume the model to be true are not valid.

Therefore, we must first decide whether the regression model holds for our data, before we start making predictions based on the model or testing hypotheses about parameters of the model. A simple way is to do what we did in this section, which is to draw the scatter diagram of the two variables and see whether it looks roughly linear and evenly spread out around a line. In the next section, we will study some more tools for assessing the model.


Regression Diagnostics

The methods that we have developed for inference in regression are valid if the regression model holds for our data. If the data do not satisfy the assumptions of the model, then it can be hard to justify calculations that assume that the model is good.

In this section we will develop some simple descriptive methods that can help us decide whether our data satisfy the regression model.

We will start with a review of the details of the model, which specifies that the points in the scatter plot are generated at random as follows.

Tyche creates the scatter plot by starting with points that lie perfectly on a straight line and then pushing them off the line vertically, either above or below, as follows:

For each x, Tyche finds the corresponding point on the true line, and then adds an error. The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0 and an unknown standard deviation. Tyche creates a point whose horizontal coordinate is x and whose vertical coordinate is "the height of the true line at x, plus the error". If the error is positive, the point is above the line. If the error is negative, the point is below the line.

We get to see the resulting points but not the true line on which the points started out, nor the random errors that Tyche added.

First Diagnostic: The Scatter Plot ¶

It is always helpful to start with a visualization. Continuing the example of estimating birth weight based on gestational days, let us look at a scatter plot of the two variables.

baby.scatter('Gestational Days', 'Birth Weight')


The points do appear to be roughly clustered around a straight line. There is no strikingly noticeable curve in the scatter.

A plot of the regression line through the scatter plot shows that about half the points are above the line and half are below the line, almost symmetrically, supporting the model's assumption that Tyche is adding random errors that are normally distributed with mean 0.

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

Second Diagnostic: A Residual Plot ¶


We cannot see the true line that is the basis of the regression model, but the regression line that we calculate acts as a substitute for it. So also the vertical distances of the points from the regression line act as substitutes for the random errors that Tyche uses to push the points vertically off the true line.

As we have seen earlier, the vertical distances of the points from the regression line are called residuals. There is one residual for each point in the scatter plot. The residual is the difference between the observed value of y and the fitted value of y:

residual = observed y − fitted y

For the point (x, y), the residual is y minus the height of the regression line at x.

The table baby_regression contains the data, the fitted values, and the residuals:

baby2 = baby.select(['Gestational Days', 'Birth Weight'])
fitted = fit(baby, 'Gestational Days', 'Birth Weight')
residuals = baby.column('Birth Weight') - fitted

baby_regression = baby2.with_columns([
    'Fitted Value', fitted,
    'Residual', residuals
])
baby_regression

Gestational Days    Birth Weight    Fitted Value    Residual
284                 120             121.748         -1.74801
282                 113             120.815         -7.8149
279                 128             119.415         8.58477
282                 108             120.815         -12.8149
286                 136             122.681         13.3189
244                 138             103.086         34.9143
245                 132             103.552         28.4477
289                 120             124.081         -4.0808
299                 143             128.746         14.2536
351                 140             153.007         -13.0073

... (1164 rows omitted)


Once again, it helps to visualize. The figure below shows four of the residuals; for each of the four points, the length of the red segment is the residual for that point. There is one such residual for each of the other 1170 points as well.

points = Table().with_columns([
    'Gestational Days', [148, 244, 338, 351],
    'Birth Weight', [116, 138, 102, 140]
])

scatter_fit(baby, 'Gestational Days', 'Birth Weight')
for i in range(4):
    f = fitted_value(baby, 'Gestational Days', 'Birth Weight', points.column(0)[i])
    plots.plot([points.column(0)[i]] * 2, [f, points.column(1)[i]], color='r')

According to the regression model, Tyche generates errors by drawing at random with replacement from a population of errors that is normally distributed with mean 0. To see whether this looks plausible for our data, we will examine the residuals, which are our substitutes for Tyche's errors.

A residual plot can be drawn by plotting the residuals against x. The function residual_plot does just that. The function regression_diagnostic_plots draws the original scatter plot as well as the residual plot, for ease of comparison.


def residual_plot(table, x, y):
    fitted = fit(table, x, y)
    residuals = table[y] - fitted
    t = Table().with_columns([
        x, table[x],
        'residuals', residuals
    ])
    t.scatter(x, 'residuals', color='r')
    xlims = [min(table[x]), max(table[x])]
    plots.plot(xlims, [0, 0], color='darkblue', lw=2)
    plots.title('Residual Plot')

def regression_diagnostic_plots(table, x, y):
    scatter_fit(table, x, y)
    residual_plot(table, x, y)

regression_diagnostic_plots(baby, 'Gestational Days', 'Birth Weight')


The height of each red point in the plot above is a residual. Notice how the residuals are distributed fairly symmetrically above and below the horizontal line at 0. Notice also that the vertical spread of the plot is fairly even across the most common values of x, in about the 200-300 range of gestational days. In other words, apart from a few outlying points, the plot isn't narrower in some places and wider in others.

All of this is consistent with the model's assumptions that the errors are drawn at random with replacement from a normally distributed population with mean 0. Indeed, a histogram of the residuals is centered at 0 and looks roughly bell shaped.

baby_regression.select('Residual').hist()


These observations show us that the data are fairly consistent with the assumptions of the regression model. Therefore it makes sense to do inference based on the model, as we did in the previous section.

Using the Residual Plot to Detect Nonlinearity ¶

Residual plots can be used to detect whether the regression model does not hold. The most serious departure from the model is non-linearity.

If two variables have a non-linear relation, the non-linearity is sometimes visible in the scatter plot. Often, however, it is easier to spot non-linearity in a residual plot than in the original scatter plot. This is usually because of the scales of the two plots: the residual plot allows us to zoom in on the errors and hence makes it easier to spot patterns.
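A small synthetic example makes the point. The coefficients below are invented: the data are mildly curved, yet the scatter of y against x looks almost straight; the residuals from a least-squares line (np.polyfit here stands in for the chapter's slope and intercept functions) show the curvature plainly:

```python
import numpy as np

# Mildly curved data: a strong linear trend plus a small quadratic term
x = np.linspace(0, 10, 50)
y = 1.5 * x - 0.08 * x**2

# Straight-line least-squares fit, then the residuals from that line
a, b = np.polyfit(x, y, 1)
residuals = y - (a * x + b)

# With the linear trend stripped away, the curve is unmistakable:
# negative residuals at both ends, positive residuals in the middle
print(residuals[0] < 0, residuals[len(x) // 2] > 0, residuals[-1] < 0)
```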

As an example, we will study a data set on the age and length of dugongs, which are marine mammals related to manatees and sea cows. The data are in a table called dugong. Age is measured in years and length in meters. The ages are estimated; most dugongs don't keep track of their birthdays.

dugong = Table.read_table('http://www.statsci.org/data/oz/dugongs.txt')

dugong


Age     Length
1       1.8
1.5     1.85
1.5     1.87
1.5     1.77
2.5     2.02
4       2.27
5       2.15
5       2.26
7       2.35
8       2.47

... (17 rows omitted)

Here is a regression of length (the response) on age (the predictor). The correlation between the two variables is about 0.83, which is quite high.

correlation(dugong, 'Age', 'Length')

0.82964745549057139

scatter_fit(dugong, 'Age', 'Length')


High correlation notwithstanding, the plot shows a curved pattern that is much more visible in the residual plot.

regression_diagnostic_plots(dugong, 'Age', 'Length')


At the low end of ages, the residuals are almost all negative; then they are almost all positive; then negative again at the high end of ages. Such a discernible pattern would have been unlikely had Tyche been picking errors at random with replacement and using them to push points off a straight line. Therefore the regression model does not hold for these data.

As another example, the data set us_women contains the average weights of U.S. women of different heights in the 1970s. Weights are measured in pounds and heights in inches. The correlation is spectacularly high – more than 0.99.

us_women = Table.read_table('us_women.csv')
correlation(us_women, 'height', 'ave weight')

0.99549476778421608

The points seem to hug the regression line pretty closely, but the residual plot flashes a warning sign: it is U-shaped, indicating that the relation between the two variables is not linear.

regression_diagnostic_plots(us_women, 'height', 'ave weight')


These examples demonstrate that even when correlation is high, fitting a straight line might not be the right thing to do. Residual plots help us decide whether or not to use linear regression.

Using the Residual Plot to Detect Heteroscedasticity ¶

Heteroscedasticity is a word that will surely be of interest to those who are preparing for Spelling Bees. For data scientists, its interest lies in its meaning, which is "uneven spread".

Recall the table hybrid that contains data on hybrid cars in the U.S. Here is a regression of fuel efficiency on the rate of acceleration. The association is negative: cars that accelerate quickly tend to be less efficient.


regression_diagnostic_plots(hybrid, 'acceleration', 'mpg')

Notice how the residual plot flares out towards the low end of the accelerations. In other words, the variability in the size of the errors is greater for low values of acceleration than for high values. Such a pattern would be unlikely under the regression model, because the model assumes that Tyche picks the errors at random with replacement from the same normally distributed population.

Uneven vertical variation in a residual plot can indicate that the regression model does not hold. Uneven variation is often more easily noticed in a residual plot than in the original scatter plot.


The Accuracy of the Regression Estimate ¶

We will end our discussion of regression by pointing out some interesting mathematical facts that lead to a measure of the accuracy of regression. We will leave the proofs to another course, but we will demonstrate the facts by examples. Note that all of the facts hold for all scatter plots, whether or not the regression model is good.

Fact 1. No matter what the shape of the scatter diagram, the residuals have an average of 0.

In all the residual plots above, you have seen the horizontal line at 0 going through the center of the plot. That is a visualization of this fact.

As a numerical example, here is the average of the residuals in the regression of birth weight on gestational days of babies:

np.mean(baby_regression.column('Residual'))

-6.3912532279613784e-15

That doesn't look like 0, but in fact it is a very small number that is 0 apart from rounding error.

Here is the average of the residuals in the regression of the length of dugongs on their age:

dugong_residuals = dugong.column('Length') - fit(dugong, 'Age', 'Length')

np.mean(dugong_residuals)

-2.8783559897689243e-16

Once again, the mean of the residuals is 0 apart from rounding error.

This is analogous to the fact that if you take any list of numbers and calculate the list of deviations from average, the average of the deviations is 0.

Fact 2. No matter what the shape of the scatter diagram, the SD of the fitted values is |r| times the SD of the observed values of y.

To understand this result, it is worth noting that the fitted values are all on the regression line, whereas the observed values of y are the heights of all the points in the scatter plot and are more variable.


scatter_fit(baby, 'Gestational Days', 'Birth Weight')

The fitted values of birth weight are, therefore, less variable than the observed values of birth weight. But how much less variable?

The scatter plot shows a correlation of about 0.4:

correlation(baby, 'Gestational Days', 'Birth Weight')

0.40754279338885108

Here is the ratio of the SD of the fitted values to the SD of the observed values of birth weight:

fitted_birth_weight = baby_regression.column('Fitted Value')
observed_birth_weight = baby_regression.column('Birth Weight')
np.std(fitted_birth_weight) / np.std(observed_birth_weight)

0.40754279338885108

The ratio is equal to r. Thus the SD of the fitted values is |r| times the SD of the observed values of y.

The example of fuel efficiency and acceleration is one where the correlation is negative:


correlation(hybrid, 'acceleration', 'mpg')

-0.5060703843771186

fitted_mpg = fit(hybrid, 'acceleration', 'mpg')

observed_mpg = hybrid.column('mpg')

np.std(fitted_mpg) / np.std(observed_mpg)

0.5060703843771186

The ratio of the two SDs is |r|. Thus the SD of the fitted values is |r| times the SD of the observed values of fuel efficiency.

The relation says that the closer the scatter diagram is to a straight line, the closer the fitted values will be to the observed values of y, and therefore the closer the variance of the fitted values will be to the variance of the observed values.

Therefore, the closer the scatter diagram is to a straight line, the closer the ratio of the two SDs will be to 1, and the closer r will be to 1 or -1.

Fact 3. No matter what the shape of the scatter diagram, the SD of the residuals is √(1 − r²) times the SD of the observed values of y. In other words, the root mean squared error of regression is √(1 − r²) times the SD of y.

This fact gives a measure of the accuracy of the regression estimate.

Again, we will demonstrate this by example. In the case of gestational days and birth weights, r is about 0.4 and √(1 − r²) is about 0.91:

r = correlation(baby, 'Gestational Days', 'Birth Weight')

np.sqrt(1 - r**2)

0.91318611003278638


residuals = baby_regression.column('Residual')

observed_birth_weight = baby_regression.column('Birth Weight')

np.std(residuals) / np.std(observed_birth_weight)

0.9131861100327866

Thus the SD of the residuals is √(1 − r²) times the SD of the observed birth weights.

It will come as no surprise that the same relation is true for the regression of fuel efficiency on acceleration:

r = correlation(hybrid, 'acceleration', 'mpg')

np.sqrt(1 - r**2)

0.862492183185677

residuals = hybrid.column('mpg') - fitted_mpg

observed_mpg = hybrid.column('mpg')

np.std(residuals) / np.std(observed_mpg)

0.862492183185677

Let us examine the extreme cases of the general fact that the SD of the residuals is √(1 − r²) times the SD of the observed values of y. Remember that the residuals always average out to 0.

If r = 1 or r = −1, then √(1 − r²) = 0. Therefore the residuals have an average of 0 and an SD of 0 as well; therefore, the residuals are all equal to 0. This is consistent with the observation that if r = ±1, the scatter plot is a perfect straight line and there is no error in the regression estimate.

If r = 0, then √(1 − r²) = 1 and the SD of the regression estimate is equal to the SD of y. This is consistent with the observation that if r = 0, then the regression line is a flat line at the average of y, and the root mean squared error of regression is the root mean squared deviation from the average of y, which is the SD of y.


For every other value of r, the rough size of the error of the regression estimate is somewhere between 0 and the SD of y.
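All three facts can be verified together on any data set. Here is a sketch using NumPy on synthetic data (the variable names and data are illustrative), computing the regression line from the familiar slope r · SD(y)/SD(x):

```python
import numpy as np

# Synthetic data; any collection of x-y pairs would work.
np.random.seed(0)
x = np.random.normal(size=500)
y = 0.6 * x + np.random.normal(size=500)

r = np.corrcoef(x, y)[0, 1]

# Regression line: slope r * SD(y)/SD(x), passing through the point of averages.
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
fitted = slope * x + intercept
residuals = y - fitted

# Fact 1: the residuals average to 0.
print(np.mean(residuals))
# Fact 2: the SD of the fitted values is |r| times the SD of y.
print(np.std(fitted) / np.std(y), abs(r))
# Fact 3: the SD of the residuals is sqrt(1 - r**2) times the SD of y.
print(np.std(residuals) / np.std(y), np.sqrt(1 - r**2))
```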


Chapter 5

Probability


Conditional Probability

Inspired by Peter Norvig's A Concrete Introduction to Probability.

Probability is the study of the observations that are generated by sampling from a known distribution. The statistical methods we have studied so far allow us to reason about the world from data. Probability allows us to reason in the opposite direction: what observations result from known facts about the world.

In practice, we rarely know precisely how random processes in the world work. Even the simplest random process such as flipping a coin is not as simple as one might assume. Nonetheless, our ability to reason about what data will result from a known random process is useful in many aspects of data science. Fundamentally, the rules of probability allow us to reason about the consequences of assumptions we make.

Probability Distributions

Probability uses the following vocabulary to describe the relationship between known distributions and their observed outcomes.

Experiment: An occurrence with an uncertain outcome that we can observe. For example, the result of rolling two dice in sequence.

Outcome: The result of an experiment; one particular state of the world. For example: the first die comes up 4 and the second comes up 2. This outcome could be summarized as (4, 2).

Sample Space: The set of all possible outcomes for the experiment. For example, (1, 1), (1, 2), (1, 3), ..., (1, 6), (2, 1), (2, 2), ..., (6, 4), (6, 5), (6, 6) are all possible outcomes of rolling two dice in sequence. There are 6 * 6 = 36 different outcomes.

Event: A subset of possible outcomes that together have some property we are interested in. For example, the event "the two dice sum to 5" is the set of outcomes (1, 4), (2, 3), (3, 2), (4, 1).

Probability: The proportion of experiments for which the event occurs. For example, the probability that the two dice sum to 5 is 4/36 or 1/9.

Distribution: The probability of all events.
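This vocabulary can be checked mechanically for the two-dice experiment. The sketch below uses plain Python rather than tables; the names are illustrative:

```python
from itertools import product
from fractions import Fraction

# Sample space: all outcomes of rolling two dice in sequence.
sample_space = list(product(range(1, 7), repeat=2))
print(len(sample_space))   # 36

# Event: the two dice sum to 5.
event = [(a, b) for (a, b) in sample_space if a + b == 5]
print(event)               # [(1, 4), (2, 3), (3, 2), (4, 1)]

# Probability: the proportion of outcomes in the event.
print(Fraction(len(event), len(sample_space)))   # 1/9
```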


The important part of this terminology is that the sample space is the set of all outcomes, and an event is a subset of the sample space. For an outcome that is an element of an event, we say that it is an outcome for which that event occurs.

Multinomial Distributions

There are many kinds of distributions, but we will focus on the case where the sample space is a fixed, finite set of mutually exclusive outcomes, called a multinomial distribution.

For example, either the two dice come up (2, 1) or they come up (4, 5) in a single roll (but both cannot occur), and there are exactly 36 outcome possibilities that each occur with equal chance.

two_dice = Table(['First', 'Second', 'Chance'])

for first in np.arange(1, 7):
    for second in np.arange(1, 7):
        two_dice.append([first, second, 1/36])

two_dice.set_format('Chance', PercentFormatter(1))

First Second Chance

1 1 2.8%

1 2 2.8%

1 3 2.8%

1 4 2.8%

1 5 2.8%

1 6 2.8%

2 1 2.8%

2 2 2.8%

2 3 2.8%

2 4 2.8%

... (26 rows omitted)

The outcomes are not always equally likely. In the example of two dice rolled in sequence, there are 36 different outcomes that each have a 1/36 chance of occurring. If two dice are rolled together and only the sum of the result is observed, instead there are 11 outcomes that have various different chances.


two_dice_sums = Table(['Sum', 'Chance']).with_rows([
    [2, 1/36], [3, 2/36], [4, 3/36], [5, 4/36], [6, 5/36], [7, 6/36],
    [12, 1/36], [11, 2/36], [10, 3/36], [9, 4/36], [8, 5/36]
]).sort(0)

two_dice_sums.set_format('Chance', PercentFormatter(1))

Sum Chance

2 2.8%

3 5.6%

4 8.3%

5 11.1%

6 13.9%

7 16.7%

8 13.9%

9 11.1%

10 8.3%

11 5.6%

... (1 rows omitted)

Outcomes and Events

These two different dice distributions are closely related to each other. Let's see how.

In a multinomial distribution, each outcome alone is an event with its own chance (or probability). The chances of all outcomes sum to 1. The probability of any event is the sum of the probabilities of the outcomes in that event. Therefore, by writing down the chance of each outcome, we implicitly describe the probability of every event for the whole distribution.

Let's consider the event that the sum of the dice in two sequential rolls is 5. We can represent that event as the matching rows in the two_dice table.

dice_sums = two_dice.column('First') + two_dice.column('Second')

sum_of_5 = two_dice.where(dice_sums == 5)

sum_of_5


First Second Chance

1 4 2.8%

2 3 2.8%

3 2 2.8%

4 1 2.8%

The probability of the event is the sum of the resulting chances.

sum(sum_of_5.column('Chance'))

0.1111111111111111

If we consistently store the probabilities of individual outcomes in a column called Chance, then we can define the probability of any event using the P function.

def P(event):
    return sum(event.column('Chance'))

P(sum_of_5)

0.1111111111111111

One distribution's events can be another distribution's outcomes, as is the case with the two_dice and two_dice_sums distributions. There are 11 different possible sums in the two_dice table, and these are 11 different mutually exclusive events.

with_sums = two_dice.with_column('Sum', dice_sums)

with_sums


First Second Chance Sum

1 1 2.8% 2

1 2 2.8% 3

1 3 2.8% 4

1 4 2.8% 5

1 5 2.8% 6

1 6 2.8% 7

2 1 2.8% 3

2 2 2.8% 4

2 3 2.8% 5

2 4 2.8% 6

... (26 rows omitted)

Grouping these outcomes by their sum shows how many different outcomes appear in each event.

with_sums.group('Sum')

Sum count

2 1

3 2

4 3

5 4

6 5

7 6

8 5

9 4

10 3

11 2

... (1 rows omitted)

Finally, the chance of each sum event is the total chance of all outcomes that match the sum event. In this way, we can generate the two_dice_sums distribution via addition.
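The same aggregation can be sketched in plain Python, accumulating the chance of each sum in a dictionary (the names are illustrative):

```python
from itertools import product

# Each of the 36 joint outcomes has chance 1/36; total the chance of each sum.
sum_chances = {}
for first, second in product(range(1, 7), repeat=2):
    total = first + second
    sum_chances[total] = sum_chances.get(total, 0) + 1/36

# The chance of a sum of 7 is 6/36, about 16.7%.
print(sum_chances[7])
```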


grouped = with_sums.select(['Sum', 'Chance']).group('Sum', sum)

grouped.relabeled(1, 'Chance').set_format('Chance', PercentFormatter(1))

Sum Chance

2 2.8%

3 5.6%

4 8.3%

5 11.1%

6 13.9%

7 16.7%

8 13.9%

9 11.1%

10 8.3%

11 5.6%

... (1 rows omitted)

Both distributions will assign the same probability to certain events. For example, the probability of the event that the sum of the dice is 8 can be computed from either distribution.

P(with_sums.where('Sum', 8))

0.1388888888888889

P(two_dice_sums.where('Sum', 8))

0.1388888888888889

U.S. Birth Times

Here's an example based on data collected in the world.


More babies in the United States are born in the morning than at night. In 2013, the Centers for Disease Control measured the following proportions of babies born during each hour of the day. The Time column is a description of each hour, and the Hour column is its numeric equivalent.

birth = Table.read_table('birth_time.csv').select(['Time', 'Hour', 'Chance'])

birth.set_format('Chance', PercentFormatter(1)).show()


Time Hour Chance

6 a.m. 6 2.9%

7 a.m. 7 4.5%

8 a.m. 8 6.3%

9 a.m. 9 5.0%

10 a.m. 10 5.0%

11 a.m. 11 5.0%

Noon 12 6.0%

1 p.m. 13 5.7%

2 p.m. 14 5.1%

3 p.m. 15 4.8%

4 p.m. 16 4.9%

5 p.m. 17 5.0%

6 p.m. 18 4.5%

7 p.m. 19 4.0%

8 p.m. 20 4.0%

9 p.m. 21 3.7%

10 p.m. 22 3.5%

11 p.m. 23 3.3%

Midnight 0 2.9%

1 a.m. 1 2.9%

2 a.m. 2 2.8%

3 a.m. 3 2.7%

4 a.m. 4 2.7%

5 a.m. 5 2.8%

If we assume that this distribution describes the world correctly, what is the chance that a baby will be born between 8 a.m. and 6 p.m.? It turns out that more than half of all babies are born during "business hours", even though this time interval only covers 5/12 of the day.

business_hours = birth.where('Hour', are.between(8, 18))

business_hours


Time Hour Chance

8 a.m. 8 6.3%

9 a.m. 9 5.0%

10 a.m. 10 5.0%

11 a.m. 11 5.0%

Noon 12 6.0%

1 p.m. 13 5.7%

2 p.m. 14 5.1%

3 p.m. 15 4.8%

4 p.m. 16 4.9%

5 p.m. 17 5.0%

P(business_hours)

0.52800000000000002

On the other hand, births late at night are uncommon. The chance of having a baby between midnight and 6 a.m. is much less than 25%, even though this time interval covers 1/4 of the day.

P(birth.where('Hour', are.between(0, 6)))

0.16799999999999998

Conditional Distributions

A common way to transform one distribution into another is to condition on an event. Conditioning means assuming that the event occurs. The conditional distribution given an event takes the chances of all outcomes for which the event occurs and scales up their chances so that they sum to 1.

To scale up the chances, we divide the chance of each outcome in the event by the probability of the event. For example, if we condition on the event that the sum of two dice is above 8, we arrive at the following conditional distribution.
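The scaling step can be sketched in plain Python, using the chances of sums above 8 read off the two_dice_sums distribution (the names are illustrative):

```python
# Chances of each sum above 8, from the two_dice_sums distribution.
above_8 = {9: 4/36, 10: 3/36, 11: 2/36, 12: 1/36}

# Probability of the conditioning event: 10/36.
p_event = sum(above_8.values())

# Conditional distribution: divide each chance by P(event) so they sum to 1.
given_above_8 = {s: c / p_event for s, c in above_8.items()}
print(given_above_8)   # roughly 40%, 30%, 20%, 10%
```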


above_8 = two_dice_sums.where('Sum', are.above(8))

given_8 = above_8.with_column('Chance', above_8.column('Chance') / P(above_8))

given_8

Sum Chance

9 40.0%

10 30.0%

11 20.0%

12 10.0%

A conditional distribution describes the chances under the original distribution, updated to reflect a new piece of information. In this case, we see the chances under the original two_dice_sums distribution, given that the sum is above 8.

The conditioning operation can be expressed by the given function, which takes an event table and returns the corresponding conditional distribution.

def given(event):
    return event.with_column('Chance', event.column('Chance') / P(event))

given(two_dice_sums.where('Sum', are.above(8)))

Sum Chance

9 40.0%

10 30.0%

11 20.0%

12 10.0%

Given that a baby is born in the US during business hours (between 8 a.m. and 6 p.m.), what are the chances that it is born before noon? To answer this question, we first form the conditional distribution given business hours, then compute the probability of the event that the baby is born before noon.

given(business_hours)


Time Hour Chance

8 a.m. 8 11.9%

9 a.m. 9 9.5%

10 a.m. 10 9.5%

11 a.m. 11 9.5%

Noon 12 11.4%

1 p.m. 13 10.8%

2 p.m. 14 9.7%

3 p.m. 15 9.1%

4 p.m. 16 9.3%

5 p.m. 17 9.5%

P(given(business_hours).where('Hour', are.below(12)))

0.40340909090909094

This result is not the same as the chance that a baby is born between 8 a.m. and noon in general.

morning = birth.where('Hour', are.between(8, 12))

P(morning)

0.21300000000000002

Nor is it the chance that a baby is born before noon in general.

P(birth.where('Hour', are.below(12)))

0.45500000000000007

Instead, it is the probability that both the event and the conditioning event are true (i.e., that the birth is in the morning), divided by the probability of the conditioning event (i.e., that the birth is during business hours).


P(morning) / P(business_hours)

0.40340909090909094

The morning event is a joint event that can be described as both during business hours and before noon. A joint event is just an event, but one that is described by the intersection of two other events.

business_hours.where('Hour', are.below(12))

Time Hour Chance

8 a.m. 8 6.3%

9 a.m. 9 5.0%

10 a.m. 10 5.0%

11 a.m. 11 5.0%

In general, the probability of an event B (e.g., before noon) conditioned on an event A (e.g., during business hours) is the probability of the joint event of A and B divided by the probability of the conditioning event A.

P(business_hours.where('Hour', are.below(12))) / P(business_hours)

0.40340909090909094

The standard notation for this relationship uses a vertical bar for conditioning and a comma for a joint event: P(B | A) = P(A, B) / P(A).
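With the numbers computed above, the relationship can be checked directly in plain Python (the variable names are illustrative):

```python
# From the birth-time distribution above:
p_morning = 0.213          # P(business hours and before noon): births from 8 a.m. to noon
p_business_hours = 0.528   # P(business hours): births from 8 a.m. to 6 p.m.

# P(before noon | business hours) = P(business hours, before noon) / P(business hours)
p_given = p_morning / p_business_hours
print(p_given)   # about 0.4034
```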

Joint Distributions

A joint event is one in which two different events both occur. Similarly, a joint outcome is an outcome that is composed of two different outcomes. For example, the outcomes of the two_dice distribution are each joint outcomes of the first and second dice rolls.


two_dice

First Second Chance

1 1 2.8%

1 2 2.8%

1 3 2.8%

1 4 2.8%

1 5 2.8%

1 6 2.8%

2 1 2.8%

2 2 2.8%

2 3 2.8%

2 4 2.8%

... (26 rows omitted)

Another common way to transform distributions is to form a joint distribution from conditional distributions. The definition of a conditional probability provides a formula for computing a joint probability, simply by multiplying both sides of the equation above by P(A): P(A, B) = P(A) * P(B | A).

The chance of each outcome of a joint distribution can be computed from the probability of the first outcome and the conditional probability of the second outcome given the first outcome. In the two_dice table above, the chance of any first outcome is 1/6, and the chance of any second outcome given any first outcome is 1/6, so the chance of any joint outcome is 1/36 or 2.8%.
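The multiplication in the dice case can be written out directly (a trivial sketch, but it is the same rule applied to the birth data later in this section):

```python
# Chance of any particular first roll.
p_first = 1 / 6

# Chance of any particular second roll, given the first; the rolls don't affect each other.
p_second_given_first = 1 / 6

# Multiplication rule: chance of any joint outcome such as (4, 2).
p_joint = p_first * p_second_given_first
print(p_joint)   # 1/36, about 2.8%
```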

U.S. Birth Days

The CDC also published that in 2013, 78.25% of babies were born on weekdays (Monday through Friday) and 21.75% of babies were born on weekends (Saturday and Sunday). Notably, this distribution is not 5/7 (71.4%) to 2/7 (28.6%), but instead favors weekdays.

Moreover, the CDC published the conditional distributions of 2013 birth times, given a weekday birth (green) and given a weekend birth (gray). The black line is the distribution of all births.


All of the proportions used to generate this chart appear in the birth_day table.

birth_day = Table.read_table('birth_time.csv').drop('Chance')

birth_day.set_format([2, 3], PercentFormatter(1))

Time Hour Weekday Weekend

6 a.m. 6 2.7% 3.6%

7 a.m. 7 4.7% 3.8%

8 a.m. 8 6.7% 4.6%

9 a.m. 9 5.1% 5.0%

10 a.m. 10 5.0% 5.0%

11 a.m. 11 5.0% 4.9%

Noon 12 6.3% 5.0%

1 p.m. 13 5.9% 4.7%

2 p.m. 14 5.3% 4.6%

3 p.m. 15 4.9% 4.6%

... (14 rows omitted)

The Weekday and Weekend columns are the chances of two different conditional distributions.


weekday = birth_day.select(['Hour', 'Weekday']).relabeled(1, 'Chance')

weekend = birth_day.select(['Hour', 'Weekend']).relabeled(1, 'Chance')

The joint distribution is created by multiplying each conditional chance column by the corresponding probability of the conditioning event. In this case, we multiply weekday chances by the chance that the baby was born on a weekday (78.25%); likewise, we multiply weekend chances by the chance that the baby was born on a weekend (21.75%).

birth_joint = Table(['Day', 'Hour', 'Chance'])

for row in weekday.rows:
    birth_joint.append(['Weekday', row.item('Hour'), row.item('Chance') * 0.7825])

for row in weekend.rows:
    birth_joint.append(['Weekend', row.item('Hour'), row.item('Chance') * 0.2175])

birth_joint.set_format('Chance', PercentFormatter(1))

Day Hour Chance

Weekday 6 2.1%

Weekday 7 3.7%

Weekday 8 5.2%

Weekday 9 4.0%

Weekday 10 3.9%

Weekday 11 3.9%

Weekday 12 4.9%

Weekday 13 4.6%

Weekday 14 4.1%

Weekday 15 3.8%

... (38 rows omitted)

This joint distribution has 48 joint outcomes, and the total chance sums to 1.

P(birth_joint)

1.0


Now, we can compute the probability of any event that involves days and hours. For example, the chance of a weekday morning birth is quite high.

P(birth_joint.where('Day', 'Weekday').where('Hour', are.between(8, 12)))

0.17058499999999999

We have seen that less than 22% of babies are born on weekends. However, if we know that a baby was born between 5 a.m. and 6 a.m., then the chances that it is a weekend baby are quite a bit higher than 22%.

early_morning = birth_joint.where('Hour', 5)

early_morning

Day Hour Chance

Weekday 5 2.0%

Weekend 5 0.8%

P(given(early_morning).where('Day', 'Weekend'))

0.28584466551063248

Bayes' Rule

In the final example above, we began with a distribution over weekday and weekend births, P(day), along with conditional distributions over hours, P(hour | day). We were able to compute the probability of a weekend birth conditioned on an hour, an outcome in the distribution P(day | hour). Bayes' rule is the formula that expresses the relationship between all of these quantities:

P(A | B) = P(A) * P(B | A) / P(B)


The numerator on the right-hand side is just the joint probability of A and B. Bayes' rule writes this probability in its expanded form because so often we are given these two components and must form the joint distribution through multiplication.

The denominator on the right-hand side is the probability of an event that can be computed from the joint distribution. For example, the early_morning event in the example above is an event just about hours, but it includes all joint outcomes where the correct hour occurs.

Each probability in this equation has a name. The names are derived from the most typical application of Bayes' rule, which is to update one's beliefs based on new evidence.

P(A) is called the prior; it is the probability of event A before any evidence is observed.

P(B | A) is called the likelihood; it is the conditional probability of the evidence event B given the event A.

P(B) is called the evidence; it is the probability of the evidence event B for any outcome.

P(A | B) is called the posterior; it is the probability of event A after evidence event B is observed.

Depending on the joint distribution of A and B, observing some B can make A more or less likely.

Diagnostic Example

In a population, there is a rare disease. Researchers have developed a medical test for the disease. Mostly, the test correctly identifies whether or not the tested person has the disease. But sometimes, the test is wrong. Here are the relevant proportions.

1% of the population has the disease.
If a person has the disease, the test returns the correct result with chance 99%.
If a person does not have the disease, the test returns the correct result with chance 99.5%.

One person is picked at random from the population. Given that the person tests positive, what is the chance that the person has the disease?

We begin by partitioning the population into four categories in the tree diagram below.


By Bayes' Rule, the chance that the person has the disease given that he or she has tested positive is the chance of the top "Test Positive" branch relative to the total chance of the two "Test Positive" branches. The answer is:

# The person is picked at random from the population.
# By Bayes' Rule:
# Chance that the person has the disease, given that test was +
(0.01 * 0.99) / (0.01 * 0.99 + 0.99 * 0.005)

0.6666666666666666

This is 2/3, and it seems rather small. The test has very high accuracy, 99% or higher. Yet is our answer saying that if a patient gets tested and the test result is positive, there is only a 2/3 chance that he or she has the disease?

To understand our answer, it is important to recall the chance model: our calculation is valid for a person picked at random from the population. Among all the people in the population, the people who test positive split into two groups: those who have the disease and test positive, and those who don't have the disease and test positive. The latter group is called the group of false positives. The proportion of true positives is twice as high as that of the false positives (0.0099 compared to 0.00495), and hence the chance of a true positive given a positive test result is 2/3. The chance is affected both by the accuracy of the test and also by the probability that the sampled person has the disease.

The same result can be computed using a table. Below, we begin with the joint distribution over true health and test results.

rare = Table(['Health', 'Test', 'Chance']).with_rows([
    ['Diseased', 'Positive', 0.01 * 0.99],
    ['Diseased', 'Negative', 0.01 * 0.01],
    ['Not Diseased', 'Positive', 0.99 * 0.005],
    ['Not Diseased', 'Negative', 0.99 * 0.995]
])

rare

Health Test Chance

Diseased Positive 0.0099

Diseased Negative 0.0001

Not Diseased Positive 0.00495

Not Diseased Negative 0.98505

The chance that a person selected at random is diseased, given that they tested positive, is computed from the following expression.

positive = rare.where('Test', 'Positive')

P(given(positive).where('Health', 'Diseased'))

0.66666666666666663

But a patient who goes to get tested for a disease is not well modeled as a random member of the population. People get tested because they think they might have the disease, or because their doctor thinks so. In such a case, saying that their chance of having the disease is 1% is not justified; they are not picked at random from the population.

So, while our answer is correct for a random member of the population, it does not answer the question for a person who has walked into a doctor's office to be tested.


To answer the question for such a person, we must first ask ourselves what is the probability that the person has the disease. It is natural to think that it is larger than 1%, as the person has some reason to believe that he or she might have the disease. But how much larger?

This cannot be decided based on relative frequencies. The probability that a particular individual has the disease has to be based on a subjective opinion, and is therefore called a subjective probability. Some researchers insist that all probabilities must be relative frequencies, but subjective probabilities abound. The chance that a candidate wins the next election, the chance that a big earthquake will hit the Bay Area in the next decade, the chance that a particular country wins the next soccer World Cup: none of these are based on relative frequencies or long-run frequencies. Each one contains a subjective element.

It is fine to work with subjective probabilities as long as you keep in mind that there will be a subjective element in your answer. Be aware also that different people can have different subjective probabilities of the same event. For example, the patient's subjective probability that he or she has the disease could be quite different from the doctor's subjective probability of the same event. Here we will work from the patient's point of view.

Suppose the patient assigned a number to his/her degree of uncertainty about whether he/she had the disease, before seeing the test result. This number is the patient's subjective prior probability of having the disease.

If that probability were 10%, then the probabilities on the left side of the tree diagram would change accordingly, with the 0.1 and 0.9 now interpreted as subjective probabilities:

The change has a noticeable effect on the answer, as you can see by running the cell below.


# Subjective prior probability of 10% that the person has the disease
# By Bayes' Rule:
# Chance that the person has the disease, given that test was +
(0.1 * 0.99) / (0.1 * 0.99 + 0.9 * 0.005)

0.9565217391304347

If the patient's prior probability of having the disease is 10%, then after a positive test result the patient must update that probability to over 95%. This updated probability is called a posterior probability. It is calculated after learning the test result.

If the patient's prior probability of having the disease is 50%, then the result changes yet again.


# Subjective prior probability of 50% that the person has the disease
# By Bayes' Rule:
# Chance that the person has the disease, given that test was +
(0.5 * 0.99) / (0.5 * 0.99 + 0.5 * 0.005)

0.9949748743718593

Starting out with a 50-50 subjective chance of having the disease, the patient must update that chance to about 99.5% after getting a positive test result.

Computational Note. In the calculation above, the factor of 0.5 is common to all the terms and cancels out. Hence arithmetically it is the same as a calculation where the prior probabilities are apparently missing:

0.99 / (0.99 + 0.005)

0.9949748743718593

But in fact, they are not missing. They have just canceled out. When the prior probabilities are not all equal, they are all visible in the calculation, as we have seen earlier.
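The whole update can be wrapped in a small function to show how the posterior depends on the prior (a hypothetical helper, using this example's test accuracy numbers):

```python
def posterior(prior, sensitivity=0.99, false_positive_rate=0.005):
    """Chance of having the disease given a positive test, by Bayes' Rule."""
    true_positive = prior * sensitivity
    false_positive = (1 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)

print(posterior(0.01))   # 2/3 for a random member of the population
print(posterior(0.1))    # about 0.9565
print(posterior(0.5))    # about 0.9950
```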

In machine learning applications such as spam detection, Bayes' Rule is used to update probabilities of messages being spam, based on new messages being labeled Spam or Not Spam. You will need more advanced mathematics to carry out all the calculations. But the fundamental method is the same as what you have seen in this section.
