Computational and Inferential Thinking


Table of Contents

Introduction
1  Data Science
  1.1  Introduction
    1.1.1  Computational Tools
    1.1.2  Statistical Techniques
  1.2  Why Data Science?
    1.2.1  Example: Plotting the Classics
  1.3  Causality and Experiments
    1.3.1  Observation and Visualization: John Snow and the Broad Street Pump
    1.3.2  Snow’s “Grand Experiment”
    1.3.3  Establishing Causality
    1.3.4  Randomization
    1.3.5  Endnote
  1.4  Expressions
    1.4.1  Names
    1.4.2  Example: Growth Rates
    1.4.3  Call Expressions
    1.4.4  Data Types
    1.4.5  Comparisons
  1.5  Sequences
    1.5.1  Arrays
    1.5.2  Ranges
  1.6  Tables
    1.6.1  Functions
    1.6.2  Functions and Tables
2  Visualization
  2.1  Bar Charts
  2.2  Histograms
3  Sampling
  3.1  Sampling
  3.2  Iteration
  3.3  Estimation
  3.4  Center
  3.5  Spread
  3.6  The Normal Distribution
  3.7  Exploration: Privacy
4  Prediction
  4.1  Correlation
  4.2  Regression
  4.3  Prediction
  4.4  Higher-Order Functions
  4.5  Errors
  4.6  Multiple Regression
  4.7  Classification
  4.8  Features
5  Inference
  5.1  Confidence Intervals
  5.2  Distance Between Distributions
  5.3  Hypothesis Testing
  5.4  Permutation
  5.5  A/B Testing
  5.6  Regression Inference
  5.7  Regression Diagnostics
6  Probability
  6.1  Conditional Probability

Computational and Inferential Thinking
The Foundations of Data Science

By Ani Adhikari and John DeNero
Contributions by David Wagner

This is the textbook for the Foundations of Data Science class at UC Berkeley.

View this textbook online on Gitbooks.


What is Data Science

Data Science is about drawing useful conclusions from large and diverse data sets through exploration, prediction, and inference. Exploration involves identifying patterns in information. Prediction involves using information we know to make informed guesses about values we wish we knew. Inference involves quantifying our degree of certainty: will those patterns we found also appear in new observations? How accurate are our predictions? Our primary tools for exploration are visualizations, for prediction are machine learning and optimization, and for inference are statistical tests and models.

Statistics is a central component of data science because statistics studies how to make robust conclusions with incomplete information. Computing is a central component because programming allows us to apply analysis techniques to the large and diverse data sets that arise in real-world applications: not just numbers, but text, images, videos, and sensor readings. Data science is all of these things, but it is more than the sum of its parts because of the applications. Through understanding a particular domain, data scientists learn to ask appropriate questions about their data and correctly interpret the answers provided by our inferential and computational tools.


Chapter 1: Introduction

Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference. This text develops a consistent approach to all three, introducing statistical ideas and fundamental ideas in computer science concurrently. We focus on a minimal set of core techniques that apply to a vast range of real-world applications. A foundation in data science requires not only understanding statistical and computational techniques, but also recognizing how they apply to real scenarios.

For whatever aspect of the world we wish to study---whether it's the Earth's weather, the world's markets, political polls, or the human mind---data we collect typically offer an incomplete description of the subject at hand. The central challenge of data science is to make reliable conclusions using this partial information.

In this endeavor, we will combine two essential tools: computation and randomization. For example, we may want to understand climate change trends using temperature observations. Computers will allow us to use all available information to draw conclusions. Rather than focusing only on the average temperature of a region, we will consider the whole range of temperatures together to construct a more nuanced analysis. Randomness will allow us to consider the many different ways in which incomplete information might be completed. Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a way to imagine many possible scenarios that are all consistent with the data we observe.

Applying this approach requires learning to program a computer, and so this text interleaves a complete introduction to programming that assumes no prior knowledge. Readers with programming experience will find that we cover several topics in computation that do not appear in a typical introductory computer science curriculum. Data science also requires careful reasoning about quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, techniques are described to readers in the same language in which they are described to the computers that execute them---a programming language.


Computational Tools

This text uses the Python 3 programming language, along with a standard set of numerical and data visualization tools that are used widely in commercial applications, scientific experiments, and open-source projects. Python has recruited enthusiasts from many professions that use data to draw conclusions. By learning the Python language, you will join a million-person-strong community of software developers and data scientists.

Getting Started. The easiest and recommended way to start writing programs in Python is to log into the companion site for this text, data8.berkeley.edu. If you are enrolled in a course offering that uses this text, you should have full access to the programming environment hosted on that site. The first time you log in, you will find a brief tutorial describing how to proceed.

You are not at all restricted to using this web-based programming environment. A Python program can be executed by any computer, regardless of its manufacturer or operating system, provided that support for the language is installed. If you wish to install the version of Python and its accompanying libraries that will match this text, we recommend the Anaconda distribution that packages together the Python 3 language interpreter, IPython libraries, and the Jupyter notebook environment.

This text includes a complete introduction to all of these computational tools. You will learn to write programs, generate images from data, and work with real-world data sets that are published online.


Statistical Techniques

The discipline of statistics has long addressed the same fundamental challenge as data science: how to draw conclusions about the world using incomplete information. One of the most important contributions of statistics is a consistent and precise vocabulary for describing the relationship between observations and conclusions. This text continues in the same tradition, focusing on a set of core inferential problems from statistics: testing hypotheses, estimating confidence, and predicting unknown quantities.

Data science extends the field of statistics by taking full advantage of computing, data visualization, and access to information. The combination of fast computers and the Internet gives anyone the ability to access and analyze vast datasets: millions of news articles, full encyclopedias, databases for any domain, and massive repositories of music, photos, and video.

Applications to real data sets motivate the statistical techniques that we describe throughout the text. Real data often do not follow regular patterns or match standard equations. The interesting variation in real data can be lost by focusing too much attention on simplistic summaries such as average values. Computers enable a family of methods based on resampling that apply to a wide range of different inference problems, take into account all available information, and require few assumptions or conditions. Although these techniques have often been reserved for graduate courses in statistics, their flexibility and simplicity are a natural fit for data science applications.


Why Data Science?

Most important decisions are made with only partial information and uncertain outcomes. However, the degree of uncertainty for many decisions can be reduced sharply by public access to large data sets and the computational tools required to analyze them effectively. Data-driven decision making has already transformed a tremendous breadth of industries, including finance, advertising, manufacturing, and real estate. At the same time, a wide range of academic disciplines are evolving rapidly to incorporate large-scale data analysis into their theory and practice.

Studying data science enables individuals to bring these techniques to bear on their work, their scientific endeavors, and their personal decisions. Critical thinking has long been a hallmark of a rigorous education, but critiques are often most effective when supported by data. A critical analysis of any aspect of the world, be it business or social science, involves inductive reasoning---conclusions can rarely be proven outright, only supported by the available evidence. Data science provides the means to make precise, reliable, and quantitative arguments about any set of observations. With unprecedented access to information and computing, critical thinking about any aspect of the world that can be measured would be incomplete without the inferential techniques that are core to data science.

The world has too many unanswered questions and difficult challenges to leave this critical reasoning to only a few specialists. However, all educated adults can build the capacity to reason about data. The tools, techniques, and data sets are all readily available; this text aims to make them accessible to everyone.


Example: Plotting the Classics

In this example, we will explore statistics for two classic novels: The Adventures of Huckleberry Finn by Mark Twain, and Little Women by Louisa May Alcott. The text of any book can be read by a computer at great speed. Books published before 1923 are currently in the public domain, meaning that everyone has the right to copy or use the text in any way. Project Gutenberg is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.

The features of Python used in this example will be explained in detail later in the course. This example is meant to illustrate some of the broad themes of this text. Don't worry if the details of the program don't yet make sense. Instead, focus on interpreting the images generated below. The "Expressions" section later in this chapter will describe most of the features of the Python programming language used below.

First, we read the text of both books into lists of chapters, called huck_finn_chapters and little_women_chapters. In Python, a name cannot contain any spaces, and so we will often use an underscore _ to stand in for a space. The = in the lines below gives a name on the left to the result of some computation described on the right. A uniform resource locator or URL is an address on the Internet for some content; in this case, the text of a book.

# Read two books, fast!
huck_finn_url = 'http://www.gutenberg.org/cache/epub/76/pg76.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER')[44:]

little_women_url = 'http://www.gutenberg.org/cache/epub/514/pg514.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER')[1:]
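
The read_url function above is a helper from the book's companion environment, not a Python built-in. A minimal stand-in, assuming plain urllib and that the whole text fits in memory, might look like this:

```python
from urllib.request import urlopen

# Hypothetical stand-in for the book's read_url helper (not the
# authors' actual implementation): fetch a URL and return its
# contents as decoded text.
def read_url(url):
    return urlopen(url).read().decode('utf-8', errors='ignore')
```

Splitting the returned string on the word 'CHAPTER', as above, then yields the list of chapters.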

While a computer cannot understand the text of a book, we can still use it to provide us with some insight into the structure of the text. The name huck_finn_chapters is currently bound to a list of all the chapters in the book. We can place those chapters into a table to see how each begins.


Table([huck_finn_chapters], ['Chapters'])

Chapters
I. YOU don't know about me without you have read a book...
II. WE went tiptoeing along a path amongst the trees bac...
III. WELL, I got a good going-over in the morning from o...
IV. WELL, three or four months run along, and it was wel...
V. I had shut the door to. Then I turned around and ther...
VI. WELL, pretty soon the old man was up and around agai...
VII. "GIT up! What you 'bout?" I opened my eyes and look...
VIII. THE sun was up so high when I waked that I judged...
IX. I wanted to go and look at a place right about the m...
X. AFTER breakfast I wanted to talk about the dead man a...
... (33 rows omitted)

Each chapter begins with a chapter number in Roman numerals, followed by the first sentence of the chapter. Project Gutenberg has printed the first word of each chapter in uppercase.

The book describes a journey that Huck and Jim take along the Mississippi River. Tom Sawyer joins them towards the end as the action heats up. We can quickly visualize how many times these characters have each been mentioned at any point in the book.

counts = Table([np.char.count(huck_finn_chapters, "Jim"),
                np.char.count(huck_finn_chapters, "Tom"),
                np.char.count(huck_finn_chapters, "Huck")],
               ["Jim", "Tom", "Huck"])
counts.cumsum().plot(overlay=True)
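
The counting step can be sketched on a toy array (invented "chapters", not the novel's text):

```python
import numpy as np

# Count occurrences of a name in each chapter, then accumulate the
# counts; the cumulative totals are what the plot above displays.
chapters = np.array(["Jim and Huck", "Jim, Jim, and Tom", "Tom alone"])
per_chapter = np.char.count(chapters, "Jim")
running_total = np.cumsum(per_chapter)
print(per_chapter)     # [1 2 0]
print(running_total)   # [1 3 3]
```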


In the plot above, the horizontal axis shows chapter numbers and the vertical axis shows how many times each character has been mentioned so far. You can see that Jim is a central character by the large number of times his name appears. Notice how Tom is hardly mentioned for much of the book until after Chapter 30. His curve and Jim's rise sharply at that point, as the action involving both of them intensifies. As for Huck, his name hardly appears at all, because he is the narrator.

Little Women is a story of four sisters growing up together during the Civil War. In this book, chapter numbers are spelled out and chapter titles are written in all capital letters.

Table([little_women_chapters], ['Chapters'])

Chapters
ONE PLAYING PILGRIMS "Christmas won't be Christmas witho...
TWO A MERRY CHRISTMAS Jo was the first to wake in the gr...
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me...
FOUR BURDENS "Oh, dear, how hard it does seem to take up...
FIVE BEING NEIGHBORLY "What in the world are you going t...
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr...
SEVEN AMY'S VALLEY OF HUMILIATION "That boy is a perfect...
EIGHT JO MEETS APOLLYON "Girls, where are you going?" as...
NINE MEG GOES TO VANITY FAIR "I do think it was the most...
TEN THE P.C. AND P.O. As spring came on, a new set of am...
... (37 rows omitted)


We can track the mentions of main characters to learn about the plot of this book as well. The protagonist Jo interacts with her sisters Meg, Beth, and Amy regularly, up until Chapter 27 when she moves to New York alone.

people = ["Meg", "Jo", "Beth", "Amy", "Laurie"]
people_counts = Table([np.char.count(little_women_chapters, pp) for pp in people], people)
people_counts.cumsum().plot(overlay=True)

Laurie is a young man who marries one of the girls in the end. See if you can use the plots to guess which one.

In addition to visualizing data to find patterns, data science is often concerned with whether two similar patterns are really the same or different. In the context of novels, the word "character" has a second meaning: a printed symbol such as a letter or number. Below, we compare the counts of this type of character in the first chapter of Little Women.

from collections import Counter

chapter_one = little_women_chapters[0]
counts = Counter(chapter_one.lower())
letters = Table([counts.keys(), counts.values()], ['Letters', 'Chapter 1 Count'])
letters.sort('Letters').show(20)
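
Counter itself can be sketched on a short string (a toy example, not the chapter text):

```python
from collections import Counter

# Tally every character in a lowercased string; Counter maps each
# character to the number of times it appears.
tally = Counter("Banana splits!".lower())
print(tally['a'], tally['n'], tally['s'], tally['!'])  # 3 2 2 1
```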


Letters   Chapter 1 Count
          4102
!         28
"         190
'         125
,         423
-         21
.         189
;         4
?         16
_         6
a         1352
b         290
c         366
d         788
e         1981
f         339
g         446
h         1069
i         1071
j         51
... (16 rows omitted)

How do these counts compare with the character counts in Chapter 2? By plotting both sets of counts together, we can see slight variations for every character. Is that difference meaningful? How sure can we be? Such questions will be addressed precisely in this text.

counts = Counter(little_women_chapters[1].lower())
two_letters = Table([counts.keys(), counts.values()], ['Letters', 'Chapter 2 Count'])
compare = letters.join('Letters', two_letters)
compare.barh('Letters', overlay=True)
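
The join step pairs the two sets of counts on their shared characters; a rough dictionary-based sketch (toy strings, not the chapters):

```python
from collections import Counter

# Compare character counts from two short texts by aligning them on
# the characters they share, much as the table join above does.
one = Counter("abca")
two = Counter("abb")
shared = sorted(set(one) & set(two))
side_by_side = {ch: (one[ch], two[ch]) for ch in shared}
print(side_by_side)  # {'a': (2, 1), 'b': (1, 2)}
```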


In some cases, the relationships between quantities allow us to make predictions. This text will explore how to make accurate predictions based on incomplete information and develop methods for combining multiple sources of uncertain information to make decisions.

As an example, let us first use the computer to get us some information that would be really tedious to acquire by hand: the number of characters and the number of periods in each chapter of both books.

chlen_per_hf = Table([[len(s) for s in huck_finn_chapters],
                      np.char.count(huck_finn_chapters, '.')],
                     ["HF Chapter Length", "Number of periods"])

chlen_per_lw = Table([[len(s) for s in little_women_chapters],
                      np.char.count(little_women_chapters, '.')],
                     ["LW Chapter Length", "Number of periods"])
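
The two quantities being tabulated can be sketched on toy chapters (invented strings, not the novels):

```python
# For each "chapter", record its length in characters and its
# number of periods: the two columns tabulated above.
chapters = ["One. Two.", "Only one sentence here."]
lengths = [len(s) for s in chapters]
periods = [s.count('.') for s in chapters]
print(lengths)  # [9, 23]
print(periods)  # [2, 1]
```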

Here are the data for Huckleberry Finn. Each row of the table below corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods, in general; the shorter the chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths and can involve other punctuation such as question marks.

chlen_per_hf

HF Chapter Length   Number of periods
7026                66
11982               117
8529                72
6799                84
8166                91
14550               125
13218               127
22208               249
8081                71
7036                70
... (33 rows omitted)


Here are the corresponding data for Little Women.

chlen_per_lw

LW Chapter Length   Number of periods
21759               189
22148               188
20558               231
25526               195
23395               255
14622               140
14431               131
22476               214
33767               337
18508               185
... (37 rows omitted)

You can see that the chapters of Little Women are in general longer than those of Huckleberry Finn. Let us see if these two simple variables – the length and number of periods in each chapter – can tell us anything more about the two authors. One way for us to do this is to plot both sets of data on the same axes.

In the plot below, there is a dot for each chapter in each book. Blue dots correspond to Huckleberry Finn and yellow dots to Little Women. The horizontal axis represents the number of characters and the vertical axis represents the number of periods.

plots.figure(figsize=(10, 10))
plots.scatter([len(s) for s in huck_finn_chapters],
              np.char.count(huck_finn_chapters, '.'))
plots.scatter([len(s) for s in little_women_chapters],
              np.char.count(little_women_chapters, '.'))

<matplotlib.collections.PathCollection at 0x109ebee10>


The plot shows us that many but not all of the chapters of Little Women are longer than those of Huckleberry Finn, as we had observed by just looking at the numbers. But it also shows us something more. Notice how the blue points are roughly clustered around a straight line, as are the yellow points. Moreover, it looks as though both colors of points might be clustered around the same straight line.

Indeed, it appears from looking at the plot that on average both books tend to have somewhere between 100 and 150 characters between periods. Perhaps these two great 19th century novels were signaling something so very familiar to us now: the 140-character limit of Twitter.
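
The "100 to 150 characters between periods" estimate is just chapter length divided by period count. Using the first rows of the two tables above:

```python
# Average characters per sentence for one chapter of each book,
# computed from figures in the tables above.
hf_chars_per_period = 7026 / 66
lw_chars_per_period = 21759 / 189
print(round(hf_chars_per_period))  # 106
print(round(lw_chars_per_period))  # 115
```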


Causality and Experiments

"These problems are, and will probably ever remain, among the inscrutable secrets of nature. They belong to a class of questions radically inaccessible to the human intelligence."
—The Times of London, September 1849, on how cholera is contracted and spread

Does the death penalty have a deterrent effect? Is chocolate good for you? What causes breast cancer?

All of these questions attempt to assign a cause to an effect. A careful examination of data can help shed light on questions like these. In this section you will learn some of the fundamental concepts involved in establishing causality.

Observation is a key to good science. An observational study is one in which scientists make conclusions based on data that they have observed but had no hand in generating. In data science, many such studies involve observations on a group of individuals, a factor of interest called a treatment, and an outcome measured on each individual.

It is easiest to think of the individuals as people. In a study of whether chocolate is good for the health, the individuals would indeed be people, the treatment would be eating chocolate, and the outcome might be a measure of blood pressure. But individuals in observational studies need not be people. In a study of whether the death penalty has a deterrent effect, the individuals could be the 50 states of the union. A state law allowing the death penalty would be the treatment, and an outcome could be the state’s murder rate.

The fundamental question is whether the treatment has an effect on the outcome. Any relation between the treatment and the outcome is called an association. If the treatment causes the outcome to occur, then the association is causal. Causality is at the heart of all three questions posed at the start of this section. For example, one of the questions was whether chocolate directly causes improvements in health, not just whether there is a relation between chocolate and health.

The establishment of causality often takes place in two stages. First, an association is observed. Next, a more careful analysis leads to a decision about causality.


Observation and Visualization: John Snow and the Broad Street Pump

One of the earliest examples of astute observation eventually leading to the establishment of causality dates back more than 150 years. To get your mind into the right time frame, try to imagine London in the 1850’s. It was the world’s wealthiest city but many of its people were desperately poor. Charles Dickens, then at the height of his fame, was writing about their plight. Disease was rife in the poorer parts of the city, and cholera was among the most feared. It was not yet known that germs cause disease; the leading theory was that “miasmas” were the main culprit. Miasmas manifested themselves as bad smells, and were thought to be invisible poisonous particles arising out of decaying matter. Parts of London did smell very bad, especially in hot weather. To protect themselves against infection, those who could afford to held sweet-smelling things to their noses.

For several years, a doctor by the name of John Snow had been following the devastating waves of cholera that hit England from time to time. The disease arrived suddenly and was almost immediately deadly: people died within a day or two of contracting it, hundreds could die in a week, and the total death toll in a single wave could reach tens of thousands. Snow was skeptical of the miasma theory. He had noticed that while entire households were wiped out by cholera, the people in neighboring houses sometimes remained completely unaffected. As they were breathing the same air—and miasmas—as their neighbors, there was no compelling association between bad smells and the incidence of cholera.

Snow had also noticed that the onset of the disease almost always involved vomiting and diarrhea. He therefore believed that the infection was carried by something people ate or drank, not by the air that they breathed. His prime suspect was water contaminated by sewage.

At the end of August 1854, cholera struck in the overcrowded Soho district of London. As the deaths mounted, Snow recorded them diligently, using a method that went on to become standard in the study of how diseases spread: he drew a map. On a street map of the district, he recorded the location of each death.

Here is Snow’s original map. Each black bar represents one death. The black discs mark the locations of water pumps. The map displays a striking revelation – the deaths are roughly clustered around the Broad Street pump.


Snow studied his map carefully and investigated the apparent anomalies. All of them implicated the Broad Street pump. For example:

- There were deaths in houses that were nearer the Rupert Street pump than the Broad Street pump. Though the Rupert Street pump was closer as the crow flies, it was less convenient to get to because of dead ends and the layout of the streets. The residents in those houses used the Broad Street pump instead.
- There were no deaths in two blocks just east of the pump. That was the location of the Lion Brewery, where the workers drank what they brewed. If they wanted water, the brewery had its own well.
- There were scattered deaths in houses several blocks away from the Broad Street pump. Those were children who drank from the Broad Street pump on their way to school. The pump’s water was known to be cool and refreshing.

The final piece of evidence in support of Snow’s theory was provided by two isolated deaths in the leafy and genteel Hampstead area, quite far from Soho. Snow was puzzled by these until he learned that the deceased were Mrs. Susannah Eley, who had once lived in Broad Street, and her niece. Mrs. Eley had water from the Broad Street pump delivered to her in Hampstead every day. She liked its taste.

Later it was discovered that a cesspit that was just a few feet away from the well of the Broad Street pump had been leaking into the well. Thus the pump’s water was contaminated by sewage from the houses of cholera victims.

Snow used his map to convince local authorities to remove the handle of the Broad Street pump. Though the cholera epidemic was already on the wane when he did so, it is possible that the disabling of the pump prevented many deaths from future waves of the disease.

The removal of the Broad Street pump handle has become the stuff of legend. At the Centers for Disease Control (CDC) in Atlanta, when scientists look for simple answers to questions about epidemics, they sometimes ask each other, “Where is the handle to this pump?”

Snow’s map is one of the earliest and most powerful uses of data visualization. Disease maps of various kinds are now a standard tool for tracking epidemics.

Towards Causality

Though the map gave Snow a strong indication that the cleanliness of the water supply was the key to controlling cholera, he was still a long way from a convincing scientific argument that contaminated water was causing the spread of the disease. To make a more compelling case, he had to use the method of comparison.

Scientists use comparison to identify an association between a treatment and an outcome. They compare the outcomes of a group of individuals who got the treatment (the treatment group) to the outcomes of a group who did not (the control group). For example, researchers today might compare the average murder rate in states that have the death penalty with the average murder rate in states that don’t.

If the results are different, that is evidence for an association. To determine causation, however, even more care is needed.


Snow’s “Grand Experiment”

Encouraged by what he had learned in Soho, Snow completed a more thorough analysis of cholera deaths. For some time, he had been gathering data on cholera deaths in an area of London that was served by two water companies. The Lambeth water company drew its water upriver from where sewage was discharged into the River Thames. Its water was relatively clean. But the Southwark and Vauxhall (S&V) company drew its water below the sewage discharge, and thus its supply was contaminated.

Snow noticed that there was no systematic difference between the people who were supplied by S&V and those supplied by Lambeth. “Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies … there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded …”

The only difference was in the water supply, “one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from impurity.”

Confident that he would be able to arrive at a clear conclusion, Snow summarized his data in the table below.

Supply Area      Number of houses   cholera deaths   deaths per 10,000 houses
S&V              40,046             1,263            315
Lambeth          26,107             98               37
Rest of London   256,423            1,422            59

The numbers pointed accusingly at S&V. The death rate from cholera in the S&V houses was almost ten times the rate in the houses supplied by Lambeth.
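
The "almost ten times" claim can be checked directly from the rate column of Snow's table:

```python
# Ratio of S&V's cholera death rate to Lambeth's, using the
# per-10,000-house rates from Snow's table above.
ratio = 315 / 37
print(round(ratio, 1))  # 8.5
```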


Establishing Causality

In the language developed earlier in the section, you can think of the people in the S&V houses as the treatment group, and those in the Lambeth houses as the control group. A crucial element in Snow’s analysis was that the people in the two groups were comparable to each other, apart from the treatment.

In order to establish whether it was the water supply that was causing cholera, Snow had to compare two groups that were similar to each other in all but one aspect – their water supply. Only then would he be able to ascribe the differences in their outcomes to the water supply. If the two groups had been different in some other way as well, it would have been difficult to point the finger at the water supply as the source of the disease. For example, if the treatment group consisted of factory workers and the control group did not, then differences between the outcomes in the two groups could have been due to the water supply, or to factory work, or both, or to any other characteristic that made the groups different from each other. The final picture would have been much more fuzzy.

Snow’s brilliance lay in identifying two groups that would make his comparison clear. He had set out to establish a causal relation between contaminated water and cholera infection, and to a great extent he succeeded, even though the miasmatists ignored and even ridiculed him. Of course, Snow did not understand the detailed mechanism by which humans contract cholera. That discovery was made in 1883, when the German scientist Robert Koch isolated the Vibrio cholerae, the bacterium that enters the human small intestine and causes cholera.

In fact the Vibrio cholerae had been identified in 1854 by Filippo Pacini in Italy, just about when Snow was analyzing his data in London. Because of the dominance of the miasmatists in Italy, Pacini’s discovery languished unknown. But by the end of the 1800’s, the miasma brigade was in retreat. Subsequent history has vindicated Pacini and John Snow. Snow’s methods led to the development of the field of epidemiology, which is the study of the spread of diseases.

Confounding

Let us now return to more modern times, armed with an important lesson that we have learned along the way:

In an observational study, if the treatment and control groups differ in ways other than the treatment, it is difficult to make conclusions about causality.

An underlying difference between the two groups (other than the treatment) is called a confounding factor, because it might confound you (that is, mess you up) when you try to reach a conclusion.


Example: Coffee and lung cancer. Studies in the 1960’s showed that coffee drinkers had higher rates of lung cancer than those who did not drink coffee. Because of this, some people identified coffee as a cause of lung cancer. But coffee does not cause lung cancer. The analysis contained a confounding factor – smoking. In those days, coffee drinkers were also likely to have been smokers, and smoking does cause lung cancer. Coffee drinking was associated with lung cancer, but it did not cause the disease.

Confounding factors are common in observational studies. Good studies take great care to reduce confounding.


Randomization

An excellent way to avoid confounding is to assign individuals to the treatment and control groups at random, and then administer the treatment to those who were assigned to the treatment group. Randomization keeps the two groups similar apart from the treatment.

If you are able to randomize individuals into the treatment and control groups, you are running a randomized controlled experiment, also known as a randomized controlled trial (RCT). Sometimes, people’s responses in an experiment are influenced by their knowing which group they are in. So you might want to run a blind experiment in which individuals do not know whether they are in the treatment group or the control group. To make this work, you will have to give the control group a placebo, which is something that looks exactly like the treatment but in fact has no effect.
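
Random assignment itself is simple to sketch; a minimal version (hypothetical individual IDs, not data from any study) using Python's random module:

```python
import random

# Randomly split eight hypothetical individuals into equal-sized
# treatment and control groups: the core step of an RCT.
random.seed(8)  # fixed seed so the split is reproducible
individuals = list(range(8))
random.shuffle(individuals)
treatment = individuals[:4]
control = individuals[4:]
```

Because the assignment is random, any systematic difference between the two groups, other than the treatment, is due only to chance.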

Randomized controlled experiments have long been a gold standard in the medical field, for example in establishing whether a new drug works. They are also becoming more commonly used in other fields such as economics.

Example: Welfare subsidies in Mexico. In Mexican villages in the 1990’s, children in poor families were often not enrolled in school. One of the reasons was that the older children could go to work and thus help support the family. Santiago Levy, a minister in the Mexican Ministry of Finance, set out to investigate whether welfare programs could be used to increase school enrollment and improve health conditions. He conducted an RCT on a set of villages, selecting some of them at random to receive a new welfare program called PROGRESA. The program gave money to poor families if their children went to school regularly and the family used preventive health care. More money was given if the children were in secondary school than in primary school, to compensate for the children’s lost wages, and more money was given for girls attending school than for boys. The remaining villages did not get this treatment, and formed the control group. Because of the randomization, there were no confounding factors and it was possible to establish that PROGRESA increased school enrollment. For boys, the enrollment increased from 73% in the control group to 77% in the PROGRESA group. For girls, the increase was even greater, from 67% in the control group to almost 75% in the PROGRESA group. Due to the success of this experiment, the Mexican government supported the program under the new name OPORTUNIDADES, as an investment in a healthy and well educated population.

In some situations it might not be possible to carry out a randomized controlled experiment, even when the aim is to investigate causality. For example, suppose you want to study the effects of alcohol consumption during pregnancy, and you randomly assign some pregnant women to your “alcohol” group. You should not expect cooperation from them if you present them with a drink. In such situations you will almost invariably be conducting an observational study, not an experiment. Be alert for confounding factors.


Endnote

In the terminology that we have developed, John Snow conducted an observational study, not a randomized experiment. But he called his study a "grand experiment" because, as he wrote, "No fewer than three hundred thousand people … were divided into two groups without their choice, and in most cases, without their knowledge …"

Studies such as Snow's are sometimes called "natural experiments." However, true randomization does not simply mean that the treatment and control groups are selected "without their choice."

The method of randomization can be as simple as tossing a coin. It may also be quite a bit more complex. But every method of randomization consists of a sequence of carefully defined steps that allow chances to be specified mathematically. This has two important consequences.

1. It allows us to account, mathematically, for the possibility that randomization produces treatment and control groups that are quite different from each other.

2. It allows us to make precise mathematical statements about differences between the treatment and control groups. This in turn helps us make justifiable conclusions about whether the treatment has any effect.

In this course, you will learn how to conduct and analyze your own randomized experiments. That will involve more detail than has been presented in this section. For now, just focus on the main idea: to try to establish causality, run a randomized controlled experiment if possible. If you are conducting an observational study, you might be able to establish association but not causation. Be extremely careful about confounding factors before making conclusions about causality based on an observational study.

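The coin-toss idea can be sketched in code. This is an illustration of our own (the participant labels and group sizes are made up), using the numpy package that is introduced later in the text:

```python
import numpy as np

# Sketch: randomize 20 hypothetical participants into two groups of 10.
# The seed is fixed only so that the example is reproducible.
rng = np.random.default_rng(0)

participants = np.arange(20)
shuffled = rng.permutation(participants)

treatment = shuffled[:10]
control = shuffled[10:]

print(sorted(treatment))
print(sorted(control))
```

Every participant lands in exactly one group, and chance alone decides which one; that is what makes the mathematical accounting described above possible.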
Terminology

observational study
treatment
outcome
association
causal association
causality
comparison
treatment group
control group
epidemiology
confounding
randomization
randomized controlled experiment
randomized controlled trial (RCT)
blind
placebo

Fun facts

1. John Snow is sometimes called the father of epidemiology, but he was an anesthesiologist by profession. One of his patients was Queen Victoria, who was an early recipient of anesthetics during childbirth.

2. Florence Nightingale, the originator of modern nursing practices and famous for her work in the Crimean War, was a die-hard miasmatist. She had no time for theories about contagion and germs, and was not one for mincing her words. "There is no end to the absurdities connected with this doctrine," she said. "Suffice it to say that in the ordinary sense of the word, there is no proof such as would be admitted in any scientific enquiry that there is any such thing as contagion."

3. A later RCT established that the conditions on which PROGRESA insisted (children going to school, preventive health care) were not necessary to achieve increased enrollment. Just the financial boost of the welfare payments was sufficient.

Good reads

The Strange Case of the Broad Street Pump: John Snow and the Mystery of Cholera by Sandra Hempel, published by our own University of California Press, reads like a whodunit. It was one of the main sources for this section's account of John Snow and his work. A word of warning: some of the contents of the book are stomach-churning.

Poor Economics, the bestseller by Abhijit V. Banerjee and Esther Duflo of MIT, is an accessible and lively account of ways to fight global poverty. It includes numerous examples of RCTs, including the PROGRESA example in this section.


Expressions

Programming can dramatically improve our ability to collect and analyze information about the world, which in turn can lead to discoveries through the kind of careful reasoning demonstrated in the previous section. In data science, the purpose of writing a program is to instruct a computer to carry out the steps of an analysis. Computers cannot study the world on their own. People must describe precisely what steps the computer should take in order to collect and analyze data, and those steps are expressed through programs.

To learn to program, we will begin by learning some of the grammar rules of a programming language. Programs are made up of expressions, which describe to the computer how to combine pieces of data. For example, a multiplication expression consists of a * symbol between two numerical expressions. Expressions, such as 3 * 4, are evaluated by the computer. The value (the result of evaluation) of the last expression in each cell, 12 in this case, is displayed below the cell.

3 * 4

12

The rules of a programming language are rigid. In Python, the * symbol cannot appear twice in a row. The computer will not try to interpret an expression that differs from its prescribed expression structures. Instead, it will show a SyntaxError error. The Syntax of a language is its set of grammar rules, and a SyntaxError indicates that some rule has been violated.

3 * * 4

  File "<ipython-input-4-d90564f70db7>", line 1

    3 * * 4

        ^

SyntaxError: invalid syntax


Small changes to an expression can change its meaning entirely. Below, the space between the *'s has been removed. Because ** appears between two numerical expressions, the expression is a well-formed exponentiation expression (the first number raised to the power of the second: 3 times 3 times 3 times 3). The symbols * and ** are called operators, and the values they combine are called operands.

3 ** 4

81

Common Operators. Data science often involves combining numerical values, and the set of operators in a programming language are designed so that expressions can be used to express any sort of arithmetic. In Python, the following operators are essential.

Expression Type    Operator    Example     Value

Addition           +           2 + 3       5
Subtraction        -           2 - 3       -1
Multiplication     *           2 * 3       6
Division           /           7 / 3       2.66667
Remainder          %           7 % 3       1
Exponentiation     **          2 ** 0.5    1.41421

Python expressions obey the same familiar rules of precedence as in algebra: multiplication and division occur before addition and subtraction. Parentheses can be used to group together smaller expressions within a larger expression.

1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8

16.555555555555557

1 + 2 * ((3 * 4 * 5 / 6) ** 3) + 7 + 8

2016.0


This chapter introduces many types of expressions. Learning to program involves trying out everything you learn in combination, investigating the behavior of the computer. What happens if you divide by zero? What happens if you divide twice in a row? You don't always need to ask an expert (or the Internet); many of these details can be discovered by trying them out yourself.

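For instance, trying the first question directly shows that Python answers with an error rather than a value (a quick illustration of our own, not from the text):

```python
# Dividing by zero raises a ZeroDivisionError instead of producing a value.
try:
    1 / 0
except ZeroDivisionError as err:
    print(err)  # division by zero
```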

Names

Names are given to values in Python using an assignment expression. In an assignment, a name is followed by =, which is followed by another expression. The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions.

a = 10

b = 20

a + b

30

A previously assigned name can be used in the expression to the right of =.

quarter = 2

whole = 4 * quarter

whole

8

However, only the current value of an expression is assigned to a name. If that value changes later, names that were defined in terms of that value will not change automatically.

quarter = 4

whole

8

Names must start with a letter, but can contain both letters and numbers. A name cannot contain a space; instead, it is common to use an underscore character _ to replace each space. Names are only as useful as you make them; it's up to the programmer to choose names that are easy to interpret. Typically, more meaningful names can be invented than a and b. For example, to describe the sales tax on a $5 purchase in Berkeley, CA, the following names clarify the meaning of the various quantities involved.

purchase_price = 5

state_tax_rate = 0.075

county_tax_rate = 0.02

city_tax_rate = 0

sales_tax_rate = state_tax_rate + county_tax_rate + city_tax_rate

sales_tax = purchase_price * sales_tax_rate

sales_tax

0.475


Example: Growth Rates

The relationship between two measurements of the same quantity taken at different times is often expressed as a growth rate. For example, the United States federal government employed 2,766,000 people in 2002 and 2,814,000 people in 2012 (http://www.bls.gov/opub/mlr/2013/article/industry-employment-and-output-projections-to-2022-1.htm). To compute a growth rate, we must first decide which value to treat as the initial amount. For values over time, the earlier value is a natural choice. Then, we divide the difference between the changed and initial amounts by the initial amount.

initial = 2766000

changed = 2814000

(changed - initial) / initial

0.01735357917570499

It is more typical to subtract one from the ratio of the two measurements, which yields the same value.

(changed / initial) - 1

0.017353579175704903

This value is the growth rate over 10 years. A useful property of growth rates is that they don't change even if the values are expressed in different units. So, for example, we can express the same relationship between thousands of people in 2002 and 2012.

initial = 2766

changed = 2814

(changed / initial) - 1

0.017353579175704903


In 10 years, the number of employees of the US Federal Government has increased by only 1.74%. In that time, the total expenditures of the US Federal Government increased from $2.37 trillion to $3.38 trillion in 2012.

initial = 2.37

changed = 3.38

(changed / initial) - 1

0.4261603375527425

A 42.6% increase in the federal budget is much larger than the 1.74% increase in federal employees. In fact, the number of federal employees has grown much more slowly than the population of the United States, which increased 9.21% in the same time period from 287.6 million people in 2002 to 314.1 million in 2012.

initial = 287.6

changed = 314.1

(changed / initial) - 1

0.09214186369958277

A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3 million in 2002 to 11.9 million in 2012, a -22.2% growth rate.

initial = 15.3

changed = 11.9

(changed / initial) - 1

-0.2222222222222222

An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives a much larger ten-year growth rate of 0.41 (or 41%).


1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 - 1

0.410598760621121

This same computation can be expressed using names and exponents for clarity.

annual_growth_rate = 0.035

ten_year_growth_rate = (1 + annual_growth_rate) ** 10 - 1

ten_year_growth_rate

0.410598760621121

Likewise, a ten-year growth rate can be used to compute an equivalent annual growth rate. Below, t is the number of years that have passed between measurements. The following computes the annual growth rate of federal expenditures over the last 10 years.

initial = 2.37

changed = 3.38

t = 10

(changed / initial) ** (1/t) - 1

0.03613617208346853

The total growth over 10 years is equivalent to a 3.6% increase each year.

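The computation above follows a pattern that can be captured once and reused. The function below is a sketch of our own (functions are introduced later in the text), and the name annual_rate is our choice:

```python
def annual_rate(initial, changed, t):
    """The annual growth rate equivalent to the total growth
    from initial to changed over t years."""
    return (changed / initial) ** (1 / t) - 1

# Federal expenditures, as above: about a 3.6% increase each year.
print(annual_rate(2.37, 3.38, 10))
```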

Call Expressions

Call expressions invoke functions, which are named operations. The name of the function appears first, followed by expressions in parentheses.

abs(-12)

12

round(5 - 1.3)

4

max(2, 2 + 3, 4)

5

In this last example, the max function is called on three arguments: 2, 5 (the value of 2 + 3), and 4. The value of each expression within parentheses is passed to the function, and the function returns the final value of the full call expression. The max function can take any number of arguments and returns the maximum.

A few functions are available by default, such as abs and round, but most functions that are built into the Python language are stored in a collection of functions called a module. An import statement is used to provide access to a module, such as math or operator.

import math

import operator

math.sqrt(operator.add(4, 5))

3.0


An equivalent expression could be expressed using the + and ** operators instead.

(4 + 5) ** 0.5

3.0

Operators and call expressions can be used together in an expression. The percent difference between two values is used to compare values for which neither one is obviously initial or changed. For example, in 2014 Florida farms produced 2.72 billion eggs while Iowa farms produced 16.25 billion eggs (http://quickstats.nass.usda.gov/). The percent difference is 100 times the absolute value of the difference between the values, divided by their average. In this case, the difference is larger than the average, and so the percent difference is greater than 100.

florida = 2.72

iowa = 16.25

100 * abs(florida - iowa) / ((florida + iowa) / 2)

142.6462836056932

Learning how different functions behave is an important part of learning a programming language. A Jupyter notebook can assist in remembering the names and effects of different functions. When editing a code cell, press the tab key after typing the beginning of a name to bring up a list of ways to complete that name. For example, press tab after math. to see all of the functions available in the math module. Typing will narrow down the list of options. To learn more about a function, place a ? after its name. For example, typing math.log? will bring up a description of the log function in the math module.

math.log?

log(x[, base])

Return the logarithm of x to the given base.

If the base not specified, returns the natural logarithm (base e) of x.

The square brackets in the example call indicate that an argument is optional. That is, log can be called with either one or two arguments.


math.log(16, 2)

4.0

math.log(16) / math.log(2)

4.0


Data Types

Python includes several types of values that will allow us to compute about more than just numbers.

Type Name                Python Class    Example 1        Example 2

Integer                  int             1                -2
Floating-Point Number    float           1.5              -2.0
Boolean                  bool            True             False
String                   str             'hello world'    "False"

The first two types represent numbers. A "floating-point" number describes its representation by a computer, which is a topic for another text. Practically, the difference between an integer and a floating-point number is that the latter can represent fractions in addition to whole values.

Boolean values, named for the logician George Boole, represent truth and take only two possible values: True and False.

Strings

Strings are the most flexible type of all. They can contain arbitrary text. A string can describe a number or a truth value, as well as any word or sentence. Much of the world's data is text, and text is represented in computers as strings.

The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings together produces another string. This expression is still an addition expression, but it is combining a different type of value.

"data"+"science"

'datascience'

Addition is completely literal; it combines these two strings together without regard for their contents. It doesn't add a space because these are different words; that's up to the programmer (you) to specify.


"data"+"sc"+"ienc"+"e"

'datascience'

Single and double quotes can both be used to create strings: 'hi' and "hi" are identical expressions. Double quotes are often preferred because they allow you to include apostrophes inside of strings.

"Thiswon'tworkwithasingle-quotedstring!"

"Thiswon'tworkwithasingle-quotedstring!"

Why not? Try it out.

The str function returns a string representation of any value. Using this function, strings can be constructed that have embedded values.

"That's"+str(1+1)+''+str(True)

"That's2True"

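As a brief aside (an addition of ours, not covered in this section), the built-in int and float functions go the other direction, converting strings that describe numbers into numeric values:

```python
# str turns values into text; int and float turn text back into numbers.
print(int("12") + 3)       # 15
print(float("1.5") * 2)    # 3.0
print("12" + "3")          # 123 (concatenation, not addition)
```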
String Methods

From an existing string, related strings can be constructed using string methods, which are functions that operate on strings. These methods are called by placing a dot after the string, then calling the function.

For example, the following method generates an uppercased version of a string.

"loud".upper()

'LOUD'

Perhaps the most important method is replace, which replaces all instances of a substring within the string. The replace method takes two arguments, the text to be replaced and its replacement.


'hitchhiker'.replace('hi', 'ma')

'matchmaker'

String methods can also be invoked using variable names, as long as those names are bound to strings. So, for instance, the following two-step process generates the word "degrade" starting from "train" by first creating "ingrain" and then applying a second replacement.

s="train"

t=s.replace('t','ing')

u=t.replace('in','de')

u

'degrade'


Comparisons

Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, 3 is larger than 1 + 1.

3 > 1 + 1

True

The value True indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between 3 and 1 + 1. The full set of common comparison operators are listed below.

Comparison           Operator    True example    False example

Less than            <           2 < 3           2 < 2
Greater than         >           3 > 2           3 > 3
Less than or equal   <=          2 <= 2          3 <= 2
Greater or equal     >=          3 >= 3          2 >= 3
Equal                ==          3 == 3          3 == 2
Not equal            !=          3 != 2          2 != 2

An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express that 1 + 1 is between 1 and 3 using the following expression.

1 < 1 + 1 < 3

True

The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers x and y below. You can try different values of x and y to confirm this relationship.


x = 12

y = 5

min(x, y) <= (x + y) / 2 <= max(x, y)

True

Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.

"Dog">"Catastrophe">"Cat"

True

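One detail worth knowing (an addition of ours, not from the passage above): in this ordering, every uppercase letter comes before every lowercase letter, which can make mixed-case comparisons surprising.

```python
print("Dog" > "Cat")   # True: alphabetical order, as described above
print("dog" > "Dog")   # True: lowercase letters come after uppercase ones
print("Z" < "a")       # True: all uppercase letters precede all lowercase letters
```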

Sequences

Values can be grouped together into collections, which allows programmers to organize those values and refer to all of them with a single name. Separating expressions by commas within square brackets places the values of those expressions into a list, which is a built-in sequential collection. After a comma, it's okay to start a new line. Below, we collect four different temperatures into a list called highs. These are the estimated average daily high temperatures over all land on Earth (in degrees Celsius) for the decades surrounding 1850, 1900, 1950, and 2000, respectively, expressed as deviations from the average absolute high temperature between 1951 and 1980, which was 14.48 degrees. (http://berkeleyearth.lbl.gov/regions/global-land)

baseline_high = 14.48

highs = [baseline_high - 0.880, baseline_high - 0.093,
         baseline_high + 0.105, baseline_high + 0.684]

highs

[13.6, 14.387, 14.585, 15.164]

Lists allow us to pass multiple values into a function using a single name. For instance, the sum function computes the sum of all values in a list, and the len function computes the length of the list. Using them together, we can compute the average of the list.

sum(highs) / len(highs)

14.434000000000001

Tuples. Values can also be grouped together into tuples in addition to lists. A tuple is expressed by separating elements using commas within parentheses: (1, 2, 3). In many cases, lists and tuples are interchangeable. We will use a tuple to hold the average daily low temperatures for the same four time ranges. The only change in the code is the switch to parentheses instead of brackets.


baseline_low = 3.00

lows = (baseline_low - 0.872, baseline_low - 0.629,
        baseline_low - 0.126, baseline_low + 0.728)

lows

(2.128, 2.371, 2.874, 3.7279999999999998)

The complete chart of daily high and low temperatures appears below.

[Chart: Mean of Daily High Temperature]

[Chart: Mean of Daily Low Temperature]

Arrays

Many experiments and datasets involve multiple values of the same type. An array is a collection of values that all have the same type. The numpy package, abbreviated np in programs, provides Python programmers with convenient and powerful functions for creating and manipulating arrays.

import numpy as np

An array is created using the np.array function, which takes a list or tuple as an argument. Here, we create arrays of average daily high and low temperatures for the decades surrounding 1850, 1900, 1950, and 2000.

baseline_high = 14.48

highs = np.array([baseline_high - 0.880,
                  baseline_high - 0.093,
                  baseline_high + 0.105,
                  baseline_high + 0.684])

highs

array([13.6, 14.387, 14.585, 15.164])

baseline_low = 3.00

lows = np.array((baseline_low - 0.872, baseline_low - 0.629,
                 baseline_low - 0.126, baseline_low + 0.728))

lows

array([2.128, 2.371, 2.874, 3.728])

Arrays differ from lists and tuples because they can be used in arithmetic expressions to compute over their contents. When an array is combined with a single number, that number is combined with each element of the array. Therefore, we can convert all of these temperatures to Fahrenheit by writing the familiar conversion formula.


(9/5) * highs + 32

array([56.48, 57.8966, 58.253, 59.2952])

When two arrays are combined together using an arithmetic operator, their individual values are combined. For example, we can compute the daily temperature ranges by subtracting.

highs - lows

array([11.472, 12.016, 11.711, 11.436])

Arrays also have methods, which are functions that operate on the array values. The mean of a collection of numbers is its average value: the sum divided by the length. Each pair of parentheses in the example below is part of a call expression; it's calling a function with no arguments to perform a computation on the array called highs.

[highs.size, highs.sum(), highs.mean()]

[4, 57.736000000000004, 14.434000000000001]

It's common that small rounding errors appear in arithmetic on a computer. For instance, the mean (average) above is in fact 14.434, but an extra 0.000000000000001 has been added during the computation of the average. Arithmetic rounding errors are an important topic in scientific computing, but won't be a focus of this text. Everyone who works with data must be aware that some operations will be approximated so that values can be stored and manipulated efficiently.

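A classic stand-alone illustration of such rounding (a standard Python example, not specific to arrays):

```python
# 0.1 and 0.2 have no exact binary representation, so their sum
# is not exactly 0.3.
print(0.1 + 0.2)                        # 0.30000000000000004
print(0.1 + 0.2 == 0.3)                 # False
print(abs((0.1 + 0.2) - 0.3) < 1e-9)    # True: compare with a tolerance
```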
Functions on Arrays

In addition to basic arithmetic and methods such as min and max, manipulating arrays often involves calling functions that are part of the np package. For example, the diff function computes the difference between each adjacent pair of elements in an array. The first element of the diff is the second element minus the first.

np.diff(highs)

array([0.787, 0.198, 0.579])

The full Numpy reference lists these functions exhaustively, but only a small subset are used commonly for data processing applications. These are grouped into different packages within np. Learning this vocabulary is an important part of learning the Python language, so refer back to this list often as you work through examples and problems.

Each of these functions takes an array as an argument and returns a single value.

Function            Description

np.prod             Multiply all elements together
np.sum              Add all elements together
np.all              Test whether all elements are true values (non-zero numbers are true)
np.any              Test whether any elements are true values (non-zero numbers are true)
np.count_nonzero    Count the number of non-zero elements

Each of these functions takes an array as an argument and returns an array of values.

Function      Description

np.diff       Difference between adjacent elements
np.round      Round each number to the nearest integer (whole number)
np.cumprod    A cumulative product: for each element, multiply all elements so far
np.cumsum     A cumulative sum: for each element, add all elements so far
np.exp        Exponentiate each element
np.log        Take the natural logarithm of each element
np.sqrt       Take the square root of each element
np.sort       Sort the elements

Each of these functions takes an array of strings and returns an array.


Function             Description

np.char.lower        Lowercase each element
np.char.upper        Uppercase each element
np.char.strip        Remove spaces at the beginning or end of each element
np.char.isalpha      Whether each element is only letters (no numbers or symbols)
np.char.isnumeric    Whether each element is only numeric (no letters)

Each of these functions takes both an array of strings and a search string; each returns an array.

Function              Description

np.char.count         Count the number of times a search string appears among the elements of an array
np.char.find          The position within each element that a search string is found first
np.char.rfind         The position within each element that a search string is found last
np.char.startswith    Whether each element starts with the search string

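A few of these can be tried out on a small example (the array of words here is made up for illustration):

```python
import numpy as np

words = np.array(['spam', 'eggs', 'bacon'])

print(np.char.upper(words))             # ['SPAM' 'EGGS' 'BACON']
print(np.char.count(words, 'a'))        # [1 0 1]
print(np.char.startswith(words, 'b'))   # [False False  True]
```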

Ranges

A range is an array of numbers in increasing or decreasing order, each separated by a regular interval. Ranges are defined using the np.arange function, which takes either one, two, or three arguments.

np.arange(end): An array starting with 0 of increasing integers up to end

np.arange(start, end): An array of increasing integers from start up to end

np.arange(start, end, step): A range with step between each pair of consecutive values

A range always includes its start value, but does not include its end value. The step can be either positive or negative and may be a whole number or a fraction.

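Each of these rules can be checked with a quick example (the particular values are our own):

```python
import numpy as np

print(np.arange(4))           # starts at 0, excludes the end value: 0 1 2 3
print(np.arange(0, 1, 0.25))  # a fractional step: 0., 0.25, 0.5, 0.75
print(np.arange(5, 0, -2))    # a negative step counts down: 5 3 1
```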
by_four = np.arange(1, 100, 4)

by_four

array([ 1,  5,  9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61, 65,
       69, 73, 77, 81, 85, 89, 93, 97])

Ranges have many uses. For instance, a range can be used to compute part of the Leibniz formula for π, which is typically written as

π = 4 × (1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + ...)

4 * sum(1/by_four - 1/(by_four + 2))

3.1215946525910097

The Birthday Problem

There are k students in a class. What is the chance that at least two of the students have the same birthday?

Assumptions


1. No leap years; every year has 365 days
2. Births are distributed evenly throughout the year
3. No student's birthday is affected by any other (e.g. twins)

Let's start with an easy case: k is 4. We'll first find the chance that all four people have different birthdays.

all_different = (364/365) * (363/365) * (362/365)

The chance that there is at least one pair with the same birthday is equivalent to the chance that the birthdays are not all different. Given the chance of some event occurring, the chance that the event does not occur is one minus the chance that it does. With only 4 people, the chance that any two have the same birthday is less than 2%.

1 - all_different

0.016355912466550215

Using a range, we can express this same computation compactly. We begin with the numerators of each factor.

k = 4

numerators = np.arange(364, 365 - k, -1)

numerators

array([364, 363, 362])

Then, we divide each numerator by 365 to form the factors, multiply them all together, and subtract from 1.

1 - np.prod(numerators / 365)

0.016355912466550215

If k is 40, the chance of two birthdays being the same is much higher: almost 90%!


k = 40

numerators = np.arange(364, 365 - k, -1)

1 - np.prod(numerators / 365)

0.89123180981794903

Using ranges, we can investigate how this chance changes as k increases. The np.cumprod function computes the cumulative product of an array. That is, it computes a new array that has the same length as its input, but the ith element is the product of the first i terms. Below, the fourth term of the result is 1 * 2 * 3 * 4.

ten = np.arange(1, 9)

np.cumprod(ten)

array([1, 2, 6, 24, 120, 720, 5040, 40320])

The following cell computes the chance of matching birthdays for every class size from 2 up to 365. Scrolling through the result, you will see that as k increases, the chance of matching birthdays reaches 1 long before the end of the array. In fact, for any k smaller than 365, there is a chance that all k students in a class can have different birthdays. The chance is so small, however, that the difference from 1 is rounded away by the computer.

numerators = np.arange(364, 0, -1)

chances = 1 - np.cumprod(numerators / 365)

chances

array([0.00273973,0.00820417,0.01635591,0.02713557,0.04046248,

0.0562357,0.07433529,0.09462383,0.11694818,0.14114138,

0.16702479,0.19441028,0.22310251,0.25290132,0.28360401,

0.31500767,0.34691142,0.37911853,0.41143838,0.44368834,

0.47569531,0.50729723,0.53834426,0.5686997,0.59824082,

0.62685928,0.65446147,0.68096854,0.70631624,0.73045463,

0.75334753,0.77497185,0.79531686,0.81438324,0.83218211,

0.84873401,0.86406782,0.87821966,0.89123181,0.90315161,

0.91403047,0.92392286,0.93288537,0.9409759,0.94825284,

0.9547744,0.96059797,0.96577961,0.97037358,0.97443199,


0.97800451,0.98113811,0.98387696,0.98626229,0.98833235,

0.99012246,0.99166498,0.99298945,0.99412266,0.9950888,

0.99590957,0.99660439,0.99719048,0.99768311,0.9980957,

0.99844004,0.99872639,0.99896367,0.99915958,0.99932075,

0.99945288,0.99956081,0.99964864,0.99971988,0.99977744,

0.99982378,0.99986095,0.99989067,0.99991433,0.99993311,

0.99994795,0.99995965,0.99996882,0.999976,0.99998159,

0.99998593,0.99998928,0.99999186,0.99999385,0.99999537,

0.99999652,0.9999974,0.99999806,0.99999856,0.99999893,

0.99999922,0.99999942,0.99999958,0.99999969,0.99999978,

0.99999984,0.99999988,0.99999992,0.99999994,0.99999996,

0.99999997,0.99999998,0.99999998,0.99999999,0.99999999,

0.99999999,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,


1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.,1.,

1.,1.,1.,1.])

The item method of an array allows us to select a particular element from the array by its position. The starting position of an array is item(0), so finding the chance of matching birthdays in a 40-person class involves extracting item(40 - 2) (because the starting item is for a 2-person class).

chances.item(40 - 2)

0.891231809817949

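A natural follow-up question (an aside of our own, using np.argmax, which returns the position of the first True value in an array): what is the smallest class in which a matching birthday is more likely than not? Recomputing the chances array and searching it gives the well-known answer of 23.

```python
import numpy as np

# Recompute the chances for class sizes 2 through 365, as above.
numerators = np.arange(364, 0, -1)
chances = 1 - np.cumprod(numerators / 365)

# Position 0 corresponds to a class of 2, so add 2 to the position.
smallest = int(np.argmax(chances > 0.5)) + 2
print(smallest)  # 23
```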

Tables

Tables are a fundamental object type for representing data sets. A table can be viewed in two ways. Tables are a sequence of named columns that each describe a single aspect of all entries in a data set. Tables are also a sequence of rows that each contain all information about a single entry in a data set.

In order to use tables, import all of the module called datascience, a module created for this text.

from datascience import *

Empty tables can be created using the Table function, which optionally takes a list of column labels. The with_row and with_rows methods of a table return new tables with one or more additional rows. For example, we could create a table of Odd and Even numbers.

Table(['Odd', 'Even']).with_row([3, 4])

Odd    Even
3      4

Table(['Odd', 'Even']).with_rows([[3, 4], [5, 6], [7, 8]])

Odd    Even
3      4
5      6
7      8

Tables can also be extended with additional columns. The with_column and with_columns methods return new tables with additional labeled columns. Below, we begin each example with an empty table that has no columns.

Table().with_column('Odd', [3, 5, 7])

Odd
3
5
7

Table().with_columns([
    'Odd', [3, 5, 7],
    'Even', [4, 6, 8]
])

Odd    Even
3      4
5      6
7      8

Tables are more typically created from files that contain comma-separated values, called CSV files. The file below contains "Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States."

census_url = 'http://www.census.gov/popest/data/national/asrh/2014/files/NC-EST2014-AGESEX-RES.csv'

full_census_table = Table.read_table(census_url)

full_census_table

SEX    AGE    CENSUS2010POP    ESTIMATESBASE2010    POPESTIMATE2010
0      0      3944153          3944160              3951330
0      1      3978070          3978090              3957888
0      2      4096929          4096939              4090862
0      3      4119040          4119051              4111920
0      4      4063170          4063186              4077552
0      5      4056858          4056872              4064653
0      6      4066381          4066412              4073013
0      7      4030579          4030594              4043047
0      8      4046486          4046497              4025604
0      9      4148353          4148369              4125415

... (296 rows omitted)

A description of the table appears online. The SEX column contains numeric codes: 0 stands for the total, 1 for male, and 2 for female. The AGE column contains ages, but the special value 999 is a sum of the total population. The rest of the columns contain estimates of the US population.

Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. We can select only a subset of the columns using the select method.

partial_census_table = full_census_table.select(['SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2014'])

partial_census_table

SEX    AGE    POPESTIMATE2010    POPESTIMATE2014
0      0      3951330            3948350
0      1      3957888            3962123
0      2      4090862            3957772
0      3      4111920            4005190
0      4      4077552            4003448
0      5      4064653            4004858
0      6      4073013            4134352
0      7      4043047            4154000
0      8      4025604            4119524
0      9      4125415            4106832

... (296 rows omitted)

The relabeled method creates an alternative version of the table that replaces a column label. Using this method, we can simplify the labels of the selected columns.

simple = partial_census_table.relabeled('POPESTIMATE2010', '2010').relabeled('POPESTIMATE2014', '2014')

simple


SEX    AGE    2010       2014
0      0      3951330    3948350
0      1      3957888    3962123
0      2      4090862    3957772
0      3      4111920    4005190
0      4      4077552    4003448
0      5      4064653    4004858
0      6      4073013    4134352
0      7      4043047    4154000
0      8      4025604    4119524
0      9      4125415    4106832

... (296 rows omitted)

The column method returns an array of the values in a particular column. Each column of a table is an array of the same length, and so columns can be combined using arithmetic.

simple.column('2014') - simple.column('2010')

array([-2980, 4235, -133090, ..., 6717, 13410, 4662996])

A table with an additional column can be created from an existing table using the with_column method, which takes a label and a sequence of values as arguments. There must be either as many values as there are existing rows in the table or a single value (which fills the column with that value).

change = simple.column('2014') - simple.column('2010')

annual_growth_rate = (simple.column('2014') / simple.column('2010')) ** (1/4) - 1

census = simple.with_columns(['Change', change, 'Growth', annual_growth_rate])

census


SEX    AGE    2010       2014       Change     Growth
0      0      3951330    3948350    -2980      -0.000188597
0      1      3957888    3962123    4235       0.000267397
0      2      4090862    3957772    -133090    -0.00823453
0      3      4111920    4005190    -106730    -0.0065532
0      4      4077552    4003448    -74104     -0.00457471
0      5      4064653    4004858    -59795     -0.00369821
0      6      4073013    4134352    61339      0.00374389
0      7      4043047    4154000    110953     0.00679123
0      8      4025604    4119524    93920      0.00578232
0      9      4125415    4106832    -18583     -0.00112804

... (296 rows omitted)

Although the columns of this table are simply arrays of numbers, the format of those numbers can be changed to improve the interpretability of the table. The set_format method takes Formatter objects, which exist for dates (DateFormatter), currencies (CurrencyFormatter), numbers, and percentages.

census.set_format('Growth',PercentFormatter)

census.set_format(['2010','2014','Change'],NumberFormatter)

SEX AGE 2010 2014 Change Growth

0 0 3,951,330 3,948,350 -2,980 -0.02%

0 1 3,957,888 3,962,123 4,235 0.03%

0 2 4,090,862 3,957,772 -133,090 -0.82%

0 3 4,111,920 4,005,190 -106,730 -0.66%

0 4 4,077,552 4,003,448 -74,104 -0.46%

0 5 4,064,653 4,004,858 -59,795 -0.37%

0 6 4,073,013 4,134,352 61,339 0.37%

0 7 4,043,047 4,154,000 110,953 0.68%

0 8 4,025,604 4,119,524 93,920 0.58%

0 9 4,125,415 4,106,832 -18,583 -0.11%

...(296rowsomitted)


The information in a table can be accessed in many ways. The attributes demonstrated below can be used for any table.

census.labels

('SEX','AGE','2010','2014','Change','Growth')

census.num_columns

6

census.num_rows

306

census.row(5)

Row(SEX=0,AGE=5,2010=4064653,2014=4004858,Change=-59795,Growth=-0.0036982077956316806)

census.column(2)

array([3951330,3957888,4090862,...,26074,45058,

157257573])

An individual item in a table can be accessed via a row or a column.

census.row(0).item(2)

3951330


census.column(2).item(0)

3951330

Let's take a look at the growth rates of the total number of males and females by selecting only the rows that sum over all ages. This sum is expressed with the special value 999, according to this dataset description.

census.where('AGE',999)

SEX AGE 2010 2014 Change Growth

0 999 309,347,057 318,857,056 9,509,999 0.76%

1 999 152,089,484 156,936,487 4,847,003 0.79%

2 999 157,257,573 161,920,569 4,662,996 0.73%

What ages of males are driving this rapid growth? We can first filter the census table to keep only the male entries, then sort by growth rate in decreasing order.

males=census.where('SEX',1)

males.sort('Growth',descending=True)

SEX AGE 2010 2014 Change Growth

1 99 6,104 9,037 2,933 10.31%

1 100 9,351 13,729 4,378 10.08%

1 98 9,504 13,649 4,145 9.47%

1 93 60,182 85,980 25,798 9.33%

1 96 22,022 31,235 9,213 9.13%

1 94 43,828 62,130 18,302 9.12%

1 97 14,775 20,479 5,704 8.50%

1 95 31,736 42,824 11,088 7.78%

1 91 104,291 138,080 33,789 7.27%

1 92 83,462 109,873 26,411 7.12%

...(92rowsomitted)


The fact that there are more men with AGE of 100 than 99 looks suspicious; shouldn't there be fewer? A careful look at the description of the dataset reveals that the 100 category actually includes all men who are 100 or older. The growth rates in men at these very old ages could have several explanations, such as a large influx from another country, but the most natural explanation is that people are simply living longer in 2014 than in 2010.

The where method can also take an array of boolean values, constructed by applying some comparison operator to a column of the table. For example, we can find all of the age groups among males for which the absolute Change is substantial. The show method displays all rows without abbreviating.

males.where(males.column('Change')>200000).show()

SEX AGE 2010 2014 Change Growth

1 23 2,151,095 2,399,883 248,788 2.77%

1 24 2,161,380 2,391,398 230,018 2.56%

1 34 1,908,761 2,192,455 283,694 3.52%

1 57 1,910,028 2,110,149 200,121 2.52%

1 59 1,779,504 2,006,900 227,396 3.05%

1 64 1,291,843 1,661,474 369,631 6.49%

1 65 1,272,693 1,607,688 334,995 6.02%

1 66 1,239,805 1,589,127 349,322 6.40%

1 67 1,270,148 1,653,257 383,109 6.81%

1 71 903,263 1,169,356 266,093 6.67%

1 999 152,089,484 156,936,487 4,847,003 0.79%

By far the largest changes are clumped together in the 64-67 range. In 2014, these people would have been born between 1947 and 1951, the height of the post-WWII baby boom in the United States.

It is possible to specify multiple conditions using the functions np.logical_and and np.logical_or. When two conditions are combined with logical_and, both must be true for a row to be retained. When conditions are combined with logical_or, either one of them must be true for a row to be retained. Here are two different ways to select 18 and 19 year olds.
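As a minimal illustration of how these functions combine boolean arrays, consider a small hypothetical array of ages (separate from the census table):

```python
import numpy as np

ages = np.array([17, 18, 19, 20])

# True only where both comparisons hold for the same element
both_conditions = np.logical_and(ages >= 18, ages <= 19)
print(both_conditions)  # True for the entries 18 and 19 only

# True where at least one comparison holds; here the two selections agree
either_condition = np.logical_or(ages == 18, ages == 19)
print(either_condition)  # also True for 18 and 19 only
```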


both=census.where(census.column('SEX')!=0)

both.where(np.logical_or(both.column('AGE')==18,

both.column('AGE')==19))

SEX AGE 2010 2014 Change Growth

1 18 2,305,733 2,165,062 -140,671 -1.56%

1 19 2,334,906 2,220,790 -114,116 -1.24%

2 18 2,185,272 2,060,528 -124,744 -1.46%

2 19 2,236,479 2,105,604 -130,875 -1.50%

both.where(np.logical_and(both.column('AGE')>=18,

both.column('AGE')<=19))

SEX AGE 2010 2014 Change Growth

1 18 2,305,733 2,165,062 -140,671 -1.56%

1 19 2,334,906 2,220,790 -114,116 -1.24%

2 18 2,185,272 2,060,528 -124,744 -1.46%

2 19 2,236,479 2,105,604 -130,875 -1.50%

Here, we observe a rather dramatic decrease in the number of 18 and 19 year olds in the United States; the children of the baby boom generation are now in their 20's, leaving fewer teenagers in America.

Aggregation and Grouping. This particular dataset includes entries for sums across all ages and sexes, using the special values 999 and 0 respectively. However, if these rows did not exist, we would be able to aggregate the remaining entries in order to compute those same results.

both.where('AGE',999).select(['SEX','2014'])

SEX 2014

1 156,936,487

2 161,920,569


no_sums = both.select(['SEX', 'AGE', '2014']).where(both.column('AGE') != 999)

females=no_sums.where('SEX',2).select(['AGE','2014'])

females

AGE 2014

0 1,930,493

1 1,938,870

2 1,935,270

3 1,956,572

4 1,959,950

5 1,961,391

6 2,024,024

7 2,031,760

8 2,014,402

9 2,009,560

...(91rowsomitted)

sum(females['2014'])

161920569

Some columns express categories, such as the sex or age group in the case of the census table. Aggregation can also be performed on every value for a category using the group and groups methods. The group method takes a single column (or column name) as an argument and collects all values associated with each unique value in that column.

The second argument to group, the name of a function, specifies how to aggregate the resulting values. For example, the sum function can be used to add together the populations for each age. In this result, the SEX sum column is meaningless because the values were simply codes for different categories. However, the 2014 sum column does in fact contain the total number across all sexes for each age category.

no_sums.select(['AGE','2014']).group('AGE',sum)


AGE 2014sum

0 3948350

1 3962123

2 3957772

3 4005190

4 4003448

5 4004858

6 4134352

7 4154000

8 4119524

9 4106832

...(91rowsomitted)

Joining Tables. There are many situations in data analysis in which two different rows need to be considered together. Two tables can be joined into one, an operation that creates one long row out of two matching rows. These matching rows can be from the same table or different tables.

For example, we might want to estimate which age categories are expected to change significantly in the future. Someone who is 14 years old in 2014 will be 20 years old in 2020. Therefore, one estimate of the number of 20 year olds in 2020 is the number of 14 year olds in 2014. Between the ages of 1 and 50, annual mortality rates are very low (less than 0.5% for men and less than 0.3% for women [1]). Immigration also affects population changes, but for now we will ignore its influence. Let's consider just females in this analysis.

females['AGE+6']=females['AGE']+6

females


AGE 2014 AGE+6

0 1,930,493 6

1 1,938,870 7

2 1,935,270 8

3 1,956,572 9

4 1,959,950 10

5 1,961,391 11

6 2,024,024 12

7 2,031,760 13

8 2,014,402 14

9 2,009,560 15

...(91rowsomitted)

In order to relate the age in 2014 to the age in 2020, we will join this table with itself, matching values in the AGE column with values in the AGE+6 column (the age in 2020).

joined=females.join('AGE',females,'AGE+6')

joined

AGE 2014 AGE+6 AGE_2 2014_2

6 2024024 12 0 1930493

7 2031760 13 1 1938870

8 2014402 14 2 1935270

9 2009560 15 3 1956572

10 2015380 16 4 1959950

11 2001949 17 5 1961391

12 1993547 18 6 2024024

13 2041159 19 7 2031760

14 2068252 20 8 2014402

15 2035299 21 9 2009560

...(85rowsomitted)


The resulting table has five columns. The first three are the same as before. The two new columns are the values for AGE and 2014 that appeared in a different row, the one in which that AGE appeared in the AGE+6 column. For instance, the first row contains the number of 6 year olds in 2014 and an estimate of the number of 6 year olds in 2020 (who were 0 in 2014). Some relabeling and selecting makes this table clearer.

future = joined.select(['AGE', '2014', '2014_2']).relabeled('2014_2', '2020 (est)')

future

AGE 2014 2020(est)

6 2024024 1930493

7 2031760 1938870

8 2014402 1935270

9 2009560 1956572

10 2015380 1959950

11 2001949 1961391

12 1993547 2024024

13 2041159 2031760

14 2068252 2014402

15 2035299 2009560

...(85rowsomitted)

Adding a Change column and sorting by that change describes some of the major changes in age categories that we can expect in the United States for people under 50. According to this simplistic analysis, there will be substantially more people in their late 30's by 2020.

predictions=future.where(future['AGE']<50)

predictions['Change'] = predictions['2020 (est)'] - predictions['2014']

predictions.sort('Change',descending=True)


AGE 2014 2020(est) Change

40 1940627 2170440 229813

38 1936610 2154232 217622

30 2110672 2301237 190565

37 1995155 2148981 153826

39 1993913 2135416 141503

29 2169563 2298701 129138

35 2046713 2169563 122850

36 2009423 2110672 101249

28 2144666 2244480 99814

41 1977497 2046713 69216

...(34rowsomitted)


Functions

We are building up a useful inventory of techniques for identifying patterns and themes in a dataset. Sorting and grouping rows of a table can focus our attention. The next approach to analysis we will consider involves grouping rows of a table by arbitrary criteria. To do so, we will explore two core features of the Python programming language: function definition and conditional statements.

We have used functions extensively already in this text, but never defined a function of our own. The purpose of defining a function is to give a name to a computational process that may be applied multiple times. There are many situations in computing that require repeated computation. For example, it is often the case that we will want to perform the same manipulation on every value in a column.

A function is defined in Python using a def statement, which is a multi-line statement that begins with a header line giving the name of the function and names for the arguments of the function. The rest of the def statement, called the body, must be indented below the header.

A function expresses a relationship between its inputs (called arguments) and its outputs (called return values). The number of arguments required to call a function is the number of names that appear within parentheses in the def statement header. The values that are returned depend on the body. Whenever a function is called, its body is executed. Whenever a return statement within the body is executed, the call to the function completes, and the value of the return expression is returned as the value of the function call.
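These rules apply to functions of any number of arguments. As a small hypothetical example, here is a function with two argument names in its header:

```python
def difference(a, b):
    # Both argument names from the header are available in the body;
    # executing this return statement completes the call
    return a - b

difference(10, 4)  # evaluates to 6
```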

The definition of the percent function below multiplies a number by 100 and rounds the result to two decimal places.

def percent(x):
    return round(100 * x, 2)

The primary difference between defining a percent function and simply evaluating its return expression, round(100 * x, 2), is that when a function is defined, its return expression is not immediately evaluated. It cannot be, because the value for x is not yet defined. Instead, the return expression is evaluated whenever this percent function is called, by placing parentheses after the name percent and placing an expression to compute its argument in parentheses.


percent(1/6)

16.67

percent(1/6000)

0.02

percent(1/60000)

0.0

The three expressions above are all call expressions. In the first, the value of 1/6 is computed and then passed as the argument named x to the percent function. When the percent function is called in this way, its body is executed. The body of percent has only a single line: return round(100 * x, 2). Executing this return statement completes execution of the percent function's body and computes the value of the call expression.

The same result is computed by passing a named value as an argument. The percent function does not know or care how its argument is computed or stored; its only job is to execute its own body using the arguments passed to it.

sixth=1/6

percent(sixth)

16.67

Conditional Statements. The body of a function can have more than one line and more than one return statement. A conditional statement is a multi-line statement that allows Python to choose among different alternatives based on the truth value of an expression. While conditional statements can appear anywhere, they appear most often within the body of a function in order to express alternative behavior depending on argument values.


A conditional statement always begins with an if header, which is a single line followed by an indented body. The body is only executed if the expression directly following if (called the if expression) evaluates to a true value. If the if expression evaluates to a false value, then the body of the if is skipped.

For example, we can improve our percent function so that it doesn't round very small numbers to zero so readily. The behavior of percent(1/6) is unchanged, but percent(1/60000) provides a more useful result.

def percent(x):
    if x < 0.00005:
        return 100 * x
    return round(100 * x, 2)

percent(1/6)

16.67

percent(1/6000)

0.02

percent(1/60000)

0.0016666666666666668

A conditional statement can also have multiple clauses with multiple bodies, and only one of those bodies can ever be executed. The general format of a multi-clause conditional statement appears below.


if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>

There is always exactly one if clause, but there can be any number of elif clauses. Python will evaluate the if and elif expressions in the headers in order until one is found that is a true value, then execute the corresponding body. The else clause is optional. When an else header is provided, its else body is executed only if none of the header expressions of the previous clauses are true. The else clause must always come at the end (or not at all).
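As a small, self-contained illustration of this format (a hypothetical sign function, separate from percent), note that exactly one of the three bodies is executed for any argument:

```python
def sign(x):
    # The headers are checked in order; the first true one wins,
    # and the else body runs only if neither comparison is true
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

sign(-2)  # 'negative'
```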

Let us continue to refine our percent function. Perhaps for some analysis, any value below 1e-8 should be considered close enough to 0 that it can be ignored. The following function definition handles this case as well.

def percent(x):
    if x < 1e-8:
        return 0.0
    elif x < 0.00005:
        return 100 * x
    else:
        return round(100 * x, 2)

percent(1/6)

16.67

percent(1/6000)

0.02


percent(1/60000)

0.0016666666666666668

percent(1/60000000000)

0.0

A well-composed function has a name that evokes its behavior, as well as a docstring: a description of its behavior and expectations about its arguments. The docstring can also show example calls to the function, where the call is preceded by >>>.

A docstring can be any string that immediately follows the header line of a def statement. Docstrings are typically defined using triple quotation marks at the start and end, which allows the string to span multiple lines. The first line is conventionally a complete but short description of the function, while following lines provide further guidance to future users of the function.

A more complete definition of percent that includes a docstring appears below.


def percent(x):
    """Convert x to a percentage by multiplying by 100.

    Percentages are conventionally rounded to two decimal places,
    but precision is retained for any x above 1e-8 that would
    otherwise round to 0.

    >>> percent(1/6)
    16.67
    >>> percent(1/6000)
    0.02
    >>> percent(1/60000)
    0.0016666666666666668
    >>> percent(1/60000000000)
    0.0
    """
    if x < 1e-8:
        return 0.0
    elif x < 0.00005:
        return 100 * x
    else:
        return round(100 * x, 2)


Functions and Tables

In this section, we will continue to use the US Census dataset. We will focus only on the 2014 population estimate.

census_2014 = census.select(['SEX', 'AGE', '2014']).where(census.column('AGE') != 999)

census_2014

SEX AGE 2014

0 0 3,948,350

0 1 3,962,123

0 2 3,957,772

0 3 4,005,190

0 4 4,003,448

0 5 4,004,858

0 6 4,134,352

0 7 4,154,000

0 8 4,119,524

0 9 4,106,832

...(293rowsomitted)

Functions can be used to compute new columns in a table based on existing column values. For example, we can transform the codes in the SEX column to strings that are easier to interpret.

def male_female(code):
    if code == 0:
        return 'Total'
    elif code == 1:
        return 'Male'
    elif code == 2:
        return 'Female'


This function takes an individual code (0, 1, or 2) and returns a string that describes its meaning.

male_female(0)

'Total'

male_female(1)

'Male'

male_female(2)

'Female'

We could also transform ages into age categories.

def age_group(age):
    if age < 2:
        return 'Baby'
    elif age < 13:
        return 'Child'
    elif age < 20:
        return 'Teen'
    else:
        return 'Adult'

age_group(15)

'Teen'

Apply. The apply method of a table calls a function on each element of a column, forming a new array of return values. To indicate which function to call, just name it (without quotation marks). The name of the column of input values must still appear within quotation marks.

census_2014.apply(male_female,'SEX')

array(['Total','Total','Total',...,'Female','Female','Female'],

dtype='<U6')

This array, which has the same length as the original SEX column of the population table, can be used as the values in a new column called Male/Female alongside the existing AGE and Population columns.

population=Table().with_columns([

'Male/Female',census_2014.apply(male_female,'SEX'),

'AgeGroup',census_2014.apply(age_group,'AGE'),

'Population',census_2014.column('2014')

])

population

Male/Female AgeGroup Population

Total Baby 3948350

Total Baby 3962123

Total Child 3957772

Total Child 4005190

Total Child 4003448

Total Child 4004858

Total Child 4134352

Total Child 4154000

Total Child 4119524

Total Child 4106832

...(293rowsomitted)

Groups. The group method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.


population.group('AgeGroup')

AgeGroup count

Adult 243

Baby 6

Child 33

Teen 21

The optional second argument names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum will sum up the populations in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.

totals = population.where(0, 'Total').select(['AgeGroup', 'Population'])

totals

AgeGroup Population

Baby 3948350

Baby 3962123

Child 3957772

Child 4005190

Child 4003448

Child 4004858

Child 4134352

Child 4154000

Child 4119524

Child 4106832

...(91rowsomitted)

totals.group('AgeGroup',sum)


AgeGroup Populationsum

Adult 236721454

Baby 7910473

Child 44755656

Teen 29469473

The groups method behaves in the same way, but accepts a list of columns as its first argument. The resulting table has one row for every unique combination of values that appear together in the grouped columns. Again, a single argument (a list, in this case) gives row counts.

population.groups(['Male/Female','AgeGroup'])

Male/Female AgeGroup count

Female Adult 81

Female Baby 2

Female Child 11

Female Teen 7

Male Adult 81

Male Baby 2

Male Child 11

Male Teen 7

Total Adult 81

Total Baby 2

...(2rowsomitted)

A second argument to groups aggregates all other columns that do not appear in the list of grouped columns.

population.groups(['Male/Female','AgeGroup'],sum)


Male/Female AgeGroup Populationsum

Female Adult 121754366

Female Baby 3869363

Female Child 21903805

Female Teen 14393035

Male Adult 114967088

Male Baby 4041110

Male Child 22851851

Male Teen 15076438

Total Adult 236721454

Total Baby 7910473

...(2rowsomitted)

Pivot. The pivot method is closely related to the groups method: it groups together rows that share a combination of values. It differs from groups because it organizes the resulting values in a grid. The first argument to pivot is a column that contains the values that will be used to form new columns in the result. The second argument is a column used for grouping rows. The result gives the count of all rows that share the combination of column and row values.

population.pivot('Male/Female','AgeGroup')

AgeGroup Female Male Total

Adult 81 81 81

Baby 2 2 2

Child 11 11 11

Teen 7 7 7

An optional third argument indicates a column of values that will replace the counts in each cell of the grid. The fourth argument indicates how to aggregate all of the values that match the combination of column and row values.

pivoted = population.pivot('Male/Female', 'AgeGroup', 'Population', sum)

pivoted


AgeGroup Female Male Total

Adult 121754366 114967088 236721454

Baby 3869363 4041110 7910473

Child 21903805 22851851 44755656

Teen 14393035 15076438 29469473

The advantage of pivot is that it places grouped values into adjacent columns, so that they can be combined. For instance, this pivoted table allows us to compute the proportion of each age group that is male. We find the surprising result that younger age groups are predominantly male, but among adults there are substantially more females.

pivoted.with_column('Male Percentage', pivoted.column('Male') / pivoted.column('Total'))

AgeGroup Female Male Total MalePercentage

Adult 121754366 114967088 236721454 48.57%

Baby 3869363 4041110 7910473 51.09%

Child 21903805 22851851 44755656 51.06%

Teen 14393035 15076438 29469473 51.16%


Chapter 2

Visualizations

While we have been able to make several interesting observations about data by simply running our eyes down the numbers in a table, our task becomes much harder as tables become larger. In data science, a picture can be worth a thousand numbers.

We saw an example of this earlier in the text, when we examined John Snow's map of cholera deaths in London in 1854.

Snow showed each death as a black mark at the location where the death occurred. In doing so, he plotted three variables (the number of deaths, as well as two coordinates for each location) in a single graph without any ornaments or flourishes. The simplicity of his presentation focuses the viewer's attention on his main point, which was that the deaths were centered around the Broad Street pump.

In 1869, a French civil engineer named Charles Joseph Minard created what is still considered one of the greatest graphs of all time. It shows the decimation of Napoleon's army during its retreat from Moscow. In 1812, Napoleon had set out to conquer Russia, with over 400,000 men in his army. They did reach Moscow, but were plagued by losses along the way; the Russian army kept retreating farther and farther into Russia, deliberately burning fields and destroying villages as it retreated. This left the French army without food or shelter as the brutal Russian winter began to set in. The French army turned back without a decisive victory in Moscow. The weather got colder, and more men died. Only 10,000 returned.

The graph is drawn over a map of eastern Europe. It starts at the Polish-Russian border at the left end. The light brown band represents Napoleon's army marching towards Moscow, and the black band represents the army returning. At each point of the graph, the width of the band is proportional to the number of soldiers in the army. At the bottom of the graph, Minard includes the temperatures on the return journey.

Notice how narrow the black band becomes as the army heads back. The crossing of the Berezina river was particularly devastating; can you spot it on the graph?

The graph is remarkable for its simplicity and power. In a single graph, Minard shows six variables:

the number of soldiers
the direction of the march
the two coordinates of location
the temperature on the return journey
the location on specific dates in November and December


Edward Tufte, Professor at Yale and one of the world's experts on visualizing quantitative information, says that Minard's graph is "probably the best statistical graphic ever drawn."

The technology of our times allows us to include animation and color. Used judiciously, without excess, these can be extremely informative, as in this animation by Gapminder.org of annual carbon dioxide emissions in the world over the past two centuries.

A caption says, "Click Play to see how USA becomes the largest emitter of CO2 from 1900 onwards." The graph of total emissions shows that China has higher emissions than the United States. However, in the graph of per capita emissions, the US is higher than China, because China's population is much larger than that of the United States.

Technology can be a help as well as a hindrance. Sometimes, the ability to create a fancy picture leads to a lack of clarity in what is being displayed. Inaccurate representation of numerical information, in particular, can lead to misleading messages.

Here, from the Washington Post in the early 1980s, is a graph that attempts to compare the earnings of doctors with the earnings of other professionals over a few decades. Do we really need to see two heads (one with a stethoscope) on each bar? Tufte coined the term "chartjunk" for such unnecessary embellishments. He also deplores the "low data-to-ink ratio" which this graph unfortunately possesses.


Most importantly, the horizontal axis of the graph is not drawn to scale. This has a significant effect on the shape of the bar graphs. When drawn to scale and shorn of decoration, the graphs reveal trends that are quite different from the apparently linear growth in the original. The elegant graph below is due to Ross Ihaka, one of the originators of the statistical system R.


Bar Charts

Categorical Data and Bar Charts

Now that we have examined several graphics produced by others, it is time to produce some of our own. We will start with bar charts, a type of graph with which you might already be familiar. A bar chart shows the distribution of a categorical variable, that is, a variable whose values are categories. In human populations, examples of categorical variables include gender, ethnicity, marital status, country of citizenship, and so on.

A bar chart consists of a sequence of rectangular bars, one corresponding to each category. The length of each bar is proportional to the number of entries in the corresponding category.

The Kaiser Family Foundation has compiled Census data on the distribution of race and ethnicity in the United States.

http://kff.org/other/state-indicator/distribution-by-raceethnicity/

http://kff.org/other/state-indicator/children-by-raceethnicity/

Here are some of the data, arranged by state. The table children contains data for people who were younger than 18 years old in 2014; everyone contains the data for people of all ages in 2014. Some missing values in these tables are represented as 0.

everyone=Table.read_table('kaiser_ethnicity_everyone.csv')

children=Table.read_table('kaiser_ethnicity_children.csv')

children.set_format(2,NumberFormatter)

everyone.set_format(2,NumberFormatter)


State Race/Ethnicity Population

Alabama White 3,167,600

Alabama Black 1,269,200

Alabama Hispanic 191,000

Alabama Asian 77,300

Alabama AmericanIndian/AlaskaNative 0

Alabama TwoOrMoreRaces 56,100

Alabama Total 4,768,000

Alaska White 396,400

Alaska Black 17,000

Alaska Hispanic 59,200

...(347rowsomitted)

The method barh takes two arguments, a column containing categories and a column containing numeric quantities. It generates a horizontal bar chart. This chart focuses on California.

everyone.where('State', 'California').barh('Race/Ethnicity', 'Population')

The table of children also contains information about California, although there are fewer categories of race/ethnicity.

children.where('State','California')


State Race/Ethnicity Population

California White 2,786,000

California Black 484,200

California Hispanic 4,849,400

California Other 1,590,100

California Total 9,709,700

Again, we can display these data in a horizontal bar chart. The barh method can take column indices as well as labels. This chart looks similar, but involves fewer categories because the Kaiser Family Foundation only provides these data about children.

children.where('State','California').barh(1,2)

It's not obvious how to compare these two charts. To help us draw conclusions, we will combine these two tables into one, which contains both everyone and children as separate columns. This transformation will require several steps. First, to ensure that values are comparable, we will divide each population by the total for the state to compute state proportions.

totals = everyone.join('State', everyone.where(1, 'Total').relabeled(2, 'State Total').select([0, 2]))

totals


State Race/Ethnicity Population StateTotal

Alabama White 3167600 4768000

Alabama Black 1269200 4768000

Alabama Hispanic 191000 4768000

Alabama Asian 77300 4768000

Alabama AmericanIndian/AlaskaNative 0 4768000

Alabama TwoOrMoreRaces 56100 4768000

Alabama Total 4768000 4768000

Alaska White 396400 695700

Alaska Black 17000 695700

Alaska Hispanic 59200 695700

...(347rowsomitted)

proportions = everyone.select([0, 1]).with_column('Proportion', totals.column(2) / totals.column(3))

proportions.set_format(2,PercentFormatter)

State Race/Ethnicity Proportion

Alabama White 66.43%

Alabama Black 26.62%

Alabama Hispanic 4.01%

Alabama Asian 1.62%

Alabama AmericanIndian/AlaskaNative 0.00%

Alabama TwoOrMoreRaces 1.18%

Alabama Total 100.00%

Alaska White 56.98%

Alaska Black 2.44%

Alaska Hispanic 8.51%

...(347rowsomitted)

This same transformation can be applied to the children table. Below, we chain together several table methods in order to express the transformation in a single long line. Typically, it's best to express transformations in pieces (above), but programming languages are quite flexible in allowing programmers to combine expressions (below).

children_proportions = children.select([0, 1]).with_column(
    'Children Proportion',
    children.column(2) / children.join('State', children.where(1, 'Total').relabeled(2, 'State Total').select([0, 2])).column(3))

children_proportions.set_format(2,PercentFormatter)

State Race/Ethnicity ChildrenProportion

Alabama White 59.99%

Alabama Black 30.50%

Alabama Hispanic 6.41%

Alabama Other 3.09%

Alabama Total 100.00%

Alaska White 46.00%

Alaska Black 2.60%

Alaska Hispanic 11.99%

Alaska Other 39.36%

Alaska Total 100.00%

...(245rowsomitted)

The following function takes the name of a state. It joins together the proportions for everyone and children into a single table. When the join method combines two columns with different values, only the matching elements are retained. In this case, any race/ethnicity not represented in both tables will be excluded.

def compare(state):
    prop_for_state = proportions.where('State', state).drop(0)
    children_prop_for_state = children_proportions.where('State', state).drop(0)
    joined = prop_for_state.join('Race/Ethnicity', children_prop_for_state)
    joined.set_format([1, 2], PercentFormatter)
    return joined.sort('Proportion')

compare('California')


Race/Ethnicity Proportion ChildrenProportion

Black 5.34% 4.99%

Hispanic 38.21% 49.94%

White 39.20% 28.69%

Total 100.00% 100.00%

The barh method can display the data from multiple columns on a single chart. When called with only one argument indicating the column of categories, all other columns are displayed as sets of bars.

compare('California').barh('Race/Ethnicity')

This chart shows the changing composition of the state of California. In 2014, fewer than 40% of Californians were Hispanic. But 50% of the children were, which is the crucial ingredient in predictions that Hispanics will eventually be a majority in California.

What about other states? Our function allows quick visual comparisons for any state.

compare('Minnesota').barh('Race/Ethnicity')


compare('Vermont').barh('Race/Ethnicity')


Histograms

Quantitative Data and Histograms

Many of the variables that data scientists study are quantitative. For instance, we can study the amount of revenue earned by movies in recent decades. Our source is the Internet Movie Database, an online database that contains information about movies, television shows, video games, and so on.

The table top consists of U.S.A.'s top grossing movies (http://www.boxofficemojo.com/alltime/adjusted.htm) of all time. The first column contains the title of the movie; Star Wars: The Force Awakens has the top rank, with a box office gross amount of more than 900 million dollars in the United States. The second column contains the name of the studio; the third contains the U.S. box office gross in dollars; the fourth contains the gross amount that would have been earned from ticket sales at 2016 prices; and the fifth contains the release year of the movie.

There are 200 movies on the list. Here are the top ten.

top=Table.read_table('top_movies.csv')

top.set_format([2,3],NumberFormatter)


Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars: The Force Awakens | Buena Vista (Disney) | 906,723,418 | 906,723,400 | 2015
Avatar | Fox | 760,507,625 | 846,120,800 | 2009
Titanic | Paramount | 658,672,302 | 1,178,627,900 | 1997
Jurassic World | Universal | 652,270,625 | 687,728,000 | 2015
Marvel's The Avengers | Buena Vista (Disney) | 623,357,910 | 668,866,600 | 2012
The Dark Knight | Warner Bros. | 534,858,444 | 647,761,600 | 2008
Star Wars: Episode I - The Phantom Menace | Fox | 474,544,677 | 785,715,000 | 1999
Star Wars | Fox | 460,998,007 | 1,549,640,500 | 1977
Avengers: Age of Ultron | Buena Vista (Disney) | 459,005,868 | 465,684,200 | 2015
The Dark Knight Rises | Warner Bros. | 448,139,099 | 500,961,700 | 2012

... (190 rows omitted)

Three-digit numbers are easier to work with than nine-digit numbers. So, we will build a table called millions that contains the adjusted gross box office revenue in millions of dollars.

millions = top.select(0).with_column('Gross', np.round(top.column(3) / 1e6, 2))
millions


Title | Gross
Star Wars: The Force Awakens | 906.72
Avatar | 846.12
Titanic | 1178.63
Jurassic World | 687.73
Marvel's The Avengers | 668.87
The Dark Knight | 647.76
Star Wars: Episode I - The Phantom Menace | 785.72
Star Wars | 1549.64
Avengers: Age of Ultron | 465.68
The Dark Knight Rises | 500.96

... (190 rows omitted)

The hist method generates a histogram of the values in a column. The optional unit argument is used to label the axes.

millions.hist('Gross', unit="Million Dollars")

The figure above shows the distribution of the amounts grossed, in millions of dollars. The amounts have been grouped into contiguous intervals called bins. Although in this dataset no movie grossed an amount that is exactly on the edge between two bins, it is worth noting that hist has an endpoint convention: bins include the data at their left endpoint, but not the data at their right endpoint. Sometimes, adjustments have to be made in the first or last bin, to ensure that the smallest and largest values of the variable are included. You saw an example of such an adjustment in the Census data used in the Tables section, where an age of "100" years actually meant "100 years old or older."

We can see that there are 10 bins (some bars are so low that they are hard to see), and that they all have the same width. We can also see that the list contains no movie that grossed less than 300 million dollars; that is because we are considering only the top grossing movies of all time. It is a little harder to see exactly where the edges of the bins are placed. For example, it is not clear exactly where the value 500 lies on the horizontal axis, and so it is hard to judge exactly where the first bar ends and the second begins.

The optional argument bins can be used with hist to specify the edges of the bars. It must consist of a sequence of numbers that includes the left end of the first bar and the right end of the last bar. As the highest adjusted gross amount is somewhat over 1,750 on this scale, we will start by setting bins to be the array consisting of the numbers 300, 400, 500, and so on, ending with 2000.

millions.hist('Gross', unit="Million Dollars", bins=np.arange(300, 2001, 100))

This figure is easier to read. On the horizontal axis, the labels 200, 400, 600, and so on are centered at the corresponding values. The tallest bar is for movies that grossed between 300 million and 400 million (adjusted) dollars; the bar for 400 to 500 million is about 5/8ths as high. The height of each bar is proportional to the number of movies that fall within the x-axis range of the bar.

A very small number grossed more than 700 million dollars. This results in the figure being "skewed to the right," or, less formally, having "a long right hand tail." Distributions of variables like income or rent often have this kind of shape.


The counts of values in different ranges can be computed from a table using the bin method, which takes a column label or index and an optional sequence or number of bins. The result is a tabular form of a histogram; it lists the counts of all values in the 'Gross' column that are greater than or equal to the indicated bin, but less than the next bin's starting value.

bins=millions.bin('Gross',bins=np.arange(300,2001,100))

bins

bin  | Gross count
300  | 81
400  | 52
500  | 28
600  | 16
700  | 7
800  | 5
900  | 3
1000 | 1
1100 | 3
1200 | 2

... (8 rows omitted)

The vertical axis of a histogram is the density scale. The height of each bar is the proportion of elements that fall into the corresponding bin, divided by the width of the bin. For example, the height of the bar for the bin from 300 to 400, which contains 81 of the 200 movies, is

    (81/200) / 100 = 0.00405, that is, 0.405% per million dollars.

starts=bins.column(0)

counts=bins.column(1)

counts.item(0)/sum(counts)/(starts.item(1)-starts.item(0))

0.0040500000000000006


The total area of this bar is 0.405, 100 times greater than its height because its width is 100. This area is the proportion of all values in the Gross column that fall into the 300-400 bin. The areas of all bars in a histogram sum to 1.

A histogram may contain bins of unequal size, but the areas of all bars will still sum to 1. Below, the values in the Gross column are binned into unequal categories.

millions.hist('Gross', unit="Million Dollars", bins=[300, 400, 600, 1500])

Although the ranges 300-400 and 400-600 have nearly identical counts, the former is twice as tall as the latter because it is only half as wide. The density of values in that range is twice as high. Histograms display in visual form where on the number line the values are most concentrated.

millions.bin('Gross',bins=[300,400,600,1500])

bin  | Gross count
300  | 81
400  | 80
600  | 37
1500 | 0
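The density heights in this unequal-bin figure can be checked by hand. Here is a minimal sketch in plain Python (not part of the original text), using the counts from the table above and normalizing over the binned movies:

```python
# Density scale with unequal bins: height = proportion / width.
bin_edges = [300, 400, 600, 1500]   # bin edges, in millions of dollars
counts = [81, 80, 37]               # movies in each bin, from the table above

total = sum(counts)
widths = [bin_edges[i + 1] - bin_edges[i] for i in range(len(counts))]
heights = [c / total / w for c, w in zip(counts, widths)]

# [300, 400) is about twice as tall as [400, 600): similar counts, half the width.
print(heights)

# Each bar's area (height * width) is the bin's proportion; the areas sum to 1.
areas = [h * w for h, w in zip(heights, widths)]
assert abs(sum(areas) - 1) < 1e-12
```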

Histogram of Counts. It is possible to display counts directly in a chart, using the normed=False option of the hist method. The resulting chart has the same shape when bins all have equal widths, but the bar areas do not sum to 1.


millions.hist('Gross',bins=np.arange(300,2001,100),normed=False)

While the count scale is perhaps more natural to interpret than the density scale, the chart becomes highly misleading when bins have different widths. Below, it appears (due to the count scale) that high-grossing movies are quite common, when in fact we have seen that they are relatively rare.

millions.hist('Gross', bins=[300, 400, 500, 600, 1500], normed=False)

Even though the method used is called hist, the figure above is NOT A HISTOGRAM. It gives the impression that movies are evenly distributed in the 300-600 range, but they are not. Moreover, it exaggerates the density of movies grossing more than 600 million. The height of each bar is simply the number of movies in the bin, plotted without accounting for the difference in the widths of the bins. The picture becomes even more skewed if the last two bins are combined.

millions.hist('Gross',bins=[300,400,1500],normed=False)

In this count-based chart, the shape of the distribution of movies is lost entirely.

So what is a histogram?

The figure above shows that what the eye perceives as "big" is area, not just height. This is particularly important when the bins are of different widths.

That is why a histogram has two defining properties:

1. The bins are contiguous (though some might be empty) and are drawn to scale.
2. The area of each bar is proportional to the number of entries in the bin.

Property 2 is the key to drawing a histogram, and is usually achieved as follows:

    area of bar = proportion of entries in the bin

When drawn using this method, the histogram is said to be drawn on the density scale, and the total area of the bars is equal to 1.

To calculate the height of each bar, use the fact that the bar is a rectangle:

    area of bar = height of bar x width of bin

and so

    height of bar = area of bar / width of bin = proportion of entries in the bin / width of bin


The level of detail, and the flat tops of the bars

Even though the density scale correctly represents proportions using area, some detail is lost by placing different values into the same bins.

Take another look at the [300, 400) bin in the figure below. The flat top of the bar, at a level near 0.4%, hides the fact that the movies are somewhat unevenly distributed across the bin.

millions.hist('Gross',bins=[300,400,600,1500])

To see this, let us split the [300, 400) bin into ten narrower bins of width 10 million dollars each:

millions.hist('Gross', bins=[300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400])


Some of the skinny bars are taller than 0.4% and others are shorter. By putting a flat top at 0.4% over the whole bin, we are deciding to ignore the finer detail and use the flat level as a rough approximation. Often, though not always, this is sufficient for understanding the general shape of the distribution.

Notice that because we have the entire dataset, we can draw the histogram in as fine a level of detail as the data and our patience will allow. However, if you are looking at a histogram in a book or on a website, and you don't have access to the underlying dataset, then it becomes important to have a clear understanding of the "rough approximation" of the flat tops.

The density scale

The height of each bar is a proportion divided by a bin width. Thus, for this dataset, the values on the vertical axis are "proportions per million dollars." To understand this better, look again at the [300, 400) bin. The bin is 100 million dollars wide. So we can think of it as consisting of 100 narrow bins that are each 1 million dollars wide. The bar's height of roughly "0.004 per million dollars" means that in each of those 100 skinny bins of width 1 million dollars, the proportion of movies is roughly 0.004.

Thus the height of a histogram bar is a proportion per unit on the horizontal axis, and can be thought of as the density of entries per unit width.

millions.hist('Gross',bins=[300,350,400,450,1500])


Density Q&A

Look again at the histogram, and this time compare the [400, 450) bin with the [450, 1500) bin.

Q: Which has more movies in it?

A: The [450, 1500) bin. It has 92 movies, compared with 25 movies in the [400, 450) bin.

Q: Then why is the [450, 1500) bar so much shorter than the [400, 450) bar?

A: Because height represents density per unit width, not the number of movies in the bin. The [450, 1500) bin has more movies than the [400, 450) bin, but it is also a whole lot wider. So the density is much lower.

millions.bin('Gross',bins=[300,350,400,450,1500])

bin  | Gross count
300  | 32
350  | 49
400  | 25
450  | 92
1500 | 0

Bar chart or histogram?

Bar charts display the distributions of categorical variables. All the bars in a bar chart have the same width. The lengths (or heights, if the bars are drawn vertically) of the bars are proportional to the number of entries.


Histograms display the distributions of quantitative variables. The bars can have different widths. The areas of the bars are proportional to the number of entries.

Multiple histograms

In all the examples in this section, we have drawn a single histogram. However, if a data table contains several columns, then barh and hist can be used to draw several charts at once. We will cover this feature in a later section.


Chapter 3

Sampling

Sampling is a powerful computational technique in data analysis. It allows us to investigate both issues of probability and statistics. Probability is the study of the outcomes of random processes; processes that include chance. Statistics is just the opposite: it is the study of random processes given observations of their outcomes. Put another way, probability focuses on what will be observed given what we know about the world. Statistics focuses on what we can infer about the world given what we have observed. Random sampling is a central tool in both directions of inference.


Sampling

In this section, we will continue to use the top_movies.csv dataset.

top=Table.read_table('top_movies.csv')

top.set_format([2,3],NumberFormatter)

Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars: The Force Awakens | Buena Vista (Disney) | 906,723,418 | 906,723,400 | 2015
Avatar | Fox | 760,507,625 | 846,120,800 | 2009
Titanic | Paramount | 658,672,302 | 1,178,627,900 | 1997
Jurassic World | Universal | 652,270,625 | 687,728,000 | 2015
Marvel's The Avengers | Buena Vista (Disney) | 623,357,910 | 668,866,600 | 2012
The Dark Knight | Warner Bros. | 534,858,444 | 647,761,600 | 2008
Star Wars: Episode I - The Phantom Menace | Fox | 474,544,677 | 785,715,000 | 1999
Star Wars | Fox | 460,998,007 | 1,549,640,500 | 1977
Avengers: Age of Ultron | Buena Vista (Disney) | 459,005,868 | 465,684,200 | 2015
The Dark Knight Rises | Warner Bros. | 448,139,099 | 500,961,700 | 2012

... (190 rows omitted)

Sampling rows of a table

Deterministic Samples

When you simply specify which elements of a set you want to choose, without any chances involved, you create a deterministic sample.


A deterministic sample from the rows of a table can be constructed using the take method. Its argument is a sequence of integers, and it returns a table containing the corresponding rows of the original table.

The code below returns a table consisting of the rows indexed 3, 18, and 100 in the table top. Since the original table is sorted by rank, which begins at 1, the zero-indexed rows always have a rank that is one greater than the index. However, the take method does not inspect the information in the rows to construct its sample.

top.take([3,18,100])

Title | Studio | Gross | Gross (Adjusted) | Year
Jurassic World | Universal | 652,270,625 | 687,728,000 | 2015
Spider-Man | Sony | 403,706,375 | 604,517,300 | 2002
Gone with the Wind | MGM | 198,676,459 | 1,757,788,200 | 1939

We can select evenly spaced rows by calling np.arange and passing the result to take. In the example below, we start with the first row (index 0) of top, and choose every 40th row after that until we reach the end of the table.

top.take(np.arange(0,top.num_rows,40))

Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars: The Force Awakens | Buena Vista (Disney) | 906,723,418 | 906,723,400 | 2015
Shrek the Third | Paramount/Dreamworks | 322,719,944 | 408,090,600 | 2007
Bruce Almighty | Universal | 242,829,261 | 350,350,700 | 2003
Three Men and a Baby | Buena Vista (Disney) | 167,780,960 | 362,822,900 | 1987
Saturday Night Fever | Paramount | 94,213,184 | 353,261,200 | 1977

Probability Samples

Much of data science consists of making conclusions based on the data in random samples. Correctly interpreting analyses based on random samples requires data scientists to examine exactly what random samples are.


A population is the set of all elements from whom a sample will be drawn.

A probability sample is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.

In a probability sample, all elements need not have the same chance of being chosen. For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:

Person A is chosen with probability 1. One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.

This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:

A: 1
B: 1/2
C: 1/2
AB: 1/2
AC: 1/2
BC: 0
ABC: 0

Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.
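A short simulation can confirm these chances. This sketch (plain Python, not part of the original text) draws the sample many times and checks the frequency of the subset {A, B}:

```python
import random

# Simulate the scheme: A is always chosen, and a fair coin picks B or C.
random.seed(0)
trials = 100000
ab_count = 0
for _ in range(trials):
    sample = {'A', 'B'} if random.random() < 0.5 else {'A', 'C'}
    if sample == {'A', 'B'}:
        ab_count += 1

# The chance that the sample is exactly {A, B} is 1/2.
print(ab_count / trials)  # close to 0.5
```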

To draw a probability sample, we need to be able to choose elements according to a process that involves chance. A basic tool for this purpose is a random number generator. There are several in Python. Here we will use one that is part of the module random, which in turn is part of the module numpy.

The method randint, when given two arguments low and high, returns an integer picked uniformly at random between low and high, including low but excluding high. Run the code below several times to see the variability in the integers that are returned.

np.random.randint(3, 8)  # select once at random from 3, 4, 5, 6, 7

7

A Systematic Sample


Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a systematic sample.

Here we will choose a systematic sample of the rows of top. We will start by picking one of the first 10 rows at random, and then we will pick every 10th row after that.

"""Choosearandomstartamongrows0through9;

thentakeevery10throw."""

start=np.random.randint(0,10)

top.take(np.arange(start,top.num_rows,10))

Title | Studio | Gross | Gross (Adjusted) | Year
Star Wars | Fox | 460,998,007 | 1,549,640,500 | 1977
The Hunger Games | Lionsgate | 408,010,692 | 442,510,400 | 2012
The Passion of the Christ | NM | 370,782,930 | 519,432,100 | 2004
Alice in Wonderland (2010) | Buena Vista (Disney) | 334,191,110 | 365,718,600 | 2010
Star Wars: Episode II - Attack of the Clones | Fox | 310,676,740 | 465,175,700 | 2002
The Sixth Sense | Buena Vista (Disney) | 293,506,292 | 500,938,400 | 1999
The Hangover | Warner Bros. | 277,322,503 | 323,343,300 | 2009
Raiders of the Lost Ark | Paramount | 248,159,971 | 770,183,000 | 1981
The Lost World: Jurassic Park | Universal | 229,086,679 | 434,216,600 | 1997
Austin Powers: The Spy Who Shagged Me | New Line | 206,040,086 | 352,863,900 | 1999

... (10 rows omitted)

Run the code a few times to see how the output varies.

This systematic sample is a probability sample. In this scheme, all rows do have the same chance of being chosen. But that is not true of other subsets of the rows. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen. The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance 1/10.
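This reasoning can be checked with a quick simulation. The sketch below (plain Python, not from the text) repeats the random-start scheme and tracks how often one particular row is selected:

```python
import random

# A systematic sample is determined entirely by the random start:
# row r is selected exactly when start == r % 10, so its chance is 1/10.
random.seed(0)
trials = 100000
row = 137                      # an arbitrary row to track
hits = 0
for _ in range(trials):
    start = random.randrange(10)
    if row % 10 == start:
        hits += 1

print(hits / trials)  # close to 0.1
```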

Random sample with replacement

Some of the simplest probability samples are formed by drawing repeatedly, uniformly at random, from the list of elements of the population. If the draws are made without changing the list between draws, the sample is called a random sample with replacement. You can imagine making the first draw at random, replacing the element drawn, and then drawing again.

In a random sample with replacement, each element in the population has the same chance of being drawn, and each can be drawn more than once in the sample.

If you want to draw a sample of people at random to get some information about a population, you might not want to sample with replacement – drawing the same person more than once can lead to a loss of information. But in data science, random samples with replacement arise in two major areas:

studying probabilities by simulating tosses of a coin, rolls of a die, or gambling games

creating new samples from a sample at hand

The second of these areas will be covered later in the course. For now, let us study some long-run properties of probabilities.

We will start with the table die, which contains the numbers of spots on the faces of a die. All the numbers appear exactly once, as we are assuming that the die is fair.

die = Table().with_column('Face', [1, 2, 3, 4, 5, 6])
die

Face

1

2

3

4

5

6

Drawing the histogram of this simple set of numbers yields an unsettling figure as hist makes a default choice of bins:


die.hist()

The numbers 1, 2, 3, 4, 5, 6 are integers, so the bins chosen by hist only have entries at the edges. In such a situation, it is a better idea to select bins so that they are centered on the integers. This is often true of histograms of data that are discrete, that is, variables whose successive values are separated by the same fixed amount. The real advantage of this method of bin selection will become more clear when we start imagining smooth curves drawn over histograms.

dice_bins = np.arange(0.5, 7, 1)
die.hist(bins=dice_bins)

Notice how each bin has width 1 and is centered on an integer. Notice also that because the width of each bin is 1, the height of each bar is 1/6, the chance that the corresponding face appears.


The histogram shows the probability with which each face appears. It is called a probability histogram of the result of one roll of a die.

The histogram was drawn without rolling any dice or generating any random numbers. We will now use the computer to mimic actually rolling a die. The process of using a computer program to produce the results of a chance experiment is called simulation.

To roll the die, we will use a method called sample. This method returns a new table consisting of rows selected uniformly at random from a table. Its first argument is the number of rows to be returned. Its second argument is whether or not the sampling should be done with replacement.

The code below simulates 10 rolls of the die. As with all simulations of chance experiments, you should run the code several times and notice the variability in what is returned.

die.sample(10,with_replacement=True)

Face

5

4

2

3

6

5

1

1

6

2

We will now roll the die several times and draw the histogram of the observed results. The histogram of observed results is called an empirical histogram.

The results of rolling the die will all be integers in the range 1 through 6, so we will want to use the same bins as we used for the probability histogram. To avoid writing out the same bin argument every time we draw a histogram, let us define a function called dice_hist that will perform the task for us. The function will take one argument: the number n of dice to roll.


def dice_hist(n):
    rolls = die.sample(n, with_replacement=True)
    rolls.hist(bins=dice_bins)

dice_hist(20)

As we increase the number of rolls in the simulation, the proportions get closer to 1/6.

dice_hist(10000)

The behavior we have observed is an instance of a general rule.

The Law of Averages


If a chance experiment is repeated independently and under identical conditions, then, in the long run, the proportion of times that an event occurs gets closer and closer to the theoretical probability of the event.

For example, in the long run, the proportion of times the face with four spots appears gets closer and closer to 1/6.

Here "independently and under identical conditions" means that every repetition is performed in the same way regardless of the results of all the other repetitions.
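The rule can be illustrated with a quick simulation that uses Python's built-in random module rather than the Table methods above (a sketch, not code from the text):

```python
import random

# Law of Averages sketch: the proportion of fours among n rolls of a
# fair die settles down near 1/6 as n grows.
random.seed(0)
for n in [100, 10000, 1000000]:
    fours = sum(1 for _ in range(n) if random.randint(1, 6) == 4)
    print(n, fours / n)
```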

Convergence of empirical histograms

We have also observed that a random quantity (such as the number of spots on one roll of a die) is associated with two histograms:

a probability histogram, that shows all the possible values of the quantity and all their chances

an empirical histogram, created by simulating the random quantity repeatedly and drawing a histogram of the observed results

We have seen an example of the long-run behavior of empirical histograms:

As the number of repetitions increases, the empirical histogram of a random quantity looks more and more like the probability histogram.

At the Roulette Table

Equipped with our new knowledge about the long-run behavior of chances, let us explore a gambling game. Betting on roulette is popular in gambling centers such as Las Vegas and Monte Carlo, and we will simulate one of the bets here.

The main randomizer in roulette in Nevada is a wheel that has 38 pockets on its rim. Two of the pockets are green, eighteen black, and eighteen red. The wheel is on a spindle, and there is a small ball on it. When the wheel is spun, the ball ricochets around and finally comes to rest in one of the pockets. That is declared to be the winning pocket.

You are allowed to bet on several pre-specified collections of pockets. If you bet on "red," you win if the ball comes to rest in one of the red pockets.

The bet on red pays even money. That is, it pays 1 to 1. To understand what that means, assume you are going to bet $1 on "red." The first thing that happens, even before the wheel is spun, is that you have to hand over your $1. If the ball lands in a green or black pocket, you never see that dollar again. If the ball lands in a red pocket, you get your dollar back (to bring you back to even), plus another $1 in winnings.

The table wheel represents the pockets of a Nevada roulette wheel. It has 38 rows labeled 1 through 38, one row per pocket.

pockets = np.arange(1, 39)
colors = (['red', 'black'] * 5 + ['black', 'red'] * 4) * 2 + ['green', 'green']
wheel = Table().with_columns([
    'pocket', pockets,
    'color', colors])
wheel

pocket color

1 red

2 black

3 red

4 black

5 red

6 black

7 red

8 black

9 red

10 black

...(28rowsomitted)

The function bet_on_red takes a numerical argument x and returns the net winnings on a $1 bet on "red," provided x is the number of a pocket.

def bet_on_red(x):
    """The net winnings of betting on red for outcome x."""
    pockets = wheel.where('pocket', x)
    if pockets.column('color').item(0) == 'red':
        return 1
    else:
        return -1


bet_on_red(17)

-1

The function spins takes a numerical argument n and returns a new table consisting of n rows of wheel sampled at random with replacement. In other words, it simulates the results of n spins of the roulette wheel.

def spins(n):
    return wheel.sample(n, with_replacement=True)

We will create a table of the results of 500 spins, and add a column that shows the net winnings on a $1 bet placed on "red." Recall that the apply method applies a function to each element in a column of a table.

outcomes = spins(500)
results = outcomes.with_column(
    'winnings', outcomes.apply(bet_on_red, 'pocket'))
results

pocket color winnings

13 black -1

22 black -1

38 green -1

11 black -1

29 black -1

19 red 1

38 green -1

37 green -1

19 red 1

31 black -1

...(490rowsomitted)

And here is the net gain on all 500 bets:


sum(results.column('winnings'))

-32

Betting $1 on red hundreds of times seems like a bad idea from a gambler's perspective. But from the casinos' perspective it is excellent. Casinos rely on large numbers of bets being placed. The payoff odds are set so that the more bets that are placed, the more money the casinos are likely to make, even though a few people are likely to go home with winnings.
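The casino's edge can also be computed directly from the pocket counts (18 red, 20 not red). This small check is not from the text, but it agrees with the simulated net loss above:

```python
from fractions import Fraction

# Expected net gain of a $1 bet on red: win $1 with chance 18/38,
# lose $1 with chance 20/38.
p_red = Fraction(18, 38)
expected = p_red * 1 + (1 - p_red) * (-1)

print(expected)  # -1/19: about 5.3 cents lost per dollar bet, on average
assert expected == Fraction(-1, 19)

# Over 500 bets, the expected net loss is about $26.
print(float(500 * expected))
```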

Simple Random Sample – a Random Sample without Replacement

A random sample without replacement is one in which elements are drawn from a list repeatedly, uniformly at random, at each stage deleting from the list the element that was drawn.

A random sample without replacement is also called a simple random sample. All elements of the population have the same chance of entering a simple random sample. All pairs have the same chance as each other, as do all triples, and so on.

The default action of sample is to draw without replacement. In card games, cards are almost always dealt without replacement. Let us use sample to deal cards from a deck.

A standard deck consists of 13 ranks of cards in each of four suits. The suits are called spades, clubs, diamonds, and hearts. The ranks are Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, and King. Spades and clubs are black; diamonds and hearts are red. The Jacks, Queens, and Kings are called 'face cards.'

The table deck contains all 52 cards, one row per card. The abbreviations for the ranks are:

Ace: A
Jack: J
Queen: Q
King: K

Below, we use the product function to produce all possible pairs of a suit and a rank. The suits are represented by special characters used to print bridge and poker articles in newspapers.


from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['♠', '♥', '♦', '♣']
cards = product(ranks, suits)
deck = Table(['rank', 'suit']).with_rows(cards)
deck

rank suit

A ♠

A ♥

A ♦

A ♣

2 ♠

2 ♥

2 ♦

2 ♣

3 ♠

3 ♥

...(42rowsomitted)

A poker hand is 5 cards dealt at random from the deck. The expression below deals a hand. Execute the cell a few times; how often are you dealt a pair?

deck.sample(5)

rank suit

5 ♥

4 ♣

9 ♥

3 ♣

10 ♣


A hand is a set of five cards, regardless of the order in which they appeared. For example, the hand 9♣ 9♥ Q♣ 7♣ 8♥ is the same as the hand 7♣ 9♣ 8♥ 9♥ Q♣.

This fact can be used to show that simple random sampling can be thought of in two equivalent ways:

drawing elements one by one at random without replacement

randomly permuting (that is, shuffling) the whole list, and then pulling out a set of elements at the same time
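The two views above can be sketched with Python's standard random module. This example uses a list of 52 numbers standing in for the deck; it is an illustration, not the text's Table code:

```python
import random

random.seed(0)
population = list(range(52))   # stand-ins for the 52 cards

# Way 1: draw 5 elements one by one at random, without replacement.
one_by_one = random.sample(population, 5)

# Way 2: shuffle the whole list, then pull out the first 5 at once.
shuffled = population[:]
random.shuffle(shuffled)
first_five = shuffled[:5]

# Either way, the result is 5 distinct elements of the population,
# and every set of 5 is equally likely.
assert len(set(one_by_one)) == 5
assert len(set(first_five)) == 5
```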


Iteration

Repetition

It is often the case when programming that you will wish to repeat the same operation multiple times, perhaps with slightly different behavior each time. You could copy-paste the code 10 times, but that's tedious and prone to typos, and if you wanted to do it a thousand times (or a million times), forget it.

A better solution is to use a for statement to loop over the contents of a sequence. A for statement begins with the word for, followed by a name for the item in the sequence, followed by the word in, and ending with an expression that evaluates to a sequence. The indented body of the for statement is executed once for each item in that sequence.

for i in np.arange(5):
    print(i)

0

1

2

3

4

A typical use of a for statement is to build up a table by repeating a random computation many times and storing each result as a new row. The append method of a table takes a sequence and adds a new row. It's different from with_row because a new table is not created; instead, the original table is extended. The cell below draws 100 cards, but keeps only the aces.


aces = Table(['Rank', 'Suit'])
for i in np.arange(100):
    card = deck.row(np.random.randint(deck.num_rows))
    if card.item(0) == 'A':
        aces.append(card)
aces

Rank Suit

A ♠

A ♣

A ♥

A ♦

A ♠

A ♦

This pattern can be used to track the results of repeated experiments. For example, perhaps we want to learn about the empirical properties of some randomly drawn poker hands. Below, we track whether the hand contains four-of-a-kind or five cards of the same suit (called a flush).

hands = Table(['Four-of-a-kind', 'Flush'])
for i in np.arange(10000):
    hand = deck.sample(5)
    four_of_a_kind = max(hand.group('rank').column('count')) == 4
    flush = max(hand.group('suit').column('count')) == 5
    hands.append([four_of_a_kind, flush])
hands


Four-of-a-kind Flush

False False

False False

False False

False False

False False

False False

False False

False False

False False

False False

...(9990rowsomitted)

A for statement can also iterate over a sequence of labels. We can use this feature to summarize the results of our experiment. These are rare hands indeed!

for label in hands.labels:
    success = np.count_nonzero(hands.column(label))
    print('A', label, 'was drawn', success, 'of', hands.num_rows, 'times')

A Four-of-a-kind was drawn 2 of 10000 times

A Flush was drawn 22 of 10000 times
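For comparison, the exact chances can be computed by counting hands. This sketch (using the standard-library math.comb, not code from the text) agrees well with the simulated counts above:

```python
from math import comb

total_hands = comb(52, 5)          # number of possible 5-card hands

# Four of a kind: pick the rank (13 ways), then 1 of the other 48 cards.
four_of_a_kind = 13 * 48 / total_hands

# Flush: pick the suit (4 ways), then 5 of its 13 cards.
# (This count includes straight flushes, matching the text's definition.)
flush = 4 * comb(13, 5) / total_hands

print(four_of_a_kind)  # about 0.00024
print(flush)           # about 0.0020
```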

Randomized response

Next, we'll look at a technique that was designed several decades ago to help conduct surveys of sensitive subjects. Researchers wanted to ask participants a few questions: Have you ever had an affair? Do you secretly think you are gay? Have you ever shoplifted? Have you ever sung a Justin Bieber song in the shower? They figured that some people might not respond honestly, because of the social stigma associated with answering "yes". So, they came up with a clever way to estimate the fraction of the population who are in the "yes" camp, without violating anyone's privacy.


Here's the idea. We'll instruct the respondent to roll a fair 6-sided die, secretly, where no one else can see it. If the die comes up 1, 2, 3, or 4, then the respondent is supposed to answer honestly. If it comes up 5 or 6, the respondent is supposed to answer the opposite of what their true answer would be. But they shouldn't reveal what came up on their die.

Notice how clever this is. Even if the person says "yes", that doesn't necessarily mean their true answer is "yes" -- they might very well have just rolled a 5 or 6. So the responses to the survey don't reveal any one individual's true answer. Yet, in aggregate, the responses give enough information that we can get a pretty good estimate of the fraction of people whose true answer is "yes".

Let's try a simulation, so we can see how this works. We'll write some code to perform this operation. First, a function to simulate rolling one die:

def roll_once():
    return np.random.randint(1, 7)

Now we'll use this to write a function to simulate how someone is supposed to respond to the survey. The argument to the function is their true answer (True or False); the function returns what they're supposed to tell the interviewer.

def respond(true_answer):
    if roll_once() >= 5:
        return not true_answer
    else:
        return true_answer

We can try it. Assume our true answer is 'no'; let's see what happens this time:

respond(False)

False

Of course, if you were to run it many times, you might get a different result each time. Below, we build a table of the responses for many respondents when the true answer is always False.


responses = Table(['Truth', 'Response'])
for i in np.arange(1000):
    responses.append([False, respond(False)])
responses

Truth Response

False False

False False

False False

False False

False False

False False

False True

False False

False True

False False

...(990rowsomitted)

Let's build a bar chart and look at how many True and False responses we get.

responses.group('Response').barh('Response')

responses.where('Response',False).num_rows


656

responses.where('Response',True).num_rows

344

Exercise for you: If N out of 1000 responses are True, approximately what fraction of the population has truly sung a Justin Bieber song in the shower?

Analysis

This method is called "randomized response". It is one way to poll people about sensitive subjects, while still protecting their privacy. You can see how it is a nice example of randomness at work.

It turns out that randomized response has beautiful generalizations. For instance, your Chrome web browser uses it to anonymously report feedback to Google, in a way that won't violate your privacy. That's all we'll say about it for this semester, but if you take an upper-division course, maybe you'll get to see some generalizations of this beautiful technique.

The steps in the randomized response survey can be visualized using a tree diagram. The diagram partitions all the survey respondents according to their true answer and the answer that they eventually give. It also displays the proportions of respondents whose true answers are 1 ("True") and 0 ("False"), as well as the chances that determine the answers that they give. We use p to denote the proportion whose true answer is 1.


The respondents who answer 1 split into two groups. The first group consists of the respondents whose true answer and given answer are both 1. If the number of respondents is large, the proportion in this group is likely to be about 2/3 of p. The second group consists of the respondents whose true answer is 0 and given answer is 1. The proportion in this group is likely to be about 1/3 of 1 - p.

We can observe p̂, the proportion of 1's among the given answers. Thus

    p̂ ≈ (2/3)p + (1/3)(1 - p)

This allows us to solve for an approximate value of p:

    p ≈ 3p̂ - 1

In this way we can use the observed proportion of 1's to "work backwards" and get an estimate of p, the proportion in which we are interested.
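As a quick sanity check of the formula, with a hypothetical true proportion (not data from the survey):

```python
# If the true proportion is p, the expected proportion of "1" answers is
# p_hat = (2/3) * p + (1/3) * (1 - p). Inverting gives p = 3 * p_hat - 1.
p_true = 0.2                                  # hypothetical value
p_hat = (2/3) * p_true + (1/3) * (1 - p_true)
estimate = 3 * p_hat - 1

assert abs(estimate - p_true) < 1e-12
```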

Technical note. It is worth noting the conditions under which this estimate is valid. The calculation of the proportions in the two groups whose given answer is 1 relies on each of the groups being large enough so that the Law of Averages allows us to make estimates about how their dice are going to land. This means that it is not only the total number of respondents that has to be large – the number of respondents whose true answer is 1 has to be large, as does the number whose true answer is 0. For this to happen, p must be neither close to 0 nor close to 1. If the characteristic of interest is either extremely rare or extremely common in the population, the method of randomized response described in this example might not work well.

Let's try out this method on some real data. The chance of drawing a poker hand with no aces is

np.product(np.arange(48, 43, -1) / np.arange(52, 47, -1))

0.65884199833779666

It is quite embarrassing indeed to draw a hand with no aces. The table below contains one column for the truth of whether a hand has no aces and another for the randomized response.

ace_responses = Table(['Truth', 'Response'])

for i in np.arange(10000):
    hand = deck.sample(5)
    no_aces = hand.where('rank', 'A').num_rows == 0
    ace_responses.append([no_aces, respond(no_aces)])

ace_responses


Truth  Response
False  True
True   True
False  True
True   False
True   False
True   True
False  False
True   False
False  False
True   True

... (9990 rows omitted)

Using our derived formula, we can estimate what fraction of hands have no aces.

3 * np.count_nonzero(ace_responses.column('Response')) / 10000 - 1

0.6644000000000001


Estimation

In the previous section we saw that two kinds of histograms can be associated with a quantity generated by a random process.

The probability histogram shows all the possible values of the quantity and the probabilities of all those values. The probabilities are calculated using math and the rules of chance.

The empirical histogram is based on running the random process repeatedly and keeping track of the value of the quantity each time. It shows all the values of the quantity and the proportion of times each value was observed among all the repetitions.

We noted that by the Law of Averages, the empirical histogram is likely to resemble the probability histogram if the random process is repeated a large number of times.

This means that simulating random processes repeatedly is a way of approximating probability distributions without figuring out the probabilities mathematically. Thus computer simulations become a powerful tool in data science. They can help data scientists understand the properties of random quantities that would be complicated to analyze using math alone.

Here is an example of such a simulation.

Estimating the number of enemy aircraft

In World War II, data analysts working for the Allies were tasked with estimating the number of German warplanes. The data included the serial numbers of the German planes that had been observed by Allied forces. These serial numbers gave the data analysts a way to come up with an answer.

To create an estimate of the total number of warplanes, the data analysts had to make some assumptions about the serial numbers. Here are two such assumptions, greatly simplified to make our calculations easier.

1. There are N planes, numbered 1, 2, ..., N.

2. The observed planes are drawn uniformly at random with replacement from the N planes.

The goal is to estimate the number N.


Suppose you observe some planes and note down their serial numbers. How might you use the data to guess the value of N? A natural and straightforward method would be to simply use the largest serial number observed.

Let us see how well this method of estimation works. First, another simplification: Some historians now estimate that the German aircraft industry produced almost 100,000 warplanes of many different kinds, but here we will imagine just one kind. That makes Assumption 1 above easier to justify.

Suppose there are in fact N = 300 planes of this kind, and that you observe 30 of them. We can construct a table called serialno that contains the serial numbers 1 through N. We can then sample 30 times with replacement (see Assumption 2) to get our sample of serial numbers. Our estimate is the maximum of these 30 numbers.

N = 300

serialno = Table().with_column('serial number', np.arange(1, N+1))

serialno

serial number
1
2
3
4
5
6
7
8
9
10

... (290 rows omitted)

max(serialno.sample(30, with_replacement=True).column(0))

296


As with all code involving random sampling, run it a few times to see the variation. You will observe that even with just 30 observations from among 300, the largest serial number is typically in the 250-300 range.

In principle, the largest serial number could be as small as 1, if you were unlucky enough to see Plane Number 1 all 30 times. And it could be as large as 300 if you observe Plane Number 300 at least once. But usually, it seems to be in the very high 200's. It appears that if you use the largest observed serial number as your estimate of the total, you will not be very far wrong.

Let us generate some data to see if we can confirm this. We will use iteration to repeat the sampling procedure numerous times, each time noting the largest serial number observed. These would be our estimates of N from all the numerous samples. We will then draw a histogram of all these estimates, and examine by how much they differ from N.

In the code below, we will run 750 repetitions of the following process: Sample 30 times at random with replacement from 1 through 300 and note the largest number observed.

To do this, we will use a for loop. As you have seen before, we will start by setting up an empty table that will eventually hold all the estimates that are generated. As each estimate is the largest number in its sample, we will call this table maxes.

For each integer (called i in the code) in the range 0 through 749 (750 total), the for loop executes the code in the body of the loop. In this example, it generates a random sample of 30 serial numbers, computes the maximum value, and augments the rows of maxes with that value.

sample_size = 30

repetitions = 750

maxes = Table(['i', 'max'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    maxes.append([i, max(sample.column(0))])

maxes


i  max
0  300
1  300
2  291
3  300
4  297
5  295
6  283
7  296
8  299
9  273

... (740 rows omitted)

Here is a histogram of the 750 estimates. As you can see, the estimates are all crowded up near 300, even though in theory they could be much smaller. The histogram indicates that as an estimate of the total number of planes, the largest serial number might be too low by about 10 to 25. But it is extremely unlikely to underestimate the true number of planes by more than about 50.

every_ten = np.arange(1, N+100, 10)

maxes.hist('max', bins=every_ten)

The empirical distribution of a statistic


In the example above, the largest serial number is called a statistic (singular!). A statistic is any number computed using the data in a sample.

The graph above is an empirical histogram. It displays the empirical or observed distribution of the statistic, based on 750 samples.

The statistic could have been different.

A fundamental consideration in using any statistic based on a random sample is that the sample could have come out differently, and therefore the statistic could have come out differently too.

Just how different could the statistic have been?

If you generate all of the possible samples, and compute the statistic for each of them, then you will have a clear picture of how different the statistic might have been. Indeed, you will have a complete enumeration of all the possible values of the statistic and all their probabilities. The resulting distribution is called the probability distribution of the statistic, and its histogram is the same as the probability histogram.

The probability distribution of a statistic is also called the sampling distribution of the statistic, because it is based on all of the possible samples.

The total number of possible samples is often very large. For example, the total number of possible sequences of 30 serial numbers that you could see if there were 300 aircraft is

300 ** 30

205891132094649000000000000000000000000000000000000000000000000000000000000

That's a lot of samples. Fortunately, we don't have to generate all of them. We know that the empirical histogram of the statistic, based on many but not all of the possible samples, is a good approximation to the probability histogram. The empirical distribution of a statistic gives a good idea of how different the statistic could be.

It is true that the probability distribution of a statistic contains more accurate information about the statistic than an empirical distribution does. But often, as in this example, the approximation provided by the empirical distribution is sufficient for data scientists to understand how much a statistic can vary. And empirical distributions are easier to compute. Therefore, data scientists often use empirical distributions instead of exact probability distributions when they are trying to understand random quantities.

Another Estimate.


Here is an example to illustrate this point. Thus far, we have used the largest observed serial number as an estimate of the total number of planes. But there are other possible estimates, and we will now consider one of them.

The idea underlying this estimate is that the average of the observed serial numbers is likely to be about halfway between 1 and N. Thus, if A is the average, then

$$A ~\approx~ \frac{N}{2}, ~~~ \mbox{that is,} ~~~ N ~\approx~ 2A$$

Thus a new statistic can be used to estimate the total number of planes: take the average of the observed serial numbers and double it.

How does this method of estimation compare with using the largest number observed? It is not easy to calculate the probability distribution of the new statistic. But as before, we can simulate it to get the probabilities approximately. Let's take a look at the empirical distribution based on repeated sampling. The number of repetitions is chosen to be the same as it was for the earlier estimate. This will allow us to compare the two empirical distributions.

new_est = Table(['i', '2*avg'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, 2 * np.average(sample.column(0))])

new_est.hist('2*avg', bins=every_ten)

Notice that unlike the earlier estimate, this one can overestimate the number of planes. This will happen when the average of the observed serial numbers is closer to N than to 1.


For ease of comparison, let us run another set of 750 repetitions of the sampling process and calculate both estimates for each sample.

new_est = Table(['i', 'max', '2*avg'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, max(sample.column(0)), 2 * np.average(sample.column(0))])

new_est.select(['max', '2*avg']).hist(bins=every_ten)

You can see that the old method almost always underestimates; formally, we say that it is biased. But it has low variability, and is highly likely to be close to the true total number of planes.

The new method overestimates about as often as it underestimates, and thus is roughly unbiased on average in the long run. However, it is more variable than the old estimate, and thus is prone to larger absolute errors.

This is an instance of a bias-variance tradeoff that is not uncommon among competing estimates. Which estimate you decide to use will depend on the kinds of errors that matter the most to you. In the case of enemy warplanes, underestimating the total number might have grim consequences, in which case you might choose to use the more variable method that overestimates about half the time. On the other hand, if overestimation leads to needlessly high costs of guarding against planes that don't exist, you might be satisfied with the method that underestimates by a modest amount.
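The tradeoff can be put into numbers with a NumPy-only sketch (separate from the chapter's `Table`-based code): simulate both statistics and compare their long-run bias and their variability, using the same N, sample size, and repetition count as in this section.

```python
import numpy as np

# Simulate both estimates of N and compare bias and spread.
rng = np.random.default_rng(0)
N, sample_size, repetitions = 300, 30, 750

samples = rng.integers(1, N + 1, size=(repetitions, sample_size))
max_est = samples.max(axis=1)        # old method: largest serial number
avg_est = 2 * samples.mean(axis=1)   # new method: twice the average

# bias (long-run average minus N) and SD of each estimate
print('max  ', round(max_est.mean() - N, 1), round(max_est.std(), 1))
print('2*avg', round(avg_est.mean() - N, 1), round(avg_est.std(), 1))
```

The max shows a clearly negative bias but a small SD; twice the average is nearly unbiased but has an SD several times larger.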


Technical note: In fact, twice the average is not unbiased. On average, it overestimates by exactly 1. For example, if N is 3, the average of draws from 1, 2, and 3 will be 2, and 2 times 2 is 4, which is one more than N. Twice the average minus 1 is an unbiased estimator of N.
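The technical note can be checked by simulation: with a large number of repetitions, the long-run average of "twice the sample average" should settle near N + 1, not N. This sketch uses the same illustrative N and sample size as above.

```python
import numpy as np

# Long-run average of 2 * (sample mean) for draws from 1..N with replacement.
rng = np.random.default_rng(0)
N, sample_size, repetitions = 300, 30, 100_000

samples = rng.integers(1, N + 1, size=(repetitions, sample_size))
twice_avg = 2 * samples.mean(axis=1)

print(round(twice_avg.mean(), 1))       # settles near N + 1 = 301
print(round(twice_avg.mean() - 1, 1))   # the unbiased version is near N
```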


Center

We have seen that studying the variability of an estimate involves questions such as:

Where is the empirical histogram centered? About how far away from center could the values be?

In order to discuss estimates and their variability in greater detail, it will help to quantify some basic features of distributions.

The mean of a list of numbers

In this course, we will use the words "average" and "mean" interchangeably.

Definition: The average or mean of a list of numbers is the sum of all the numbers in the list, divided by the number of entries in the list.

Here is the definition applied to a list of 20 numbers.

# The data: a list of numbers
x_list = [4, 4, 4, 5, 5, 5, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]

# The average
sum(x_list) / len(x_list)

3.15

The average of the list is 3.15.

You can think of taking the mean as an "equalizing" or "smoothing" operation. For example, imagine the entries in x_list being the amounts of money (in dollars) in the pockets of 20 different people. To get the mean, you first put all of the money into one big pot, and then divide it evenly among all 20 people. They had started out with different amounts of money in their pockets, but now each person has $3.15, the average amount.


Computation: We could instead have found the mean by applying the function np.mean to the list x_list. Differences in rounding lead to answers that differ very slightly.

np.mean(x_list)

3.1499999999999999

Or we could have placed the list in a table, and used the Table method mean:

x_table = Table().with_column('value', x_list)

x_table['value'].mean()

3.1499999999999999

The mean and the histogram

Another way to calculate the mean is to first arrange the list in a distribution table. The columns contain the distinct values in the list, the count of each value, and the corresponding proportion. The sum of the column count is 20, the number of entries in the list.

values = Table(['value', 'count']).with_rows([
    [2, 6],
    [3, 8],
    [4, 3],
    [5, 3]
])

counts = values.column('count')

x_dist = values.with_column('proportion', counts / sum(counts))

# the distribution table
x_dist


value  count  proportion
2      6      0.3
3      8      0.4
4      3      0.15
5      3      0.15

The sum of the numbers in the list can be calculated by multiplying the first two columns and adding up – six 2's, eight 3's, and so on – and then dividing by 20:

sum(x_dist.column('value') * x_dist.column('count')) / sum(x_dist.column('count'))

3.1499999999999999

This is the same as multiplying the first and third columns – the values and their proportions – and adding up:

sum(x_dist.column('value') * x_dist.column('proportion'))

3.1500000000000004

We now have another way to think about the mean: The mean of a list is the weighted average of the distinct values, where the weights are the proportions in which those values appear.
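The weighted-average view can also be checked with NumPy, whose np.average function accepts a weights argument. This uses the values and proportions from the distribution table above.

```python
import numpy as np

# The mean as a weighted average of the distinct values.
values = np.array([2, 3, 4, 5])
proportions = np.array([0.3, 0.4, 0.15, 0.15])

# Multiply values by their proportions and add up ...
print(np.sum(values * proportions))

# ... which is what np.average does when given weights.
print(np.average(values, weights=proportions))
```

Both expressions return 3.15, the same mean as before.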

Physically, this implies that the average is the balance point of the histogram. Here is the histogram of the distribution. Imagine that it is made out of cardboard and that you are trying to balance the cardboard figure at a point on the horizontal axis. The average of 3.15 is where the figure will balance.

To understand why that is, you will have to study some physics. Balance points, or centers of gravity, can be calculated in exactly the way we calculated the mean from the distribution table.

x_dist.select(['value', 'proportion']).hist(counts='value', bins=np.arange(0.5, 10.6, 1))


Notice that the mean of a list depends only on the distinct values and their proportions; you do not need to know how many entries there are in the list.

In other words, the average is a property of the histogram. If two lists have the same histogram, they will also have the same average.

The mean and the median

If a student's score on a test is below average, does that imply that the student is in the bottom half of the class on that test?

Happily for the student, the answer is, "Not necessarily." The reason has to do with the relation between the average, which is the balance point of the histogram, and the median, which is the "half-way point" of the data.

Let's compare the balance point of the histogram to the point that has 50% of the area of the histogram on either side of it.

The relationship is easy to see in a simple example. Here is the list 1, 2, 2, 3, represented as a distribution table called sym for "symmetric".

sym = Table().with_columns([
    'value', [1, 2, 3],
    'dist', [0.25, 0.5, 0.25]
])

sym


value  dist
1      0.25
2      0.5
3      0.25

The histogram (or a calculation) shows that the average and median are both 2.

sym.hist(counts='value', bins=np.arange(0.5, 10.6, 1))

In general, for symmetric distributions, the mean and the median are equal.

What if the distribution is not symmetric? We will explore this by making a change in the histogram above: we will take the bar at the value 3 and slide it over to the value 10.

# Average versus median:
# Balance point versus 50% point

slide = Table().with_columns([
    'value', [1, 2, 3, 10],
    'dist1', [0.25, 0.5, 0.25, 0],
    'dist2', [0.25, 0.5, 0, 0.25]
])

slide


value  dist1  dist2
1      0.25   0.25
2      0.5    0.5
3      0.25   0
10     0      0.25

slide.hist(counts='value', bins=np.arange(0.5, 10.6, 1))

The blue histogram represents the original distribution; the gold histogram starts out the same as the blue at the left end, but its rightmost bar has slid over to the value 10. The brown part is where the two histograms overlap.

The median and mean of the blue distribution are both equal to 2. The median of the gold distribution is also equal to 2; the area on either side of 2 is still 50%, though the right half is distributed differently from the left.

But the mean of the gold distribution is not 2: the gold histogram would not balance at 2. The balance point has shifted to the right, to 3.75.

sum(slide['value'] * slide['dist2'])

3.75

In the gold distribution, 3 out of 4 entries (75%) are below average. The student with a below average score can therefore take heart. He or she might be in the majority of the class.


The mean, the median, and skewed distributions: In general, if the histogram has a tail on one side (the formal term is "skewed"), then the mean is pulled away from the median in the direction of the tail.
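A tiny made-up list shows the pull of a tail: one very large value drags the mean well above the median, while the median stays at the half-way point.

```python
import numpy as np

# A right-skewed list: the single large value is the "tail".
skewed = np.array([1, 2, 2, 3, 3, 3, 4, 4, 60])

print(np.median(skewed))   # 3.0: the half-way point ignores the tail
print(np.mean(skewed))     # about 9.1: the balance point is dragged right
```

Eight of the nine entries are below this mean, just as most entries of the gold distribution above were below its mean.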

Example: Flight Delays. The Bureau of Transportation Statistics provides data on departures and arrivals of flights in the United States. The table ua contains the departure delay times, in minutes, of about 37,500 United Airlines flights over the past few years. A negative delay means that the flight left earlier than scheduled.

ontime = Table.read_table('airline_ontime.csv')

ontime.show(3)

Carrier  Departure Delay  Arrival Delay
AA       -1               -1
AA       -1               -1
AA       -1               -1

... (457010 rows omitted)

ua = ontime.where('Carrier', 'UA').select(1)

ua.show(3)

Departure Delay
-1
-1
-1

... (37360 rows omitted)

ua_bins = np.arange(-15, 121, 2)

ua.hist(bins=ua_bins, unit='minute')


The histogram of the delay times is right-skewed: it has a long right hand tail.

The mean gets pulled away from the median in the direction of the tail. So we expect the mean delay to be larger than the median, and that is indeed the case:

np.mean(ua['Departure Delay'])

13.885555228434548

np.median(ua['Departure Delay'])

2.0


Spread

Variability

The Rough Size of Deviations from Average

As we have seen, inference based on test statistics must take into account the way in which the statistics vary across samples. It is therefore important to be able to quantify the variability in any list of numbers. One way is to create a measure of the difference between the values in the list and the average of the list.

We will start by defining such a measure in the context of a simple array of just four numbers.

numbers = np.array([1, 2, 2, 10])

numbers

array([ 1,  2,  2, 10])

The goal is to get a measure of roughly how far off the numbers in the list are from the average. To do this, we need the average, and all the deviations from the average. A "deviation from average" is just a value minus the average.

# Step 1. The average.
average = np.mean(numbers)
average

3.75


# Step 2. The deviations from average.
deviations = numbers - average

numbers_and_deviations = Table().with_columns([
    'value', numbers,
    'deviation from average', deviations
])

numbers_and_deviations

value  deviation from average
1      -2.75
2      -1.75
2      -1.75
10     6.25

Some of the deviations are negative; those correspond to values that are below average. Positive deviations correspond to above-average values.

To calculate roughly how big the deviations are, it is natural to compute the mean of the deviations. But something interesting happens when all the deviations are added together:

sum(deviations)

0.0

The positive deviations exactly cancel out the negative ones. This is true of all lists of numbers, no matter what the histogram of the list looks like: the sum of the deviations from average is zero.

Since the sum of the deviations is 0, the mean of the deviations will be 0 as well:

np.mean(deviations)

0.0


Because of this, the mean of the deviations is not a useful measure of the size of the deviations. What we really want to know is roughly how big the deviations are, regardless of whether they are positive or negative. So we need a way to eliminate the signs of the deviations.

There are two time-honored ways of losing signs: the absolute value, and the square. It turns out that taking the square constructs a measure with extremely powerful properties, some of which we will study in this course.

So let us eliminate the signs by squaring all the deviations. Then we will take the mean of the squares:

# Step 3. The squared deviations from average
squared_deviations = deviations ** 2

numbers_and_deviations.with_column('squared deviations from average', squared_deviations)

value  deviation from average  squared deviations from average
1      -2.75                   7.5625
2      -1.75                   3.0625
2      -1.75                   3.0625
10     6.25                    39.0625

# Variance = the mean squared deviation from average
variance = np.mean(squared_deviations)
variance

13.1875

Variance: The mean squared deviation is called the variance of a sequence of values. While it does give us an idea of spread, it is not on the same scale as the original variable and its units are the square of the original. This makes interpretation very difficult. So we return to the original scale by taking the positive square root of the variance:


# Standard Deviation: root mean squared deviation from average
# Steps of calculation: 5 4 3 2 1
sd = variance ** 0.5
sd

3.6314597615834874

Standard Deviation

The quantity that we have just computed is called the standard deviation of the list, and is abbreviated as SD. It measures roughly how far the numbers on the list are from their average.

Definition. The SD of a list is defined as the root mean square of deviations from average. That's a mouthful. But read it from right to left and you have the sequence of steps in the calculation.

Computation. The NumPy function np.std computes the SD of values in an array:

np.std(numbers)

3.6314597615834874

Working with the SD

Let us now examine the SD in the context of a more interesting dataset. The table nba contains data on the players in the National Basketball Association (NBA) in 2013. For each player, the table records the position at which the player usually played, his height in inches, weight in pounds, and age in years.

nba = Table.read_table("nba2013.csv")

nba


Name             Position  Height  Weight  Age in 2013
DeQuan Jones     Guard     80      221     23
Darius Miller    Guard     80      235     23
Trevor Ariza     Guard     80      210     28
James Jones      Guard     80      215     32
Wesley Johnson   Guard     79      215     26
Klay Thompson    Guard     79      205     23
Thabo Sefolosha  Guard     79      215     29
Chase Budinger   Guard     79      218     25
Kevin Martin     Guard     79      185     30
Evan Fournier    Guard     79      206     20

... (495 rows omitted)

Here is a histogram of the players' heights.

nba.select('Height').hist(bins=np.arange(68, 88, 1))

It is no surprise that NBA players are tall! Their average height is just over 79 inches (6'7"), about 10 inches taller than the average height of men in the United States.

mean_height = np.mean(nba.column('Height'))

mean_height


79.065346534653472

About how far off are the players' heights from the average? This is measured by the SD of the heights, which is about 3.45 inches.

sd_height = np.std(nba.column('Height'))

sd_height

3.4505971830275546

The towering center Hasheem Thabeet of the Oklahoma City Thunder was the tallest player at a height of 87 inches.

nba.sort('Height', descending=True).show(3)

Name             Position  Height  Weight  Age in 2013
Hasheem Thabeet  Center    87      263     26
Roy Hibbert      Center    86      278     26
Tyson Chandler   Center    85      235     30

... (502 rows omitted)

Thabeet was about 8 inches above the average height.

87 - mean_height

7.9346534653465284

That's a deviation from average, and it is about 2.3 times the standard deviation:

(87 - mean_height) / sd_height

2.2995015194397923

In other words, the height of the tallest player was about 2.3 SDs above average.


At 69 inches tall, Isaiah Thomas was one of the two shortest players in the NBA. His height was about 2.9 SDs below average.

nba.sort('Height').show(3)

Name            Position  Height  Weight  Age in 2013
Isaiah Thomas   Guard     69      185     24
Nate Robinson   Guard     69      180     29
John Lucas III  Guard     71      157     30

... (502 rows omitted)

(69 - mean_height) / sd_height

-2.9169868288775844

What we have observed is that the tallest and shortest players were both just a few SDs away from the average height. This is an example of why the SD is a useful measure of spread. No matter what the shape of the histogram, the average and the SD together tell you a lot about the data.

First main reason for measuring spread by the SD

Informal statement. For all lists, the bulk of the entries are within the range "average ± a few SDs".

Try to resist the desire to know exactly what fuzzy words like "bulk" and "few" mean. We will make them precise later in this section. For now, let us examine the statement in the context of some examples.

We have already seen that all of the heights of the NBA players were in the range "average ± 3 SDs".

What about the ages? Here is a histogram of the distribution. Yes, there was a 15-year-old player in the NBA in 2013.

nba.select('Age in 2013').hist(bins=np.arange(15, 45, 1))


Juwan Howard was the oldest player, at 40. His age was about 3.2 SDs above average.

nba.sort('Age in 2013', descending=True).show(3)

Name          Position  Height  Weight  Age in 2013
Juwan Howard  Forward   81      250     40
Marcus Camby  Center    83      235     39
Derek Fisher  Guard     73      210     39

... (502 rows omitted)

ages = nba['Age in 2013']

(40 - np.mean(ages)) / np.std(ages)

3.1958482778922357

The youngest was 15-year-old Jarvis Varnado, who won the NBA Championship that year with the Miami Heat. His age was about 2.6 SDs below average.

nba.sort('Age in 2013').show(3)


Name                   Position  Height  Weight  Age in 2013
Jarvis Varnado         Forward   81      230     15
Giannis Antetokounmpo  Forward   81      205     18
Sergey Karasev         Guard     79      197     19

... (502 rows omitted)

(15 - np.mean(ages)) / np.std(ages)

-2.5895811038670811

What we have observed for the heights and ages is true in great generality. For all lists, the bulk of the entries are no more than 2 or 3 SDs away from the average.

These rough statements about deviations from average are made precise by the following result.

Chebychev's inequality

For all lists, and all positive numbers z, the proportion of entries that are in the range "average ± z SDs" is at least 1 - 1/z².

What makes this result powerful is that it is true for all lists – all distributions, no matter how irregular.

Specifically, Chebychev's inequality says that for every list:

the proportion in the range "average ± 2 SDs" is at least 1 - 1/4 = 0.75

the proportion in the range "average ± 3 SDs" is at least 1 - 1/9 ≈ 0.89

the proportion in the range "average ± 4.5 SDs" is at least 1 - 1/4.5² ≈ 0.95

Chebychev's Inequality gives a lower bound, not an exact answer or an approximation. For example, the percent of entries in the range "average ± 2 SDs" might be quite a bit larger than 75%. But it cannot be smaller.
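The "for all lists" claim can be stress-tested on a deliberately irregular, made-up list: a skewed bulk plus a cluster of values far away. Whatever the shape, at least 1 - 1/z² of the entries fall within z SDs of the average.

```python
import numpy as np

# An irregular list: a skewed bulk plus a far-away cluster.
rng = np.random.default_rng(0)
data = np.concatenate([rng.exponential(5, 1000), rng.normal(50, 1, 200)])

mean, sd = np.mean(data), np.std(data)

for z in [2, 3, 4.5]:
    within = np.mean(np.abs(data - mean) < z * sd)
    bound = 1 - 1 / z**2
    assert within >= bound           # Chebychev's guarantee always holds
    print(z, round(within, 3), '>=', round(bound, 3))
```

For z = 2 the actual proportion here is well above the 0.75 floor but nowhere near 1, showing that the bound is a floor rather than an approximation.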

Standard units

In the calculations above, z measures standard units, the number of standard deviations above average.


Some values of standard units are negative, corresponding to original values that are below average. Other values of standard units are positive. But no matter what the distribution of the list looks like, Chebychev's Inequality says that standard units will typically be in the (-5, 5) range.

To convert a value to standard units, first find how far it is from average, and then compare that deviation with the standard deviation.

As we will see, standard units are frequently used in data analysis. So it is useful to define a function that converts an array of numbers to standard units.

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)

As we saw in an earlier section, the table ua contains a column Departure Delay consisting of the departure delay times, in minutes, of over 37,000 United Airlines flights. We can create a new column called z by applying the function standard_units to the column of delay times. This allows us to see all the delay times in minutes as well as their corresponding values in standard units.

ua.append_column('z', standard_units(ua.column('Departure Delay')))

ua


Departure Delay  z
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942
-1               -0.391942

... (37353 rows omitted)

Something rather alarming happens when we sort the delay times from highest to lowest. The standard units are extremely high!

ua.sort(0, descending=True)

Departure Delay  z
886              22.9631
739              19.0925
678              17.4864
595              15.301
566              14.5374
558              14.3267
543              13.9318
541              13.8791
537              13.7738
513              13.1419

... (37353 rows omitted)

What this shows is that it is possible for data to be many SDs above average (and for flights to be delayed by almost 15 hours). The highest value is more than 22 in standard units.


However, the proportion of these extreme values is tiny, and the bounds in Chebychev's inequality still hold true. For example, let us calculate the percent of delay times that are in the range "average ± 2 SDs". This is the same as the percent of times for which the standard units are in the range (-2, 2). That is just about 96%, as computed below, consistent with Chebychev's bound of "at least 75%".

within_2_sd = np.logical_and(ua.column('z') > -2, ua.column('z') < 2)

ua.where(within_2_sd).num_rows / ua.num_rows

0.960522441988063

The histogram of delay times is shown below, with the horizontal axis in standard units. By the table above, the right hand tail continues all the way out to 22 standard units (886 minutes). The area of the histogram outside the range -2 to 2 is about 4%, put together in tiny little bits that are mostly invisible in a histogram.

ua.hist('z', bins=np.arange(-5, 23, 1), unit='standard unit')


The Normal Distribution

We know that the mean is the balance point of the histogram. Unlike the mean, the SD is usually not easy to identify by looking at the histogram.

However, there is one shape of distribution for which the SD is almost as clearly identifiable as the mean. This section examines that shape, as it appears frequently in probability histograms and also in many histograms of data.

The SD and the bell curve

The table births contains data on over 1,000 babies born at a hospital in the Bay Area. The columns contain several variables: the baby's birth weight in ounces; the number of gestational days; the mother's age in years, height in inches, and pregnancy weight in pounds; and a record of whether she smoked or not.

births = Table.read_table('baby.csv')

births

Birth Weight  Gestational Days  Maternal Age  Maternal Height  Maternal Pregnancy Weight  Maternal Smoker
120           284               27            62               100                        False
113           282               33            64               135                        False
128           279               28            64               115                        True
108           282               23            67               125                        True
136           286               25            62               93                         False
138           244               33            62               178                        False
132           245               23            65               140                        False
120           289               25            62               125                        False
143           299               30            66               136                        True
140           351               27            68               120                        False

... (1164 rows omitted)


The mothers' heights have a mean of 64 inches and an SD of 2.5 inches. Unlike the heights of the basketball players, the mothers' heights are distributed fairly symmetrically about the mean in a bell-shaped curve.

heights = births.column('Maternal Height')

mean_height = np.round(np.mean(heights), 1)

mean_height

64.0

sd_height = np.round(np.std(heights), 1)

sd_height

2.5

births.hist('Maternal Height', bins=np.arange(55.5, 72.5, 1), unit='inch')

positions = np.arange(-3, 3.1, 1) * sd_height + mean_height

_ = plots.xticks(positions)

The last two lines of code in the cell above change the labeling of the horizontal axis. Now, the labels correspond to "average ± z SDs" for z = 0, ±1, ±2, and ±3. Because of the shape of the distribution, the "center" has an unambiguous meaning and is clearly visible at 64.


To see how the SD is related to the curve, start at the top of the curve and look towards the right. Notice that there is a place where the curve changes from looking like an "upside-down cup" to a "right-way-up cup"; formally, the curve has a point of inflection. That point is one SD above average. It is the point 64 + 2.5 = 66.5, which is "average plus 1 SD" = 66.5 inches.

Symmetrically on the left-hand side of the mean, the point of inflection is at 64 - 2.5 = 61.5, that is, "average minus 1 SD" = 61.5 inches.

How to spot the SD on a bell-shaped curve: For bell-shaped distributions, the SD is the distance between the mean and the points of inflection on either side.
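The inflection-point rule can be checked numerically for an ideal bell curve with the mothers' mean (64 inches) and SD (2.5 inches): a point of inflection is where the second derivative changes sign, and for this curve that happens exactly one SD on either side of the mean.

```python
import numpy as np

# Bell curve with the mothers' mean and SD.
m, s = 64, 2.5
x = np.arange(55, 73, 0.001)
bell = np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

# Approximate the second derivative and find where its sign flips.
second_deriv = np.gradient(np.gradient(bell, x), x)
flips = x[np.where(np.diff(np.sign(second_deriv)) != 0)[0]]

print(np.round(flips, 1))   # close to 61.5 and 66.5, one SD from the mean
```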

Bell-shaped probability histograms

The reason for studying bell-shaped curves is not just that they appear as some distributions of data, but that they appear as approximations to probability histograms of sample averages.

We have seen this phenomenon already, in the example about estimating the total number of aircraft. Recall that we simulated random samples of size 30 drawn with replacement from the serial numbers 1, 2, 3, ..., 300, and computed two different quantities: the largest of the 30 numbers observed, and twice the average of the 30 numbers. Here are the generated histograms.

N = 300

sample_size = 30

repetitions = 5000

serialno = Table().with_column('serial number', np.arange(1, N+1))

new_est = Table(['i', 'max', '2*avg'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    new_est.append([i, max(sample.column(0)), 2 * np.average(sample.column(0))])

every_ten = np.arange(1, N+100, 10)

new_est.select(['max', '2*avg']).hist(bins=every_ten)


We observed that the distribution of "twice the sample average" is roughly symmetric; now let's also observe that it is roughly bell-shaped.

If the distribution of "twice the average" is bell-shaped, so is the distribution of the average. Here is the distribution of the average of a sample of size 30 drawn from the numbers 1, 2, 3, ..., 300. The number of replications of the sampling procedure is large, so it is safe to assume that the empirical histogram resembles the probability histogram of the sample average.

sample_means = Table(['Sample mean'])

for i in np.arange(repetitions):
    sample = serialno.sample(sample_size, with_replacement=True)
    sample_means.append([np.average(sample.column(0))])

every_five = np.arange(100.5, 200.6, 5)

sample_means.hist(bins=every_five)


We can draw this histogram with a red bell-shaped curve superposed. This curve is generated from the mean and standard deviation of the sample means. Don't worry about the code that draws the curve; just focus on the diagram itself.

# Import a module of standard statistical functions
from scipy import stats

sample_means.hist(bins=every_five)

means = sample_means['Sample mean']
m = np.mean(means)
s = np.std(means)
x = np.arange(100.5, 200.6, 1)
plots.plot(x, stats.norm.pdf(x, m, s), color='r', lw=1)

positions = np.arange(-3, 3.1, 1) * 15 + 150
_ = plots.xticks(positions, [int(y) for y in positions])


Since all the numbers drawn are between 1 and 300, we expect the sample mean to be around 150, which is where the histogram is centered.

Can you guess what the SD is? Run your eye along the curve till you feel you are close to the point of inflection to the right of center. If you guessed that the point is at about 165, you're correct – it's between 165 and 166, and symmetrically between 134 and 135 to the left of center.

The probability histogram of a random sample average

What is notable about this is not just that we can spot the SD, but that the probability distribution of the statistic is bell-shaped in the first place. There is a remarkable piece of mathematics called the Central Limit Theorem that shows that the distribution of the average of a large random sample will be roughly normal, no matter what the distribution of the population from which the sample is drawn.

In our example about serial numbers, each of the 300 serial numbers is equally likely to be drawn each time, so the distribution of the population is uniform and the probability histogram looks like a brick.

serialno.hist(bins=np.arange(0.5, 300.5, 1))

Let's test out the Central Limit Theorem by drawing samples from quite a different distribution. Here is the histogram of over 37,000 flight delay times in the table ua. It looks nothing like the uniform distribution above.

plotrange = np.arange(-15, 121, 2)
ua.hist(bins=plotrange, unit='minute')


We will now look at the probability distribution of the average of 625 flight times drawn at random from this population. The draws will be made with replacement, as they were in the example about serial numbers.

sample_size = 625
sample_means = Table(['Sample mean'])
for i in np.arange(repetitions):
    sample = ua.sample(sample_size, with_replacement=True)
    sample_means.append([np.average(sample.column(0))])

sample_means.hist(bins=np.arange(9, 19.1, 0.5), unit='minute')

# Draw a red bell-shaped curve from the mean and SD of the sample means
means = sample_means['Sample mean']
m = np.mean(means)
s = np.std(means)
x = np.arange(9, 19.1, 0.1)
_ = plots.plot(x, stats.norm.pdf(x, np.mean(ua[0]), np.std(ua[0]) / 25), color='r', lw=1)


The probabilities for the sample average are roughly bell-shaped, consistent with the Central Limit Theorem.

The average delay time of all the flights in the population is about 14 minutes, which is where the histogram is centered. To spot the SD, it will help to note that the bars are half a minute wide. By eye, the points of inflection look to be about three bars away from the center. So we would guess that the SD is about 1.5 minutes, and in fact that is not far from the exact value.

What is the exact value? We will answer that question later in the term. Remarkably, it can be calculated easily based on the sample size and the SD of the population, without doing any simulations at all.

For now, here is the main point to note.

Second main reason for using the SD to measure spread

For a large random sample, the probability distribution of the sample average will be roughly bell-shaped, with a mean and SD that can be easily identified, no matter what the distribution of the population from which the sample is drawn.
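As a preview of that later calculation, the rule in question is the square root law: the SD of the sample average is the population SD divided by the square root of the sample size. A simulation sketch, using an invented right-skewed "delay" population (the real ua table isn't available here):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical right-skewed population, standing in for the flight delay data
population = rng.exponential(scale=14.0, size=37000)

n = 625
sample_means = np.array([
    rng.choice(population, n, replace=True).mean()
    for _ in range(2000)
])

predicted_sd = np.std(population) / np.sqrt(n)   # square root law
observed_sd = np.std(sample_means)
```

The observed SD of the 2000 simulated sample means comes out very close to the predicted value of population SD / 25.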

The standard normal curve

The bell-shaped curves above look essentially all the same apart from the labels that we have placed on the horizontal axes. Indeed, there is really just one basic curve from which all of these curves can be drawn just by relabeling the axes appropriately.

To draw that basic curve, we will use the units into which we can convert every list: standard units. The resulting curve is therefore called the standard normal curve.
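The standard_units function is used throughout this text but not defined in this excerpt; a minimal version consistent with how it is used (subtract the mean, divide by the SD) would be:

```python
import numpy as np

def standard_units(values):
    """Convert an array of numbers to standard units:
    how many SDs each value is above (+) or below (-) the mean."""
    values = np.asarray(values, dtype=float)
    return (values - np.mean(values)) / np.std(values)
```

Any list converted this way has mean 0 and SD 1, which is why the horizontal axis of the standard normal curve is measured in these units.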


The standard normal curve has an impressive equation. But for now, it is best to think of it as a smoothed version of a histogram of a variable that has been measured in standard units.

# The standard normal probability density function (pdf)
plot_normal_cdf()
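For the record, the "impressive equation" is the standard normal probability density:

```latex
\phi(z) \;=\; \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}, \qquad -\infty < z < \infty
```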

As always when you examine a new histogram, start by looking at the horizontal axis. On the horizontal axis of the standard normal curve, the values are standard units. Here are some properties of the curve. Some are apparent by observation, and others require a considerable amount of mathematics to establish.

The total area under the curve is 1. So you can think of it as a histogram drawn to the density scale.

The curve is symmetric about 0. So if a variable has this distribution, its mean and median are both 0.

The points of inflection of the curve are at -1 and +1.

If a variable has this distribution, its SD is 1. The normal curve is one of the very few distributions that has an SD so clearly identifiable on the histogram.
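The claim that the points of inflection sit at -1 and +1 is one of the facts that "require mathematics to establish"; it follows from differentiating the density twice:

```latex
\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}, \qquad
\phi'(z) = -z\,\phi(z), \qquad
\phi''(z) = (z^{2}-1)\,\phi(z)
```

Since \(\phi(z) > 0\) everywhere, \(\phi''(z)\) changes sign exactly at \(z = \pm 1\), which is where the curve switches between curving downward and curving upward.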

Since we are thinking of the curve as a smoothed histogram, we will want to represent proportions of the total amount of data by areas under the curve. Areas under smooth curves are often found by calculus, using a method called integration. It is a fact of mathematics, however, that the standard normal curve cannot be integrated in any of the usual ways of calculus. Therefore, areas under the curve have to be approximated. That is why almost all statistics textbooks carry tables of areas under the normal curve. It is also why all statistical systems, including the stats module of SciPy, include methods that provide excellent approximations to those areas.
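Those approximations need not be mysterious. A crude version of what stats.norm.cdf computes is just a Riemann sum over the density; the function name and step size below are my own choices, not part of any library:

```python
import math

def normal_area_below(z, lo=-10.0, step=1e-4):
    """Approximate the area under the standard normal curve to the left of z,
    using a midpoint Riemann sum over [lo, z]."""
    total = 0.0
    x = lo
    while x < z:
        mid = x + step / 2
        total += math.exp(-mid * mid / 2) / math.sqrt(2 * math.pi) * step
        x += step
    return total
```

normal_area_below(1) agrees with the value of stats.norm.cdf(1) shown below (about 0.8413) to several decimal places.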

Areas under the normal curve

The standard normal cumulative distribution function (cdf)

The fundamental function for finding areas under the normal curve is stats.norm.cdf. It takes a numerical argument and returns all the area under the curve to the left of that number. Formally, it is called the "cumulative distribution function" of the standard normal curve. That rather unwieldy mouthful is abbreviated as cdf.

Let us use this function to find the area to the left of 1 under the standard normal curve.

# Area under the standard normal curve, below 1
plot_normal_cdf(1)

stats.norm.cdf(1)

0.84134474606854293

That's about 84%. We can now use the symmetry of the curve and the fact that the total area under the curve is 1 to find other areas.


For example, the area to the right of 1 is about 100% - 84% = 16%.

# Area under the standard normal curve, above 1
plot_normal_cdf(lbound=1)

1 - stats.norm.cdf(1)

0.15865525393145707

The area between -1 and 1 can be computed in several different ways.

# Area under the standard normal curve, between -1 and 1
plot_normal_cdf(1, lbound=-1)


For example, we can calculate the area as "100% - two equal tails", which works out to roughly 100% - 2 x 16% = 68%.

Or we could note that the area between -1 and 1 is equal to all the area to the left of 1, minus all the area to the left of -1.

stats.norm.cdf(1) - stats.norm.cdf(-1)

0.68268949213708585

By a similar calculation, we see that the area between -2 and 2 is about 95%.

# Area under the standard normal curve, between -2 and 2
plot_normal_cdf(2, lbound=-2)


stats.norm.cdf(2) - stats.norm.cdf(-2)

0.95449973610364158

In other words, if a histogram is roughly bell shaped, the proportion of data in the range "average ± 2 SDs" is about 95%.

That is quite a bit more than Chebychev's lower bound of 75%. Chebychev's bound is weaker because it has to work for all distributions. If we know that a distribution is normal, we have good approximations to the proportions, not just bounds.

The table below compares what we know about all distributions and about normal distributions. Notice that Chebychev's bound is not very illuminating when z = 1.

Percent in Range | All Distributions | Normal Distribution
average ± 1 SD | at least 0% | about 68%
average ± 2 SDs | at least 75% | about 95%
average ± 3 SDs | at least 88.888...% | about 99.73%
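Both columns of the comparison can be reproduced directly: the normal proportions from the error function, and Chebychev's column from its formula. The function names below are my own, not from any library:

```python
import math

def normal_within(z):
    """Proportion of a normal distribution within z SDs of the mean."""
    return math.erf(z / math.sqrt(2))

def chebyshev_bound(z):
    """Chebychev's lower bound on that proportion, valid for every distribution."""
    return max(0.0, 1 - 1 / z**2)

for z in (1, 2, 3):
    print(z, chebyshev_bound(z), normal_within(z))
```

At z = 1 the bound is 0, which is why it is "not very illuminating" there, while the normal proportion is already about 68%.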


Exploration: Privacy

Section author: David Wagner


License plates

We're going to look at some data collected by the Oakland Police Department. They have automated license plate readers on their police cars, and they've built up a database of license plates that they've seen -- and where and when they saw each one.

Data collection

First, we'll gather the data. It turns out the data is publicly available on the Oakland public records site. I downloaded it and combined it into a single CSV file myself before lecture.

lprs = Table.read_table('./all-lprs.csv.gz', compression='gzip', sep=',')

lprs.labels

('red_VRM', 'red_Timestamp', 'Location')

Let's start by renaming some columns, and then take a look at it.

lprs.relabel('red_VRM', 'Plate')
lprs.relabel('red_Timestamp', 'Timestamp')
lprs


Plate | Timestamp | Location
1275226 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
27529C | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
1158423 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
1273718 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
1077682 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999)
1214195 | 01/19/2011 02:06:00 AM | (37.798281000000003, -122.27575299999999)
1062420 | 01/19/2011 02:06:00 AM | (37.79833, -122.27574300000001)
1319726 | 01/19/2011 02:05:00 AM | (37.798475000000003, -122.27571500000001)
1214196 | 01/19/2011 02:05:00 AM | (37.798499999999997, -122.27571)
75227 | 01/19/2011 02:05:00 AM | (37.798596000000003, -122.27569)
... (2742091 rows omitted)

Phew, that's a lot of data: we can see about 2.7 million license plate reads here.

Let's start by seeing what can be learned about someone, using this data -- assuming you know their license plate.

Stalking Jean Quan

As a warmup, we'll take a look at ex-Mayor Jean Quan's car, and where it has been seen. Her license plate number is 6FCH845. (How did I learn that? Turns out she was in the news for getting $1000 of parking tickets, and the news article included a picture of her car, with the license plate visible. You'd be amazed by what's out there on the Internet...)

lprs.where('Plate', '6FCH845')


Plate | Timestamp | Location
6FCH845 | 11/01/2012 09:04:00 AM | (37.79871, -122.276221)
6FCH845 | 10/24/2012 11:15:00 AM | (37.799695, -122.274868)
6FCH845 | 10/24/2012 11:01:00 AM | (37.799693, -122.274806)
6FCH845 | 10/24/2012 10:20:00 AM | (37.799735, -122.274893)
6FCH845 | 05/08/2014 07:30:00 PM | (37.797558, -122.26935)
6FCH845 | 12/31/2013 10:09:00 AM | (37.807556, -122.278485)

OK, so her car shows up 6 times in this dataset. However, it's hard to make sense of those coordinates. I don't know about you, but I can't read GPS so well.

So, let's work out a way to show where her car has been seen on a map. We'll need to extract the latitude and longitude, as the data isn't quite in the format that the mapping software expects: the mapping software expects the latitude to be in one column and the longitude in another. Let's write some Python code to do that, by splitting the Location string into two pieces: the stuff before the comma (the latitude) and the stuff after (the longitude).

def get_latitude(s):
    before, after = s.split(',')          # Break it into two parts
    lat_string = before.replace('(', '')  # Get rid of the annoying '('
    return float(lat_string)              # Convert the string to a number

def get_longitude(s):
    before, after = s.split(',')                   # Break it into two parts
    long_string = after.replace(')', '').strip()   # Get rid of the ')' and spaces
    return float(long_string)                      # Convert the string to a number

Let's test it to make sure it works correctly.

get_latitude('(37.797558, -122.26935)')

37.797558

get_longitude('(37.797558, -122.26935)')

-122.26935

Good, now we're ready to add these as extra columns to the table.

lprs = lprs.with_columns([
    'Latitude', lprs.apply(get_latitude, 'Location'),
    'Longitude', lprs.apply(get_longitude, 'Location')
])
lprs

Plate | Timestamp | Location | Latitude | Longitude
1275226 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
27529C | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
1158423 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
1273718 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
1077682 | 01/19/2011 02:06:00 AM | (37.798304999999999, -122.27574799999999) | 37.7983 | -122.276
1214195 | 01/19/2011 02:06:00 AM | (37.798281000000003, -122.27575299999999) | 37.7983 | -122.276
1062420 | 01/19/2011 02:06:00 AM | (37.79833, -122.27574300000001) | 37.7983 | -122.276
1319726 | 01/19/2011 02:05:00 AM | (37.798475000000003, -122.27571500000001) | 37.7985 | -122.276
1214196 | 01/19/2011 02:05:00 AM | (37.798499999999997, -122.27571) | 37.7985 | -122.276
75227 | 01/19/2011 02:05:00 AM | (37.798596000000003, -122.27569) | 37.7986 | -122.276
... (2742091 rows omitted)

And at last, we can draw a map with a marker everywhere that her car has been seen.

jean_quan = lprs.where('Plate', '6FCH845').select(['Latitude', 'Longitude'])
Marker.map_table(jean_quan)


OK, so it's been seen near the Oakland police department. This should make you suspect we might be getting a bit of a biased sample. Why might the Oakland PD be the most common place where her car is seen? Can you come up with a plausible explanation for this?

Poking around

Let's try another. And let's see if we can make the map a little more fancy. It'd be nice to distinguish between license plate reads that are seen during the daytime (on a weekday), vs the evening (on a weekday), vs on a weekend. So we'll color-code the markers. To do this, we'll write some Python code to analyze the Timestamp and choose an appropriate color.


import datetime

def get_color(ts):
    t = datetime.datetime.strptime(ts, '%m/%d/%Y %I:%M:%S %p')
    if t.weekday() >= 6:
        return 'green'  # Weekend
    elif t.hour >= 6 and t.hour <= 17:
        return 'blue'   # Weekday daytime
    else:
        return 'red'    # Weekday evening

lprs.append_column('Color', lprs.apply(get_color, 'Timestamp'))
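A quick way to sanity-check the color-coding is to call get_color on hand-made timestamps. The function is restated here so the snippet is self-contained; note, as a quirk of the rule above, that weekday() >= 6 matches only Sundays, so Saturdays get weekday colors:

```python
import datetime

def get_color(ts):
    t = datetime.datetime.strptime(ts, '%m/%d/%Y %I:%M:%S %p')
    if t.weekday() >= 6:
        return 'green'  # Sunday
    elif t.hour >= 6 and t.hour <= 17:
        return 'blue'   # weekday daytime
    else:
        return 'red'    # weekday evening / night

print(get_color('01/19/2011 02:06:00 AM'))  # a Wednesday at 2 AM
```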

Now we can check out another license plate, this time with our spiffy color-coding. This one happens to be the car that the city issues to the Fire Chief.

t = lprs.where('Plate', '1328354').select(['Latitude', 'Longitude', 'Timestamp', 'Color'])
Marker.map_table(t)


Hmm. We can see a blue cluster in downtown Oakland, where the Fire Chief's car was seen on weekdays during business hours. I bet we've found her office. In fact, if you happen to know downtown Oakland, those are mostly clustered right near City Hall. Also, her car was seen twice in northern Oakland on weekday evenings. One can only speculate what that indicates. Maybe dinner with a friend? Or running errands? Off to the scene of a fire? Who knows. And then the car has been seen once more, late at night on a weekend, in a residential area in the hills. Her home address, maybe?

Let's look at another. This time, we'll make a function to display the map.

def map_plate(plate):
    sightings = lprs.where('Plate', plate)
    t = sightings.select(['Latitude', 'Longitude', 'Timestamp', 'Color'])
    return Marker.map_table(t)

map_plate('5AJG153')


What can we tell from this? Looks to me like this person lives on International Blvd and 9th, roughly. On weekdays they've been seen in a variety of locations in west Oakland. It's fun to imagine what this might indicate -- delivery person? taxi driver? someone running errands all over the place in west Oakland?

We can look at another:

map_plate('6UZA652')


What can we learn from this map? First, it's pretty easy to guess where this person lives: 16th and International, or pretty near there. And then we can see them spending some nights and a weekend near Laney College. Did they have an apartment there briefly? A relationship with someone who lived there?

Is anyone else getting a little bit creeped out about this? I think I've had enough of looking at individual people's data.

Inference

As we can see, this kind of data can potentially reveal a fair bit about people. Someone with access to the data can draw inferences. Take a moment to think about what someone might be able to infer from this kind of data.

As we've seen here, it's not too hard to make a pretty good guess at roughly where someone lives, from this kind of information: their car is probably parked near their home most nights. Also, it will often be possible to guess where someone works: if they commute into work by car, then on weekdays during business hours, their car is probably parked near their office, so we'll see a clear cluster that indicates where they work.


But it doesn't stop there. If we have enough data, it might also be possible to get a sense of what they like to do during their downtime (do they spend time at the park?). And in some cases the data might reveal that someone is in a relationship and spending nights at someone else's house. That's arguably pretty sensitive stuff.

This gets at one of the challenges with privacy. Data that's collected for one purpose (fighting crime, or something like that) can potentially reveal a lot more. It can allow the owner of the data to draw inferences -- sometimes about things that people would prefer to keep private. And that means that, in a world of "big data", if we're not careful, privacy can be collateral damage.

Mitigation

If we want to protect people's privacy, what can be done about this? That's a lengthy subject. But at risk of over-simplifying, there are a few simple strategies that data owners can take:

1. Minimize the data they have. Collect only what they need, and delete it after it's not needed.

2. Control who has access to the sensitive data. Perhaps only a handful of trusted insiders need access; if so, then one can lock down the data so only they have access to it. One can also log all access, to deter misuse.

3. Anonymize the data, so it can't be linked back to the individual who it is about. Unfortunately, this is often harder than it sounds.

4. Engage with stakeholders. Provide transparency, to try to avoid people being taken by surprise. Give individuals a way to see what data has been collected about them. Give people a way to opt out and have their data be deleted, if they wish. Engage in a discussion about values, and tell people what steps you are taking to protect them from unwanted consequences.

This only scratches the surface of the subject. My main goal in this lecture was to make you aware of privacy concerns, so that if you are ever a steward of a large dataset, you can think about how to protect people's data and use it responsibly.


Chapter 4

Prediction

Correlation

The relation between two variables

In the previous sections, we developed several tools that help us describe the distribution of a single variable. Data science also helps us understand how multiple variables are related to each other. This allows us to predict the value of one variable given the values of others, and to get a sense of the amount of error in the prediction.

A good way to start exploring the relation between two variables is by visualization. A graph called a scatter diagram can be used to plot the values of one variable against values of another. Let us start by looking at some scatter diagrams and then move on to quantifying some of the features that we see.

The table hybrid contains data on hybrid passenger cars sold in the United States from 1997 to 2013. The data were obtained from the online data archive of Prof. Larry Winner of the University of Florida. The columns include the model of the car, year of manufacture, the MSRP (manufacturer's suggested retail price) in 2013 dollars, the acceleration rate in km per hour per second, fuel economy in miles per gallon, and the model's class.

hybrid

vehicle | year | msrp | acceleration | mpg | class
Prius (1st Gen) | 1997 | 24509.7 | 7.46 | 41.26 | Compact
Tino | 2000 | 35355 | 8.2 | 54.1 | Compact
Prius (2nd Gen) | 2000 | 26832.2 | 7.97 | 45.23 | Compact
Insight | 2000 | 18936.4 | 9.52 | 53 | Two Seater
Civic (1st Gen) | 2001 | 25833.4 | 7.04 | 47.04 | Compact
Insight | 2001 | 19036.7 | 9.52 | 53 | Two Seater
Insight | 2002 | 19137 | 9.71 | 53 | Two Seater
Alphard | 2003 | 38084.8 | 8.33 | 40.46 | Minivan
Insight | 2003 | 19137 | 9.52 | 53 | Two Seater
Civic | 2003 | 14071.9 | 8.62 | 41 | Compact


... (143 rows omitted)

The Table method scatter can be used to plot acceleration on the horizontal axis (the first argument) and msrp on the vertical (the second argument). There are 153 points in the scatter, one for each car in the table.

hybrid.scatter('acceleration', 'msrp')

The scatter of points is sloping upwards, indicating that cars with greater acceleration tended to cost more, on average; conversely, the cars that cost more tended to have greater acceleration on average.

This is an example of positive association: above-average values of one variable tend to be associated with above-average values of the other.

The scatter diagram of MSRP (vertical axis) versus mileage (horizontal axis) shows a negative association. The scatter has a clear downward trend. Hybrid cars with higher mileage tended to cost less, on average. This seems surprising till you consider that cars that accelerate fast tend to be less fuel efficient and have lower mileage. As the previous scatter showed, those were also the cars that tended to cost more.

hybrid.scatter('mpg', 'msrp')


Along with the negative association, the scatter diagram of price versus efficiency shows a non-linear relation between the two variables. The points appear to be clustered around a curve.

If we restrict the data just to the SUV class, however, the association between price and efficiency is still negative but the relation appears to be more linear. The relation between the price and acceleration of SUVs also shows a linear trend, but with a positive slope.

suv = hybrid.where('class', 'SUV')
suv.scatter('mpg', 'msrp')
suv.scatter('acceleration', 'msrp')


You will have noticed that we can derive useful information from the general orientation and shape of a scatter diagram even without paying attention to the units in which the variables were measured.

Indeed, we could plot all the variables in standard units and the plots would look the same. This gives us a way to compare the degree of linearity in two scatter diagrams.

Here are the two scatter diagrams for SUVs, with all the variables measured in standard units.

Table().with_columns([
    'mpg (standard units)', standard_units(suv.column('mpg')),
    'msrp (standard units)', standard_units(suv.column('msrp'))
]).scatter(0, 1)
plots.xlim([-3, 3])
plots.ylim([-3, 3])
None


Table().with_columns([
    'acceleration (standard units)', standard_units(suv.column('acceleration')),
    'msrp (standard units)', standard_units(suv.column('msrp'))
]).scatter(0, 1)
plots.xlim([-3, 3])
plots.ylim([-3, 3])
None

The associations that we see in these figures are the same as those we saw before. Also, because the two scatter diagrams are drawn on exactly the same scale, we can see that the linear relation in the second diagram is a little more fuzzy than in the first.

We will now define a measure that uses standard units to quantify the kinds of association that we have seen.


The correlation coefficient

The correlation coefficient measures the strength of the linear relationship between two variables. Graphically, it measures how clustered the scatter diagram is around a straight line.

The term correlation coefficient isn't easy to say, so it is usually shortened to correlation and denoted by r.

Here are some mathematical facts about r that we will observe by simulation.

The correlation coefficient r is a number between -1 and 1. r measures the extent to which the scatter plot clusters around a straight line.

r = 1 if the scatter diagram is a perfect straight line sloping upwards, and r = -1 if the scatter diagram is a perfect straight line sloping downwards.

The function r_scatter takes a value of r as its argument and simulates a scatter plot with a correlation very close to r. Because of randomness in the simulation, the correlation is not expected to be exactly equal to r.

Call r_scatter a few times, with different values of r as the argument, and see how the football-shaped cloud changes. Positive r corresponds to positive association: above-average values of one variable are associated with above-average values of the other, and the scatter plot slopes upwards.

When r = 1, the scatter plot is perfectly linear and slopes upward. When r = -1, the scatter plot is perfectly linear and slopes downward. When r = 0, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be uncorrelated.

r_scatter(0.8)


r_scatter(0.25)

r_scatter(0)

r_scatter(-0.7)


Calculating r

The formula for r is not apparent from our observations so far, and it has a mathematical basis that is outside the scope of this class. However, the calculation is straightforward and helps us understand several of the properties of r.

Formula for r:

r is the average of the products of the two variables, when both variables are measured in standard units.

Here are the steps in the calculation. We will apply the steps to a simple table of values of x and y.

x = np.arange(1, 7, 1)
y = [2, 3, 1, 5, 2, 7]
t = Table().with_columns([
    'x', x,
    'y', y
])
t


x | y
1 | 2
2 | 3
3 | 1
4 | 5
5 | 2
6 | 7

Based on the scatter diagram, we expect that r will be positive but not equal to 1.

t.scatter(0, 1, s=30, color='red')

Step 1. Convert each variable to standard units.

t_su = t.with_columns([
    'x (standard units)', standard_units(x),
    'y (standard units)', standard_units(y)
])
t_su


x | y | x (standard units) | y (standard units)
1 | 2 | -1.46385 | -0.648886
2 | 3 | -0.87831 | -0.162221
3 | 1 | -0.29277 | -1.13555
4 | 5 | 0.29277 | 0.811107
5 | 2 | 0.87831 | -0.648886
6 | 7 | 1.46385 | 1.78444

Step 2. Multiply each pair of standard units.

t_product = t_su.with_column('product of standard units', t_su.column(2) * t_su.column(3))
t_product

x | y | x (standard units) | y (standard units) | product of standard units
1 | 2 | -1.46385 | -0.648886 | 0.949871
2 | 3 | -0.87831 | -0.162221 | 0.142481
3 | 1 | -0.29277 | -1.13555 | 0.332455
4 | 5 | 0.29277 | 0.811107 | 0.237468
5 | 2 | 0.87831 | -0.648886 | -0.569923
6 | 7 | 1.46385 | 1.78444 | 2.61215

Step 3. r is the average of the products computed in Step 2.

# r is the average of the products of standard units
r = np.mean(t_product.column(4))
r

0.61741639718977093

As expected, r is positive but not equal to 1.

Properties of r

The calculation shows that:


r is a pure number. It has no units. This is because r is based on standard units.

r is unaffected by changing the units on either axis. This too is because r is based on standard units.

r is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called x and which y. Geometrically, switching axes reflects the scatter plot about the line y = x, but does not change the amount of clustering nor the sign of the association.

t.scatter('y', 'x', s=30, color='red')
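The first two properties can be checked mechanically. This is a self-contained NumPy sketch of the same r computation, applied to the x and y values from the table above; r_of is a name I made up for the sketch:

```python
import numpy as np

def standard_units(a):
    a = np.asarray(a, dtype=float)
    return (a - a.mean()) / a.std()

def r_of(x, y):
    """r: the mean of the products of the two variables in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

x = np.arange(1, 7)
y = np.array([2, 3, 1, 5, 2, 7])
r = r_of(x, y)

# switching the axes leaves r unchanged
assert abs(r - r_of(y, x)) < 1e-12
# changing the units of x (say, multiplying by 2 and shifting) leaves r unchanged
assert abs(r - r_of(2 * x + 10, y)) < 1e-12
```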

We can define a function correlation to compute r, based on the formula that we used above. The arguments are a table and two labels of columns in the table. The function returns the mean of the products of those column values in standard units, which is r.

def correlation(t, x, y):
    return np.mean(standard_units(t.column(x)) * standard_units(t.column(y)))

Let's call the function on the x and y columns of t. The function returns the same answer for the correlation between x and y as we got by direct application of the formula for r.

correlation(t, 'x', 'y')

0.61741639718977093
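The value can also be cross-checked against NumPy's built-in correlation routine, which computes the same quantity for a pair of arrays:

```python
import numpy as np

x = np.arange(1, 7)
y = np.array([2, 3, 1, 5, 2, 7])

def standard_units(a):
    a = np.asarray(a, dtype=float)
    return (a - a.mean()) / a.std()

r = np.mean(standard_units(x) * standard_units(y))

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
r_numpy = np.corrcoef(x, y)[0, 1]
```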


Calling correlation on columns of the table suv gives us the correlation between price and mileage as well as the correlation between price and acceleration.

correlation(suv, 'mpg', 'msrp')

-0.6667143635709919

correlation(suv, 'acceleration', 'msrp')

0.48699799279959155

These values confirm what we had observed:

There is a negative association between price and efficiency, whereas the association between price and acceleration is positive. The linear relation between price and acceleration is a little weaker (correlation about 0.5) than between price and mileage (correlation about -0.67).

Care in Using Correlation

Correlation is a simple and powerful concept, but it is sometimes misused. Before using r, it is important to be aware of what correlation does and does not measure.

Correlation only measures association. Correlation does not imply causation. Though the correlation between the weight and the math ability of children in a school district may be positive, that does not mean that doing math makes children heavier or that putting on weight improves the children's math skills. Age is a confounding variable: older children are both heavier and better at math than younger children, on average.

Correlation measures linear association. Variables that have strong non-linear association might have very low correlation. Here is an example of variables that have a perfect quadratic relation y = x² but have correlation equal to 0.

nonlinear = Table().with_columns([
    'x', np.arange(-4, 4.1, 0.5),
    'y', np.arange(-4, 4.1, 0.5)**2
])
nonlinear.scatter('x', 'y', s=30, color='r')


correlation(nonlinear, 'x', 'y')

0.0

Outliers can have a big effect on correlation. Here is an example where a scatter plot for which r is equal to 1 is turned into a plot for which r is equal to 0, by the addition of just one outlying point.

line = Table().with_columns([
    'x', [1, 2, 3, 4],
    'y', [1, 2, 3, 4]
])
line.scatter('x', 'y', s=30, color='r')


correlation(line, 'x', 'y')

1.0

outlier = Table().with_columns([
    'x', [1, 2, 3, 4, 5],
    'y', [1, 2, 3, 4, 0]
])
outlier.scatter('x', 'y', s=30, color='r')

correlation(outlier, 'x', 'y')

0.0

Correlations based on aggregated data can be misleading. As an example, here are data on the Critical Reading and Math SAT scores in 2014. There is one point for each of the 50 states and one for Washington, D.C. The column Participation Rate contains the percent of high school seniors who took the test. The next three columns show the average score in the state on each portion of the test, and the final column is the average of the total scores on the test.

sat2014 = Table.read_table('sat2014.csv').sort('State')
sat2014


State | Participation Rate | Critical Reading | Math | Writing | Combined
Alabama | 6.7 | 547 | 538 | 532 | 1617
Alaska | 54.2 | 507 | 503 | 475 | 1485
Arizona | 36.4 | 522 | 525 | 500 | 1547
Arkansas | 4.2 | 573 | 571 | 554 | 1698
California | 60.3 | 498 | 510 | 496 | 1504
Colorado | 14.3 | 582 | 586 | 567 | 1735
Connecticut | 88.4 | 507 | 510 | 508 | 1525
Delaware | 100 | 456 | 459 | 444 | 1359
District of Columbia | 100 | 440 | 438 | 431 | 1309
Florida | 72.2 | 491 | 485 | 472 | 1448
... (41 rows omitted)

The scatter diagram of Math scores versus Critical Reading scores is very tightly clustered around a straight line; the correlation is close to 0.985.

sat2014.scatter('Critical Reading', 'Math')

correlation(sat2014, 'Critical Reading', 'Math')

0.98475584110674341


It is important to note that this does not reflect the strength of the relation between the Math and Critical Reading scores of students. States don't take tests – students do. The data in the table have been created by lumping all the students in each state into a single point at the average values of the two variables in that state. But not all students in the state will be at that point, as students vary in their performance. If you plot a point for each student instead of just one for each state, there will be a cloud of points around each point in the figure above. The overall picture will be more fuzzy. The correlation between the Math and Critical Reading scores of the students will be lower than the value calculated based on state averages.

Correlations based on aggregates and averages are called ecological correlations and are frequently reported. As we have just seen, they must be interpreted with care.
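The aggregation effect can be seen in a small simulation. All the numbers below are invented, not the SAT data: each "student" score is a shared state-level shift plus individual noise, and averaging within states washes out the individual noise, inflating the correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_students = 50, 100

# Invented scores: a shared state-level shift plus independent student noise
state_shift = rng.normal(0, 50, size=n_states)
reading = 500 + np.repeat(state_shift, n_students) + rng.normal(0, 60, n_states * n_students)
math = 500 + np.repeat(state_shift, n_students) + rng.normal(0, 60, n_states * n_students)

student_r = np.corrcoef(reading, math)[0, 1]

# Correlation of the 50 state averages: the "ecological" correlation
state_r = np.corrcoef(reading.reshape(n_states, n_students).mean(axis=1),
                      math.reshape(n_states, n_students).mean(axis=1))[0, 1]
```

Here state_r comes out far higher than student_r, mirroring the warning above about state averages overstating the student-level relation.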

Serious or tongue-in-cheek?

In 2012, a paper in the respected New England Journal of Medicine examined the relation between chocolate consumption and Nobel Prizes in a group of countries. Scientific American responded seriously; others were more relaxed. The paper included a graph plotting the two variables, with one point per country.


Regression

The regression line

The concepts of correlation and the "best" straight line through a scatter plot were developed in the late 1800's and early 1900's. The pioneers in the field were Sir Francis Galton, who was a cousin of Charles Darwin, and Galton's protégé Karl Pearson. Galton was interested in eugenics, and was a meticulous observer of the physical traits of parents and their offspring. Pearson, who had greater expertise than Galton in mathematics, helped turn those observations into the foundations of mathematical statistics.

The scatter plot below is of a famous data set collected by Pearson and his colleagues in the early 1900's. It consists of the heights, in inches, of 1,078 pairs of fathers and sons. The scatter method of a table takes an optional argument s to set the size of the dots.

heights = Table.read_table('heights.csv')
heights.scatter('father', 'son', s=10)

Notice the familiar football shape with a dense center and a few points on the perimeter. This shape results from two bell-shaped distributions that are correlated. The heights of both fathers and sons have a bell-shaped distribution.


heights.hist(bins=np.arange(55.5, 80, 1), unit='inch')

The average height of sons is about an inch taller than the average of fathers.

np.mean(heights.column('son')) - np.mean(heights.column('father'))

0.99740259740261195

The difference in height between a son and a father varies as well, with the bulk of the distribution between -5 inches and +7 inches.

diffs = Table().with_column('son - father difference',
                            heights.column('son') - heights.column('father'))
diffs.hist(bins=np.arange(-9.5, 10, 1), unit='inch')


The correlation between the heights of the fathers and sons is about 0.5.

r = correlation(heights, 'father', 'son')
r

0.50116268080759108

The regression effect

For fathers around 72 inches in height, we might expect their sons to be tall as well. The histogram below shows the height of all sons of 72-inch fathers.

six_foot_fathers = heights.where(np.round(heights.column('father')) == 72)
six_foot_fathers.hist('son', bins=np.arange(55.5, 80, 1))


Most (68%) of the sons of these 72-inch fathers are less than 72 inches tall, even though sons are an inch taller than fathers on average!

np.count_nonzero(six_foot_fathers.column('son') < 72) / six_foot_fathers.num_rows

0.6842105263157895

In fact, the average height of a son of a 72-inch father is less than 71 inches. The sons of tall fathers are simply not as tall in this sample.

np.mean(six_foot_fathers.column('son'))

70.728070175438603

This fact was noticed by Galton, who had been hoping that exceptionally tall fathers would have sons who were just as exceptionally tall. However, the data were clear, and Galton realized that the tall fathers have sons who are not quite as exceptionally tall, on average. Frustrated, Galton called this phenomenon "regression to mediocrity."

Galton also noticed that exceptionally short fathers had sons who were somewhat taller relative to their generation, on average. In general, individuals who are away from average on one variable are expected to be not quite as far away from average on the other. This is called the regression effect.
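The regression effect can also be seen in simulation. The sketch below uses plain numpy (not the book's Table code) to generate hypothetical father-son pairs, measured in standard units, with correlation 0.5; among exceptionally tall fathers, the sons are on average noticeably less far above average.

```python
import numpy as np

# Hypothetical data, not Pearson's: father-son height pairs in standard
# units with correlation 0.5.
rng = np.random.default_rng(0)
n = 100_000
fathers = rng.standard_normal(n)
sons = 0.5 * fathers + np.sqrt(1 - 0.5 ** 2) * rng.standard_normal(n)

tall = fathers > 2                    # exceptionally tall fathers
avg_father = fathers[tall].mean()     # well above 2 standard units
avg_son = sons[tall].mean()           # about half as far above average
print(avg_son < avg_father)           # True: the regression effect
```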


The figure below shows the scatter plot of the data when both variables are measured in standard units. The red line shows equal standard units and has a slope of 1, but it does not match the angle of the cloud of points. The blue line, which does follow the angle of the cloud, is called the regression line, named for the "regression to mediocrity" it predicts.

heights_standard = Table().with_columns([
    'father (standard units)', standard_units(heights['father']),
    'son (standard units)', standard_units(heights['son'])
])

heights_standard.scatter(0, fit_line=True, s=10)

_ = plots.plot([-3, 3], [-3, 3], color='r', lw=3)

The scatter method of a Table draws this regression line for us when called with fit_line=True. This line passes through the result of multiplying father heights (in standard units) by r.

father_standard = heights_standard.column(0)

heights_standard.with_column('father (standard units) * r', father_standard * r)


Another interpretation of this line is that it passes through average values for slices of the population of sons. To see this relationship, we can round the father heights each to the nearest unit, then average the heights of all sons associated with these rounded values. The green line is the regression line for these data, and it passes close to all of the yellow points, which are mean heights of sons (in standard units).

rounded = heights_standard.with_column('father (standard units)', np.round(heights_standard.column(0)))

rounded.join('father (standard units)', rounded.group(0, np.average)).scatter(0, s=80)

_ = plots.plot([-3, 3], [-3*r, 3*r], color='g', lw=2)


Karl Pearson used the observation of the regression effect in the data above, as well as in other data provided by Galton, to develop the formal calculation of the correlation coefficient. That is why r is sometimes called Pearson's correlation.

The regression line in original units¶

As we saw in the last section for football shaped scatter plots, when the variables x and y are measured in standard units, the best straight line for estimating y based on x has slope r and passes through the origin. Thus the equation of the regression line can be written as:

estimate of y, in standard units = r × (x, in standard units)

That is,

(estimate of y − mean of y) / (SD of y) = r × (x − mean of x) / (SD of x)

The equation can be converted into the original units of the data, either by rearranging this equation algebraically, or by labeling some important features of the line both in standard units and in the original units.


Calculation of the slope and intercept¶

The regression line is also commonly expressed as a slope and an intercept, where an estimate for y is computed from an x value using the equation

estimate of y = slope × x + intercept

The calculations of the slope and intercept of the regression line can be derived from the equation above.


def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y)) / np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))

[slope(heights, 'father', 'son'), intercept(heights, 'father', 'son')]

[0.51400591254559247, 33.892800540661682]

It is worth noting that the intercept of approximately 33.89 inches is not intended as an estimate of the height of a son whose father is 0 inches tall. There is no such son and no such father. The intercept is merely a geometric or algebraic quantity that helps define the line. In general, it is not a good idea to extrapolate, that is, to make estimates outside the range of the available data. It is certainly not a good idea to extrapolate as far away as 0 is from the heights of the fathers in the study.

It is also worth noting that the slope is not r! Instead, it is r multiplied by the ratio of the standard deviations.

correlation(heights, 'father', 'son')

0.50116268080759108
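This relationship can be checked numerically. The sketch below uses plain numpy on synthetic data (hypothetical heights, not the Pearson table) to confirm that the least-squares slope equals r multiplied by the ratio of standard deviations, and is not r itself:

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
x = rng.normal(68, 2, size=1000)                  # hypothetical father heights
y = 35 + 0.5 * x + rng.normal(0, 2, size=1000)    # hypothetical son heights

r = np.corrcoef(x, y)[0, 1]
formula_slope = r * np.std(y) / np.std(x)         # r * SD(y) / SD(x)
fit_slope, fit_intercept = np.polyfit(x, y, 1)    # least-squares line

print(np.isclose(formula_slope, fit_slope))       # the formula matches the fit
print(np.isclose(r, fit_slope))                   # but the slope is not r
```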

Fitted values¶

We can also estimate the height of every son in the data using the slope and intercept. The estimated values of y are called the fitted values. They all lie on a straight line. To calculate them, take a father's height, multiply it by the slope of the regression line, and add the intercept. In other words, calculate the height of the regression line at the given value of x.


def fit(table, x, y):
    """Return the height of the regression line at each x value."""
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

fitted = heights.with_column('son (fitted)', fit(heights, 'father', 'son'))

fitted

father  son  son (fitted)

65 59.8 67.3032

63.3 63.2 66.4294

65 63.3 67.3032

65.8 62.8 67.7144

61.1 64.3 65.2986

63 64.2 66.2752

65.4 64.1 67.5088

64.7 64 67.149

66.1 64.6 67.8686

67 64 68.3312

... (1068 rows omitted)

In original units, we can see that these fitted values fall on a line.

fitted.scatter(0, s=10)


Moreover, this line (in green below) passes near the average heights of sons for slices of the population. In this case, we round the heights of fathers to the nearest inch to construct the slices.

rounded = heights.with_column('father', np.round(heights.column('father')))

rounded.join('father', rounded.group(0, np.average)).scatter(0, s=80)

_ = plots.plot(fitted.column('father'), fitted.column('son (fitted)'), color='g', lw=2)


Prediction¶

In the last section, we developed the concepts of correlation and regression as ways to describe data. We will now see how these concepts can become powerful tools for prediction, when used appropriately. In particular, given paired observations of two quantities and some additional observations of only the first quantity, we can infer typical values for the second quantity.

Assumptions of randomness: a "regression model"¶

Prediction is possible if we believe that a scatter plot reflects the underlying relation between the two variables being plotted, but does not specify the relation completely. For example, a scatter plot of the heights of fathers and sons shows us the precise relation between the two variables in one particular sample of men; but we might wonder whether that relation holds true, or almost true, among all fathers and sons in the population from which the sample was drawn, or indeed among fathers and sons in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as models. Sets of assumptions about randomness in roughly linear scatter plots are called regression models.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the signal that we would like to identify. However, we are not able to see the line clearly, because there is additional variability in the data beyond the relation. What we see are points that are scattered around the line. For each of the points, the linear signal that relates the two variables has been combined with some other source of variability that has nothing to do with the relation at all. This random noise obfuscates the linear signal, and it is our inferential goal to identify the signal despite the noise.

In order to make these statements more precise, we will use the np.random.normal function, which generates samples from a normal distribution with mean 0 and standard deviation 1. These samples correspond to a bell-shaped distribution expressed in standard units.

np.random.normal()


-0.6423537870160524

samples = Table('x')

for i in np.arange(10000):
    samples.append([np.random.normal()])

samples.hist(0, bins=np.arange(-4, 4, 0.1))

The regression model specifies that the points in a scatter plot, measured in standard units, are generated at random as follows.

The x values are sampled from the standard normal distribution.

For each x, its corresponding y is sampled using the signal_and_noise function below, which takes x and the correlation coefficient r.

def signal_and_noise(x, r):
    return r * x + np.random.normal() * (1 - r**2)**0.5
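As a sanity check on this form of the model, pairs generated as r·x plus noise scaled by √(1 − r²) have correlation close to r and standard deviation close to 1. A plain-numpy sketch (outside the book's Table code):

```python
import numpy as np

# Generate many (x, y) pairs from the regression model with r = 0.5.
rng = np.random.default_rng(7)
r = 0.5
x = rng.standard_normal(100_000)
y = r * x + rng.standard_normal(100_000) * (1 - r ** 2) ** 0.5

print(round(np.corrcoef(x, y)[0, 1], 2))   # close to r
print(round(np.std(y), 2))                 # close to 1: y is in standard units
```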

For example, if we choose r to be one half, then the following scatter diagram might result.


def regression_model(r, sample_size):
    pairs = Table(['x', 'y'])
    for i in np.arange(sample_size):
        x = np.random.normal()
        y = signal_and_noise(x, r)
        pairs.append([x, y])
    return pairs

regression_model(0.5, 1000).scatter('x', 'y')

Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below draws both the true line used to generate the data (green) and the regression line (blue) found by analyzing the pairs that were generated.


def compare(true_r, sample_size):
    pairs = regression_model(true_r, sample_size)
    estimated_r = correlation(pairs, 'x', 'y')
    pairs.scatter('x', 'y', fit_line=True)
    plt.plot([-3, 3], [-3*true_r, 3*true_r], color='g', lw=2)
    print("The true r is", true_r, "and the estimated r is", estimated_r)

compare(0.5, 1000)

The true r is 0.5 and the estimated r is 0.459672161067

With 1000 samples, we see that the true line and the regression line are quite similar. With fewer samples, the regression estimate of the true line that generated the data may be quite different.

compare(0.5, 50)

The true r is 0.5 and the estimated r is 0.627329170668


With very few samples, the regression line may have a clear positive or negative slope, even when the true relationship has a correlation of 0.

compare(0, 10)

The true r is 0 and the estimated r is 0.501582566452


With real data, we will never observe the true line. In fact, regression is often applied when we are uncertain whether the true relation between two quantities is linear at all. What the simulation shows is that if the regression model looks plausible, and we have a large sample, then the regression line is a good approximation to the true line.

Predictions¶

Here is an example where the regression model can be used to make predictions.

The data are a subset of the information gathered in a randomized controlled trial about treatments for Hodgkin's disease. Hodgkin's disease is a cancer that typically affects young people. The disease is curable but the treatment is very harsh. The purpose of the trial was to come up with a dosage that would cure the cancer but minimize the adverse effects on the patients.

The table hodgkins contains data on the effect that the treatment had on the lungs of 22 patients. The columns are:

height in cm
a measure of radiation to the mantle (neck, chest, underarms)
a measure of chemotherapy
a score of the health of the lungs at baseline, that is, at the start of the treatment; higher scores correspond to more healthy lungs
the same score of the health of the lungs, 15 months after treatment


hodgkins = Table.read_table('hodgkins.csv')

hodgkins.show()

height rad chemo base month15

164 679 180 160.57 87.77

168 311 180 98.24 67.62

173 388 239 129.04 133.33

157 370 168 85.41 81.28

160 468 151 67.94 79.26

170 341 96 150.51 80.97

163 453 134 129.88 69.24

175 529 264 87.45 56.48

185 392 240 149.84 106.99

178 479 216 92.24 73.43

179 376 160 117.43 101.61

181 539 196 129.75 90.78

173 217 204 97.59 76.38

166 456 192 81.29 67.66

170 252 150 98.29 55.51

165 622 162 118.98 90.92

173 305 213 103.17 79.74

174 566 198 94.97 93.08

173 322 119 85 41.96

173 270 160 115.02 81.12

183 259 241 125.02 97.18

188 238 252 137.43 113.2

It is evident that the patients' lungs were less healthy 15 months after the treatment than at baseline. At 36 months, they did recover most of their lung function, but those data are not part of this table.

The scatter plot below shows that taller patients had higher scores at 15 months.


hodgkins.scatter('height', 'month15', fit_line=True)

The scatter plot looks roughly linear. Let us assume that the regression model holds. Under that assumption, the regression line can be used to predict the 15-month score of a new patient, based on the patient's height. The prediction will be good provided the assumptions of the regression model are justified for these data, and provided the new patient is similar to those in the study.

The slope and intercept functions give us the slope and intercept of the regression line. To predict the 15-month score of a new patient whose height is h cm, we use the following calculation:

predicted 15-month score of a patient who is h cm tall = slope × h + intercept

The predicted 15-month score of a patient who is 173 cm tall is 83.66, and that of a patient who is 163 cm tall is 73.69.

# slope and intercept

a = slope(hodgkins, 'height', 'month15')

b = intercept(hodgkins, 'height', 'month15')


# New patient, 173 cm tall
# Predicted 15-month score:
a * 173 + b

83.65699801460444

# New patient, 163 cm tall
# Predicted 15-month score:
a * 163 + b

73.694360467072741

The logic of regression prediction can be wrapped in a function predict that takes a table, the two columns of interest, and some new value for x. It predicts the average value for y.

def predict(t, x, y, new_x_value):
    a = slope(t, x, y)
    b = intercept(t, x, y)
    return a * new_x_value + b

predict(hodgkins, 'height', 'month15', 163)

73.694360467072741

Variability of the Prediction¶

As data scientists working under the regression model, we must always remain aware that a sample might have been different. Had the sample been different, the regression line would have been different too, and so would our prediction.


To understand how much a prediction might vary, we can select different samples from the data we have. Below, we select only 15 of the 22 rows of the hodgkins table at random without replacement, then make a prediction.

predict(hodgkins.sample(15), 'height', 'month15', 163)

71.653856125189805

The prediction is different, but how much might we expect it to differ? From a single sample, it is not possible to answer this question. The simulation below draws 1000 samples of 15 rows each, then draws a histogram of the predicted values using each sample.

def sample_predict(new_x_value):
    predictions = Table(['Predictions'])
    for i in np.arange(1000):
        predicted = predict(hodgkins.sample(15), 'height', 'month15', new_x_value)
        predictions.append([predicted])
    predictions.hist(0, bins=np.arange(40, 96, 2))

sample_predict(163)

Using only 15 rows instead of all 22, the prediction could have come out differently from 73.69. This histogram shows how different the prediction might be. Almost all predictions fall between 64 and 82, but a few are outside of this range.
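The same resampling idea can be sketched with plain numpy and made-up data (hypothetical heights and scores, not the hodgkins table): refit the line on random subsamples of 15 points and watch the prediction at a fixed height vary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(170, 7, size=22)             # hypothetical heights in cm
y = -90 + x + rng.normal(0, 12, size=22)    # hypothetical 15-month scores

def predict_from_subsample(new_x):
    idx = rng.choice(len(x), 15, replace=False)   # 15 of the 22 rows
    a, b = np.polyfit(x[idx], y[idx], 1)          # refit the line
    return a * new_x + b

preds = np.array([predict_from_subsample(163) for _ in range(1000)])
print(preds.min() < preds.max())    # the prediction varies across subsamples
```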


For an x value that is closer to the mean of all heights, the predictions from 15 rows are more tightly clustered.

sample_predict(173)

In this histogram of the predictions for a height of 173 cm, we see that almost all predictions fall between 76 and 90. The narrow range 80 to 86 accounts for nearly 70% of the estimates.

Predictions of values outside the typical observations of height vary wildly. A data scientist is rarely justified in extrapolating a linear trend beyond the range of values observed.

sample_predict(150)


Higher-Order Functions¶

The functions we have defined so far compute their outputs directly from their inputs. Their return values can be computed from their arguments. However, some functions require additional data to determine their return values.

Let's start with a simple example. Consider the following three functions that all scale their input by a constant factor.

def double(x):
    return 2 * x

def triple(x):
    return 3 * x

def halve(x):
    return 0.5 * x

[double(4), triple(4), halve(4), double(5), triple(5), halve(5)]

[8, 12, 2.0, 10, 15, 2.5]

Rather than defining each of these functions with a separate def statement, it is possible to generate each of them dynamically by passing in a constant to a generic scale_by function.

def scale_by(a):
    def scale(x):
        return a * x
    return scale

double = scale_by(2)

triple = scale_by(3)

halve = scale_by(0.5)

[double(4), triple(4), halve(4), double(5), triple(5), halve(5)]

[8, 12, 2.0, 10, 15, 2.5]

The body of the scale_by function actually defines a new function, called scale, and then returns it. The return value of scale_by is a function that takes a number x and multiplies it by a. Because scale_by is a function that returns a function, it is called a higher-order function.

The purpose of higher-order functions is to create functions that combine data with behavior. Above, the double function is created by combining the data value 2 with the behavior of scaling by a constant.

A def statement within another def statement behaves in the same way as any other def statement except that:

1. The name of the inner function is only accessible within the outer function's body. For example, it is not possible to use the name scale once scale_by has returned.

2. The body of the inner function can refer to the arguments and names within the outer function. So, for example, the body of the inner scale function can refer to a, which is an argument of the outer function scale_by.

The final line of the outer scale_by function is return scale. This line returns the function that was just defined. The purpose of returning a function is to give it a name. The line

double = scale_by(2)

gives the name double to the particular instance of the scale function that has a assigned to the number 2. Notice that it is possible to keep multiple instances of the scale function around at once, all with different names such as double and triple, and Python keeps track of which value for a to use each time one of these functions is called.

Example: Standard Units¶

We have seen how to convert a list of numbers to standard units by subtracting the mean and dividing by the standard deviation.

def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers)) / np.std(any_numbers)


This function can convert any list of numbers to standard units.

observations = [3, 4, 2, 4, 3, 5, 1, 6]

np.round(standard_units(observations), 3)

array([-0.333, 0.333, -1., 0.333, -0.333, 1., -1.667, 1.667])

How would we answer the question, "what is the number 8 (in original units) converted to standard units?"

Adding 8 to the list of observations and re-computing standard_units does not answer this question, because adding an observation changes the scale!

observations_with_eight = [3, 4, 2, 4, 3, 5, 1, 6, 8]

standard_units(observations_with_eight)

array([-0.5, 0., -1., 0., -0.5, 0.5, -1.5, 1., 2.])

If instead we want to maintain the original scale, we need a function that remembers the original mean and standard deviation. This scenario, where both data and behavior are required to arrive at the answer, is a natural fit for a higher-order function.

def converter_to_standard_units(any_numbers):
    """Return a function that converts to the standard units of any_numbers."""
    mean = np.mean(any_numbers)
    sd = np.std(any_numbers)
    def convert(a_number_in_original_units):
        return (a_number_in_original_units - mean) / sd
    return convert

Now, we can use the original list of numbers to create the conversion function. The mean and standard deviation are stored within the function that is returned, called to_su below. What's more, these numbers are only computed once, no matter how many times we call the function.


observations = [3, 4, 2, 4, 3, 5, 1, 6]

to_su = converter_to_standard_units(observations)

[to_su(2), to_su(3), to_su(8)]

[-1.0, -0.33333333333333331, 3.0]

A complementary function that returns a converter from standard units to original units involves multiplying by the standard deviation, then adding the mean.

def converter_from_standard_units(any_numbers):
    """Return a function that converts from the standard units of any_numbers."""
    mean = np.mean(any_numbers)
    sd = np.std(any_numbers)
    def convert(in_standard_units):
        return in_standard_units * sd + mean
    return convert

from_su = converter_from_standard_units(observations)

from_su(3)

8.0

It should be the case that any number converted to standard units and back is unchanged, as long as the same list is used to create both functions.

from_su(to_su(11))

11.0

Example: Prediction¶

Previously, in order to compute the fitted values for a regression, we first computed the slope and intercept of the regression line. As an alternate implementation, for each x, we can convert x to standard units from the original units of x values, then multiply by r, then convert the result to the original units of y values. We can write down this process as a higher-order function.

def regression_estimator(table, x, y):
    """Return an estimator for y as a function of x."""
    to_su = converter_to_standard_units(table.column(x))
    from_su = converter_from_standard_units(table.column(y))
    r = correlation(table, x, y)
    def estimate(any_x):
        return from_su(r * to_su(any_x))
    return estimate

The following dataset of mothers and babies was collected from the Kaiser Hospital in Oakland, California.

baby = Table.read_table('baby.csv')

baby

Birth Weight  Gestational Days  Maternal Age  Maternal Height  Maternal Pregnancy Weight  Maternal Smoker

120 284 27 62 100 False

113 282 33 64 135 False

128 279 28 64 115 True

108 282 23 67 125 True

136 286 25 62 93 False

138 244 33 62 178 False

132 245 23 65 140 False

120 289 25 62 125 False

143 299 30 66 136 True

140 351 27 68 120 False

... (1164 rows omitted)

Gestational days are the number of days that the mother is pregnant. Most pregnancies last between 250 and 310 days.


baby.hist(1, bins=np.arange(200, 330, 10))

For this collection of typical births, gestational days and birth weight appear to be linearly related.

typical = baby.where(np.logical_and(baby.column(1) >= 250, baby.column(1) <= 310))

typical.scatter(1, 0, fit_line=True)


The function average_weight returns the average weight of babies born after a certain number of gestational days, using the regression line to predict this value. This function is defined within regression_estimator and returned. The returned value is bound to the name average_weight using an assignment statement.

average_weight = regression_estimator(typical, 1, 0)

The average_weight function contains all the information needed to make birth weight predictions, but when it is called, it takes only a single argument: the gestational days. We avoid the need to pass the table and column labels into the average_weight function each time that we call it because those values were passed to regression_estimator, which defined the average_weight function in terms of those values.

average_weight(290)

126.33785752202166

This function is quite flexible. For instance, we can compute fitted values for all gestational days by applying it to the column of gestational days.

fitted = typical.apply(average_weight, 1)

typical.select([1, 0]).with_column('Birth Weight (fitted)', fitted)


Higher-order functions appear often in the context of prediction because estimating a value often involves both data to inform the prediction and a procedure that makes the prediction itself.


Errors¶

Error in the regression estimate¶

Though the average residual is 0, each individual residual is not. Some residuals might be quite far from 0. To get a sense of the amount of error in the regression estimate, we will start with a graphical description of the sense in which the regression line is the "best".

Our example is a dataset that has one point for every chapter of the novel "Little Women." The goal is to estimate the number of characters (that is, letters, punctuation marks, and so on) based on the number of periods. Recall that we attempted to do this in the very first lecture of this course.

little_women = Table.read_table('little_women.csv')

little_women.show(3)

Characters Periods

21759 189

22148 188

20558 231

... (44 rows omitted)

# One point for each chapter
# Horizontal axis: number of periods
# Vertical axis: number of characters (as in a, b, ", ?, etc; not people in the book)

little_women.scatter('Periods', 'Characters')

ComputationalandInferentialThinking

232Errors

correlation(little_women, 'Periods', 'Characters')

0.92295768958548163

The scatter plot is remarkably close to linear, and the correlation is more than 0.92.

The figure below shows the scatter plot and regression line, with four of the errors marked in red.

# Residuals: Deviations from the regression line

lw_slope = slope(little_women, 'Periods', 'Characters')

lw_intercept = intercept(little_women, 'Periods', 'Characters')

print('Slope:', np.round(lw_slope))

print('Intercept:', np.round(lw_intercept))

lw_errors(lw_slope, lw_intercept)

Slope: 87.0

Intercept: 4745.0


Had we used a different line to create our estimates, the errors would have been different. The picture below shows how big the errors would be if we were to use a particularly silly line for estimation.

# Errors: Deviations from a different line

lw_errors(-100, 50000)

Below is a line that we have used before without saying that we were using a line to create estimates. It is the horizontal line at the value "average of y." Suppose you were asked to estimate y and were not told the value of x; then you would use the average of y as your estimate, regardless of the chapter. In other words, you would use the flat line below.

Each error that you would make would then be a deviation from average. The rough size of these deviations is the SD of y.

In summary, if we use the flat line at the average of y to make our estimates, the estimates will be off by about the SD of y.

# Errors: Deviations from the flat line at the average of y

characters_average = np.mean(little_women.column('Characters'))

lw_errors(0, characters_average)

Least Squares¶

If you use any arbitrary line as your line of estimates, then some of your errors are likely to be positive and others negative. To avoid cancellation when measuring the rough size of the errors, we take the mean of the squared errors rather than the mean of the errors themselves. This is exactly analogous to our reason for looking at squared deviations from average, when we were learning how to calculate the SD.

The mean squared error of estimation using a straight line is a measure of roughly how big the squared errors are; taking the square root yields the root mean squared error, which is in the same units as y.
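The computation can be sketched generically with plain numpy and a few made-up points (not the Little Women counts): square the errors, average them, and take the square root.

```python
import numpy as np

# Made-up data where y is roughly 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def rmse(slope, intercept):
    errors = y - (slope * x + intercept)   # error at each point
    return np.mean(errors ** 2) ** 0.5     # root mean squared error

# A line with a sensible slope beats the flat line at the average of y.
print(rmse(2, 0) < rmse(0, np.mean(y)))
```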


Here is a remarkable fact of mathematics: the regression line minimizes the mean squared error of estimation (and hence also the root mean squared error) among all straight lines. That is why the regression line is sometimes called the "least squares line."

Computing the "best" line.

To get estimates of y based on x, you can use any line you want. Every line has a mean squared error of estimation. "Better" lines have smaller errors. The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.

We can define a function to compute the root mean squared error of any line through the Little Women scatter diagram.

def lw_rmse(slope, intercept):
    lw_errors(slope, intercept)
    x = little_women.column('Periods')
    y = little_women.column('Characters')
    fitted = slope * x + intercept
    mse = np.average((y - fitted) ** 2)
    print("Root mean squared error:", mse ** 0.5)

lw_rmse(0, characters_average)

Root mean squared error: 7019.17593405


The error is indeed much smaller if we choose a slope and intercept near those of the regression line.

lw_rmse(90, 4000)

Root mean squared error: 2715.53910638

But the minimum is achieved using the regression line itself.

lw_rmse(lw_slope, lw_intercept)

Root mean squared error: 2701.69078531

Numerical Optimization¶

We can also define mean_squared_error for an arbitrary dataset. We'll use a higher-order function so that we can try many different lines on the same dataset, simply by passing in their slope and intercept.

def mean_squared_error(table, x, y):
    def for_line(slope, intercept):
        fitted = slope * table.column(x) + intercept
        return np.average((table.column(y) - fitted) ** 2)
    return for_line

The Old Faithful geyser in Yellowstone National Park erupts regularly, but the waiting time between eruptions (in seconds) and the duration of the eruptions in seconds do vary.

faithful = Table.read_table('faithful.csv')

faithful.scatter(1, 0)


It appears that there are two types of eruptions, short and long. The eruptions above 3 seconds appear correlated with waiting time.

long = faithful.where(faithful.column('eruptions') > 3)

long.scatter(1, 0, fit_line=True)

The mse_long function defined below takes a slope and an intercept and returns the mean squared error of a linear predictor of eruption size from the waiting time.


mse_long = mean_squared_error(long, 1, 0)

mse_long(0, 4) ** 0.5

0.50268461001194098

If we experiment with different values, we can find a low-error slope and intercept through trial and error.

mse_long(0.01, 3.5) ** 0.5

0.39143872353883735

mse_long(0.02, 2.7) ** 0.5

0.38168519564089542

The minimize function can be used to find the arguments of a function for which the function returns its minimum value. Python uses a similar trial-and-error approach, following the changes that lead to incrementally lower output values.

a, b = minimize(mse_long)

The root mean squared error of the minimal slope and intercept is smaller than any of the values we considered before.

mse_long(a, b) ** 0.5

0.38014612988504093

In fact, these values a and b are the same as the values returned by the slope and intercept functions we developed based on the correlation coefficient. We see small deviations due to the inexact nature of minimize, but the values are essentially the same.


print("slope:", slope(long, 1, 0))

print("a:", a)

print("intercept:", intercept(long, 1, 0))

print("b:", b)

slope: 0.0255508940715

a: 0.0255496

intercept: 2.24752334164

b: 2.247629

Therefore, we have found not only that the regression line minimizes mean squared error, but also that minimizing mean squared error gives us the regression line. The regression line is the only line that minimizes mean squared error.

Residuals¶

The amount of error in each of these regression estimates is the difference between the observed value of y and its estimate. These errors are called residuals. Some residuals are positive. These correspond to points that are above the regression line – points for which the regression line under-estimates y. Negative residuals correspond to the line over-estimating values of y.

waiting = long.column('waiting')

eruptions = long.column('eruptions')

fitted = a * waiting + b

res = long.with_columns([
    'eruptions (fitted)', fitted,
    'residuals', eruptions - fitted
])

res


eruptions  waiting  eruptions (fitted)  residuals

3.6 79 4.26605 -0.666047

3.333 74 4.1383 -0.805299

4.533 85 4.41934 0.113655

4.7 88 4.49599 0.204006

3.6 85 4.41934 -0.819345

4.35 85 4.41934 -0.069345

3.917 84 4.3938 -0.476795

4.2 78 4.2405 -0.0404978

4.7 83 4.36825 0.331754

4.8 84 4.3938 0.406205

... (165 rows omitted)

As with deviations from average, the positive and negative residuals exactly cancel each other out. So the average (and sum) of the residuals is 0. Because we found a and b by numerically minimizing the root mean squared error rather than computing them exactly, the sum of residuals is slightly different from zero in this case.

sum(res.column('residuals'))

-0.00037579999996140145
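The cancellation can be confirmed with an exact least-squares fit. The sketch below uses plain numpy and made-up data (not the geyser table); with np.polyfit, the residuals sum to zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(size=200)

a, b = np.polyfit(x, y, 1)          # exact least-squares slope and intercept
residuals = y - (a * x + b)
print(abs(residuals.sum()) < 1e-8)  # the residuals cancel almost exactly
```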

A residual plot is a scatter plot of the residuals versus the fitted values. The residual plot of a good regression looks like the one below: a formless cloud with no pattern, centered around the horizontal axis. It shows that there is no discernible non-linear pattern in the original scatter plot.

res.scatter(2, 3, fit_line=True)


By contrast, suppose we had attempted to fit a regression line to the entire set of eruptions of Old Faithful in the original dataset.

faithful.scatter(0, 1, fit_line=True)

This line does not pass through the average value of vertical strips. We may be able to see this fact from the scatter alone, but it is dramatically clear by visualizing the residual scatter.


def residual_plot(table, x, y):
    fitted = fit(table, x, y)
    residuals = table.column(y) - fitted
    res_table = Table().with_columns([
        'fitted', fitted,
        'residuals', residuals])
    res_table.scatter(0, 1, fit_line=True)

residual_plot(faithful, 1, 0)

The diagonal stripes show a non-linear pattern in the data. In this case, we would not expect the regression line to provide low-error estimates.

Example: SAT Scores¶

Residual plots can be useful for spotting non-linearity in the data, or other features that weaken the regression analysis. For example, consider the SAT data of the previous section, and suppose you try to estimate the Combined score based on Participation Rate.

sat = Table.read_table('sat2014.csv')

sat


State  Participation Rate  Critical Reading  Math  Writing  Combined

North Dakota  2.3  612  620  584  1816
Illinois  4.6  599  616  587  1802
Iowa  3.1  605  611  578  1794
South Dakota  2.9  604  609  579  1792
Minnesota  5.9  598  610  578  1786
Michigan  3.8  593  610  581  1784
Wisconsin  3.9  596  608  578  1782
Missouri  4.2  595  597  579  1771
Wyoming  3.3  590  599  573  1762
Kansas  5.3  591  596  566  1753

... (41 rows omitted)

sat.scatter('Participation Rate', 'Combined', s=8)

The relation between the variables is clearly non-linear, but you might be tempted to fit a straight line anyway, especially if you never looked at a scatter diagram of the data.


sat.scatter('Participation Rate', 'Combined', s=8, fit_line=True)

The points in the scatter plot start out above the regression line, then are consistently below the line, then above, then below. This pattern of non-linearity is more clearly visible in the residual plot.

residual_plot(sat, 'Participation Rate', 'Combined')

_ = plt.title('Residual plot of a bad regression')


This residual plot is not a formless cloud; it shows a non-linear pattern, and is a signal that linear regression should not have been used for these data.

The Size of Residuals

We will conclude by observing several relationships between quantities we have already identified. We'll illustrate the relationships using the long eruptions of Old Faithful, but they indeed hold for any collection of paired observations.

Fact 1: The root mean squared error of a line that has zero slope and an intercept at the average of y is the standard deviation of y.

mse_long(0, np.average(eruptions)) ** 0.5

0.40967604587437784

np.std(eruptions)

0.40967604587437784

By contrast, the standard deviation of fitted values is smaller.

eruptions_fitted = fit(long, 'waiting', 'eruptions')

np.std(eruptions_fitted)

0.15271994814407569

Fact 2: The ratio of the standard deviation of the fitted values to the standard deviation of the y values is |r|.

r = correlation(long, 'waiting', 'eruptions')

print('r:', r)

print('ratio:', np.std(eruptions_fitted) / np.std(eruptions))

r: 0.372782225571

ratio: 0.372782225571


Notice the absolute value of r in the formula above. For the heights of fathers and sons, the correlation is positive and so there is no difference between using r and using its absolute value. However, the result is true for variables that have negative correlation as well, provided we are careful to use the absolute value |r| instead of r.

The regression line does a better job of estimating eruption lengths than a zero-slope line. Thus, the rough size of the errors made using the regression line must be smaller than that using a flat line. In other words, the SD of the residuals must be smaller than the overall SD of y.

eruptions_residuals = eruptions - eruptions_fitted

np.std(eruptions_residuals)

0.38014612980028634

Fact 3: The SD of the residuals is sqrt(1 - r^2) times the SD of y.

np.std(eruptions) * (1 - r**2) ** 0.5

0.38014612980028639

Fact 4: The variance of the fitted values plus the variance of the residuals gives the variance of y.

np.std(eruptions_fitted)**2 + np.std(eruptions_residuals)**2

0.16783446256326531

np.std(eruptions)**2

0.16783446256326534
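The four facts above can be checked numerically on any collection of paired observations. Here is a minimal plain-Python sketch on a small made-up dataset (the lists x and y are hypothetical, and the helper functions simply mirror the standard definitions; they are not part of the text's toolkit):

```python
import math

# Helpers mirroring the standard (population) definitions.
def mean(a):
    return sum(a) / len(a)

def sd(a):
    m = mean(a)
    return math.sqrt(sum((v - m) ** 2 for v in a) / len(a))

def correlation(x, y):
    mx, my = mean(x), mean(y)
    return sum((u - mx) * (v - my) for u, v in zip(x, y)) / (len(x) * sd(x) * sd(y))

def fitted_values(x, y):
    # Least-squares regression line: slope = r * SD(y) / SD(x).
    r = correlation(x, y)
    slope = r * sd(y) / sd(x)
    intercept = mean(y) - slope * mean(x)
    return [slope * v + intercept for v in x]

# Made-up paired data, standing in for the Old Faithful columns.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

fitted = fitted_values(x, y)
residuals = [v - f for v, f in zip(y, fitted)]
r = correlation(x, y)

# Fact 1: RMSE of the flat line at the average of y equals SD(y).
rmse_flat = math.sqrt(mean([(v - mean(y)) ** 2 for v in y]))
print(abs(rmse_flat - sd(y)) < 1e-12)                                  # True
# Fact 2: SD(fitted) / SD(y) equals |r|.
print(abs(sd(fitted) / sd(y) - abs(r)) < 1e-12)                        # True
# Fact 3: SD(residuals) equals sqrt(1 - r^2) * SD(y).
print(abs(sd(residuals) - math.sqrt(1 - r ** 2) * sd(y)) < 1e-12)      # True
# Fact 4: Var(fitted) + Var(residuals) equals Var(y).
print(abs(sd(fitted) ** 2 + sd(residuals) ** 2 - sd(y) ** 2) < 1e-12)  # True
```

Any other paired dataset would do; the identities hold exactly (up to rounding) whenever the fitted values come from the least-squares line.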


Multiple Regression

Now that we have explored ways to use multiple features to predict a categorical variable, it is natural to study ways of using multiple predictor variables to predict a quantitative variable. A commonly used method to do this is called multiple regression.

We will start with an example to review some fundamental aspects of simple regression, that is, regression based on one predictor variable.

Example: Simple Regression

Suppose that our goal is to use regression to estimate the height of a basset hound based on its weight, using a sample that looks consistent with the regression model. Suppose the observed correlation r is 0.5, and that the summary statistics for the two variables are as in the table below:

          average     SD
height    14 inches   2 inches
weight    50 pounds   5 pounds

To calculate the equation of the regression line, we need the slope and the intercept.

slope = r × (SD of height) / (SD of weight) = 0.5 × 2/5 = 0.2 inches per pound

intercept = average height − slope × average weight = 14 − 0.2 × 50 = 4 inches

The equation of the regression line allows us to calculate the estimated height, in inches, based on a given weight in pounds:

estimated height = 0.2 × weight + 4

The slope of the line measures the increase in the estimated height per unit increase in weight. The slope is positive, and it is important to note that this does not mean that we think basset hounds get taller if they put on weight. The slope reflects the difference in the average heights of two groups of dogs that are 1 pound apart in weight. Specifically,


consider a group of dogs whose weight is w pounds, and the group whose weight is w + 1 pounds. The second group is estimated to be 0.2 inches taller, on average. This is true for all values of w in the sample.

In general, the slope of the regression line can be interpreted as the average increase in y per unit increase in x. Note that if the slope is negative, then for every unit increase in x, the average of y decreases.
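The arithmetic above is short enough to carry out directly. A minimal sketch, using only the summary statistics from the table (no actual basset hound data is involved; the function name estimated_height is just for illustration):

```python
# Slope and intercept from summary statistics alone.
r = 0.5
avg_height, sd_height = 14, 2   # inches
avg_weight, sd_weight = 50, 5   # pounds

slope = r * sd_height / sd_weight            # 0.2 inches per pound
intercept = avg_height - slope * avg_weight  # 4 inches

def estimated_height(weight):
    return slope * weight + intercept

# A dog at the average weight is estimated to have the average height.
print(estimated_height(50))   # 14.0
# Two groups one pound apart differ by the slope, about 0.2 inches.
print(estimated_height(51) - estimated_height(50))
```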

Multiple Predictors

In multiple regression, more than one predictor variable is used to estimate y. For example, a Dartmouth study of undergraduates collected information about their use of an online course forum and their GPA. We might want to predict a student's GPA based on the number of days that they visit the course forum and how many times they answer someone else's question. Then the multiple regression model would involve two slopes, an intercept, and random errors as before:

GPA = slope1 × (days online) + slope2 × (answers) + intercept + error

Our goal would be to find the estimated GPA using the best estimates of the two slopes and the intercept; as before, the "best" estimates are those that minimize the mean squared error of estimation.

To start off, we will investigate the dataset, which is described online. Each row represents a student. In addition to the student's GPA, the row tallies

days online: The number of days on which the student viewed the online forum
views: The number of posts viewed
contributions: The number of contributions, including posts and follow-up discussions
questions: The number of questions posted
notes: The number of notes posted
answers: The number of answers posted

grades = Table.read_table('grades_and_piazza.csv')

grades


GPA     days online   views   contributions   questions   notes   answers
2.863   29            299     5               1           1       0
3.505   57            299     0               0           0       0
3.029   27            101     1               1           0       0
3.679   67            301     1               0           0       0
3.474   43            201     12              1           0       0
3.705   67            308     45              22          0       5
3.806   36            171     20              4           3       4
3.667   82            300     26              11          0       3
3.245   44            127     6               1           1       0
3.293   35            259     16              13          1       0
... (20 rows omitted)

Correlation Matrix

Perhaps we wish to predict GPA based on forum usage. A natural first step is to see which variables are correlated with the GPA. Here is the correlation matrix. The to_df method generates a Pandas dataframe containing the same data as the table, and its corr method generates a matrix of correlations for each pair of columns.

All forum usage variables are correlated with GPA, but the notes correlation has a very small magnitude.

grades.to_df().corr()

                 GPA        days online   views      contributions   questions
GPA               1.000000   0.684905      0.444175   0.427897        0.409212
days online       0.684905   1.000000      0.654557   0.448319        0.435269
views             0.444175   0.654557      1.000000   0.426406        0.361002
contributions     0.427897   0.448319      0.426406   1.000000        0.857981
questions         0.409212   0.435269      0.361002   0.857981        1.000000
notes            -0.160604  -0.230839      0.065627   0.295661       -0.006365
answers           0.440382   0.502810      0.365010   0.702679        0.515661

(remaining columns not shown)


To start off, let us perform the simple regression of GPA on just days online, the count with the strongest correlation with GPA, at an r of 0.68. Here is the scatter diagram and regression line.

grades.scatter('days online', 'GPA', fit_line=True)

We can immediately see that the relation is not entirely linear, and a residual plot highlights this fact. One issue is that the GPA is never above 4.0, so the model assumption that y values are bell shaped does not hold in this case.

residual_plot(grades, 'days online', 'GPA')

Nonetheless, the regression line does capture some of the pattern in the data, and so we will continue to work with it, noting that our predictions may not be entirely justified by the regression model.


The root mean squared error is a bit above 1/4 of a GPA point when predicting GPA from only the days online.

grades_days_mse = mean_squared_error(grades, 'days online', 'GPA')

a, b = minimize(grades_days_mse)

grades_days_mse(a, b) ** 0.5

0.28494512878662448

Two Predictors

Perhaps more information can improve our prediction. The number of answers is also correlated with GPA. When using it alone as a predictor, the root mean squared error is greater because the correlation has lower magnitude than that of days online.

grades_answers_mse = mean_squared_error(grades, 'answers', 'GPA')

a, b = minimize(grades_answers_mse)

grades_answers_mse(a, b) ** 0.5

0.35110518920562062

However, the two variables can be used together. First, we define the mean squared error in terms of a line that has two slopes, one for each variable, as well as an intercept.

def grades_days_answers_mse(a_days, a_answers, b):
    fitted = a_days * grades.column('days online') + \
             a_answers * grades.column('answers') + \
             b
    y = grades.column('GPA')
    return np.average((y - fitted) ** 2)

minimize(grades_days_answers_mse)

array([ 0.0157683,  0.0301254,  2.6031789])


The minimize function returns three values: the slope for days online, the slope for answers, and the intercept. Together, these three values describe a linear prediction function. For example, someone who spent 50 days online and answered 5 questions is predicted to have a GPA of about 3.54.

np.round(0.0157683 * 50 + 0.0301254 * 5 + 2.6031789, 2)

3.54

The root mean squared error of this predictor is a bit smaller than that of either single-variable predictor we had before!

a_days, a_answers, b = minimize(grades_days_answers_mse)

grades_days_answers_mse(a_days, a_answers, b) ** 0.5

0.28161529539241298
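The minimize function in the text does a numerical search, but for mean squared error the minimizing slopes and intercept can also be found exactly from the normal equations, a small linear system. Below is a plain-Python sketch for the two-predictor case on made-up data; the names x1, x2, solve3, and fit_two_predictors are hypothetical illustrations, not part of the text's toolkit.

```python
def solve3(A, v):
    # Gaussian elimination with partial pivoting for a 3x3 system A x = v.
    A = [row[:] for row in A]
    v = v[:]
    n = 3
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (v[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def fit_two_predictors(x1, x2, y):
    # Normal equations: one row per column of the design "matrix"
    # (the two predictors plus a column of ones for the intercept).
    n = len(y)
    cols = [x1, x2, [1.0] * n]
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    v = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve3(A, v)   # [slope for x1, slope for x2, intercept]

# Made-up data with a known linear relationship y = 2*x1 + 3*x2 + 1.
x1 = [1, 2, 3, 4, 5]
x2 = [5, 3, 8, 1, 2]
y = [2 * a + 3 * b + 1 for a, b in zip(x1, x2)]
print(fit_two_predictors(x1, x2, y))   # approximately [2.0, 3.0, 1.0]
```

A numerical minimizer and the normal equations agree on this problem because the mean squared error is a smooth bowl-shaped function of the slopes and intercept, with a single minimum.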

Combining all variables

We can combine all information about the course forum in order to attempt to find an even better predictor.


def fit_grades(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b):
    return a_days * grades.column('days online') + \
           a_views * grades.column('views') + \
           a_contributions * grades.column('contributions') + \
           a_questions * grades.column('questions') + \
           a_notes * grades.column('notes') + \
           a_answers * grades.column('answers') + \
           b

def grades_all_mse(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b):
    fitted = fit_grades(a_days, a_views, a_contributions, a_questions, a_notes, a_answers, b)
    y = grades.column('GPA')
    return np.average((y - fitted) ** 2)

minimize(grades_all_mse)

array([  1.43703000e-02,  -8.15000000e-05,   6.50720000e-03,
        -1.72780000e-03,  -4.04014000e-02,   1.59645000e-02,
         2.65375470e+00])

Using all of this information, we have constructed a predictor with an even smaller root mean squared error. The * below passes all of the return values of the minimize function as arguments to grades_all_mse.

grades_all_mse(*minimize(grades_all_mse)) ** 0.5

0.27783237243108722

All of this information did not improve our predictor very much. The root mean squared error decreased only from about 0.285 for the best single-variable predictor to about 0.278 when using all of the available information.

Definition of R^2, consistent with our old r

When we studied simple regression, we had noted that

(SD of fitted values) / (SD of y) = |r|


Let us use our old functions to compute the fitted values and confirm that this is true for our example:

fitted = fit(grades, 'days online', 'GPA')

np.std(fitted) / np.std(grades.column('GPA'))

0.68490480299828194

Because variance is the square of the standard deviation, we can say that

(variance of fitted values) / (variance of y) = r^2

Notice that this way of thinking about r^2 involves only the estimated values and the observed values, not the number of predictor variables. Therefore, it motivates the definition of multiple R^2:

R^2 = (variance of fitted values) / (variance of y)

It is a fact of mathematics that this quantity is always between 0 and 1. With multiple predictor variables, there is no clear interpretation of a sign attached to the square root of R^2. Some of the predictors might be positively associated with y, others negatively. An overall measure of the fit is provided by R^2.

fitted = fit_grades(*minimize(grades_all_mse))

np.var(fitted) / np.var(grades.column('GPA'))

0.49525855565521787

It is not always wise to maximize R^2 by including many variables, but we will address this concern later in the text.

We can also view the residual plot of multiple regression, just as before. The horizontal axis shows fitted values and the vertical axis shows errors. Again, there is some structure to this plot, so perhaps the relation between these variables is not entirely linear. In this case, the structure is minor enough that the regression model is a reasonable choice of predictor.
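For the one-predictor case, the identity R^2 = r^2 can be checked directly in plain Python on made-up data (the helper functions below simply mirror the standard definitions and are not part of the text's toolkit):

```python
import math

def mean(a):
    return sum(a) / len(a)

def variance(a):
    m = mean(a)
    return sum((v - m) ** 2 for v in a) / len(a)

def correlation(x, y):
    mx, my = mean(x), mean(y)
    return sum((u - mx) * (v - my) for u, v in zip(x, y)) / (
        len(x) * math.sqrt(variance(x) * variance(y)))

# Made-up paired data and its least-squares fitted values.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]
r = correlation(x, y)
slope = r * math.sqrt(variance(y) / variance(x))
intercept = mean(y) - slope * mean(x)
fitted = [slope * v + intercept for v in x]

r_squared = variance(fitted) / variance(y)
print(abs(r_squared - r ** 2) < 1e-12)   # True: for one predictor, R^2 is just r^2
```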


Table().with_columns([
    'fitted', fitted,
    'errors', grades.column('GPA') - fitted
]).scatter(0, 1, fit_line=True)


Classification

Section author: David Wagner

This section will discuss machine learning. Machine learning is a class of techniques for automatically finding patterns in data and using them to draw inferences or make predictions. We're going to focus on a particular kind of machine learning, namely, classification.

Classification is about learning how to make predictions from past examples: we're given some examples where we have been told what the correct prediction was, and we want to learn from those examples how to make good predictions in the future. Here are a few applications where classification is used in practice:

For each order Amazon receives, Amazon would like to predict: is this order fraudulent? They have some information about each order (e.g., its total value, whether the order is being shipped to an address this customer has used before, whether the shipping address is the same as the credit card holder's billing address). They have lots of data on past orders, and they know which of those past orders were fraudulent and which weren't. They want to learn patterns that will help them predict, as new orders arrive, whether those new orders are fraudulent.

Online dating sites would like to predict: are these two people compatible? Will they hit it off? They have lots of data on which matches they've suggested to their customers in the past, and they have some idea which ones were successful. As new customers sign up, they'd like to make predictions about who might be a good match for them.

Doctors would like to know: does this patient have cancer? Based on the measurements from some lab test, they'd like to be able to predict whether the particular patient has cancer. They have lots of data on past patients, including their lab measurements and whether they ultimately developed cancer, and from that, they'd like to try to infer what measurements tend to be characteristic of cancer (or non-cancer) so they can diagnose future patients accurately.

Politicians would like to predict: are you going to vote for them? This will help them focus fundraising efforts on people who are likely to support them, and focus get-out-the-vote efforts on voters who will vote for them. Public databases and commercial databases have a lot of information about most people: e.g., whether they own a home or rent; whether they live in a rich neighborhood or poor neighborhood; their interests and hobbies; their shopping habits; and so on. And political campaigns have surveyed


some voters and found out who they plan to vote for, so they have some examples where the correct answer is known. From this data, the campaigns would like to find patterns that will help them make predictions about all other potential voters.

All of these are classification tasks. Notice that in each of these examples, the prediction is a yes/no question -- we call this binary classification, because there are only two possible predictions. In a classification task, we have a bunch of observations. Each observation represents a single individual or a single situation where we'd like to make a prediction. Each observation has multiple attributes, which are known (e.g., the total value of the order; the voter's annual salary; and so on). Also, each observation has a class, which is the answer to the question we care about (e.g., yes or no; fraudulent or not; etc.).

For instance, with the Amazon example, each order corresponds to a single observation. Each observation has several attributes (e.g., the total value of the order, whether the order is being shipped to an address this customer has used before, and so on). The class of the observation is either 0 or 1, where 0 means that the order is not fraudulent and 1 means that the order is fraudulent. Given the attributes of some new order, we are trying to predict its class.

Classification requires data. It involves looking for patterns, and to find patterns, you need data. That's where the data science comes in. In particular, we're going to assume that we have access to training data: a bunch of observations, where we know the class of each observation. The collection of these pre-classified observations is also called a training set. A classification algorithm is going to analyze the training set, and then come up with a classifier: an algorithm for predicting the class of future observations.

Note that classifiers do not need to be perfect to be useful. They can be useful even if their accuracy is less than 100%. For instance, if the online dating site occasionally makes a bad recommendation, that's OK; their customers already expect to have to meet many people before they'll find someone they hit it off with. Of course, you don't want the classifier to make too many errors -- but it doesn't have to get the right answer every single time.

Chronic kidney disease

Let's work through an example. We're going to work with a dataset that was collected to help doctors diagnose chronic kidney disease (CKD). Each row in the dataset represents a single patient who was treated in the past and whose diagnosis is known. For each patient, we have a bunch of measurements from a blood test. We'd like to find which measurements are most useful for diagnosing CKD, and develop a way to classify future patients as "has CKD" or "doesn't have CKD" based on their blood test results.

Let's load the dataset into a table and look at it.


ckd = Table.read_table('ckd.csv').relabeled('Blood Glucose Random', 'Glucose')

ckd

Age   Blood Pressure   Specific Gravity   Albumin   Sugar   Red Blood Cells   Pus Cell   Pus Cell clumps
48    70               1.005              4         0       normal            abnormal   present
53    90               1.02               2         0       abnormal          abnormal   present
63    70               1.01               3         0       abnormal          abnormal   present
68    80               1.01               3         2       normal            abnormal   present
61    80               1.015              2         0       abnormal          abnormal   not present
48    80               1.025              4         0       normal            abnormal   not present
69    70               1.01               3         4       normal            abnormal   not present
73    70               1.005              0         0       normal            normal     not present
73    80               1.02               2         0       abnormal          abnormal   not present
46    60               1.01               1         0       normal            normal     not present
... (148 rows omitted, remaining columns not shown)

We have data on 158 patients. There are an awful lot of attributes here. The column labelled "Class" indicates whether the patient was diagnosed with CKD: 1 means they have CKD, 0 means they do not have CKD.

Let's look at two columns in particular: the hemoglobin level (in the patient's blood), and the blood glucose level (at a random time in the day; without fasting specially for the blood test). We'll draw a scatterplot, to make it easy to visualize this. Red dots are patients with CKD; blue dots are patients without CKD. What test results seem to indicate CKD?

ckd.scatter('Hemoglobin', 'Glucose', c=ckd.column('Class'))


Suppose Alice is a new patient who is not in the dataset. If I tell you Alice's hemoglobin level and blood glucose level, could you predict whether she has CKD? It sure looks like it! You can see a very clear pattern here: points in the lower-right tend to represent people who don't have CKD, and the rest tend to be folks with CKD. To a human, the pattern is obvious. But how can we program a computer to automatically detect patterns such as this one?

Well, there are lots of kinds of patterns one might look for, and lots of algorithms for classification. But I'm going to tell you about one that turns out to be surprisingly effective. It is called nearest neighbor classification. Here's the idea. If we have Alice's hemoglobin and glucose numbers, we can put her somewhere on this scatterplot; the hemoglobin is her x-coordinate, and the glucose is her y-coordinate. Now, to predict whether she has CKD or not, we find the nearest point in the scatterplot and check whether it is red or blue; we predict that Alice should receive the same diagnosis as that patient.

In other words, to classify Alice as CKD or not, we find the patient in the training set who is "nearest" to Alice, and then use that patient's diagnosis as our prediction for Alice. The intuition is that if two points are near each other in the scatterplot, then the corresponding measurements are pretty similar, so we might expect them to receive the same diagnosis (more likely than not). We don't know Alice's diagnosis, but we do know the diagnosis of all the patients in the training set, so we find the patient in the training set who is most similar to Alice, and use that patient's diagnosis to predict Alice's diagnosis.

The scatterplot suggests that this nearest neighbor classifier should be pretty accurate. Points in the lower-right will tend to receive a "no CKD" diagnosis, as their nearest neighbor will be a blue point. The rest of the points will tend to receive a "CKD" diagnosis, as their nearest neighbor will be a red point. So the nearest neighbor strategy seems to capture our intuition pretty well, for this example.


However, the separation between the two classes won't always be quite so clean. For instance, suppose that instead of hemoglobin levels we were to look at white blood cell count. Look at what happens:

ckd.scatter('White Blood Cell Count', 'Glucose', c=ckd.column('Class'))

As you can see, non-CKD individuals are all clustered in the lower-left. Most of the patients with CKD are above or to the right of that cluster... but not all. There are some patients with CKD who are in the lower left of the above figure (as indicated by the handful of red dots scattered among the blue cluster). What this means is that you can't tell for certain whether someone has CKD from just these two blood test measurements.

If we are given Alice's glucose level and white blood cell count, can we predict whether she has CKD? Yes, we can make a prediction, but we shouldn't expect it to be 100% accurate. Intuitively, it seems like there's a natural strategy for predicting: plot where Alice lands in the scatterplot; if she is in the lower-left, predict that she doesn't have CKD, otherwise predict she has CKD. This isn't perfect -- our predictions will sometimes be wrong. (Take a minute and think it through: for which patients will it make a mistake?) As the scatterplot above indicates, sometimes people with CKD have glucose and white blood cell levels that look identical to those of someone without CKD, so any classifier is inevitably going to make the wrong prediction for them.

Can we automate this on a computer? Well, the nearest neighbor classifier would be a reasonable choice here too. Take a minute and think it through: how will its predictions compare to those from the intuitive strategy above? When will they differ? Its predictions will be pretty similar to our intuitive strategy, but occasionally it will make a different prediction. In


particular, if Alice's blood test results happen to put her right near one of the red dots in the lower-left, the intuitive strategy would predict "not CKD", whereas the nearest neighbor classifier will predict "CKD".

There is a simple generalization of the nearest neighbor classifier that fixes this anomaly. It is called the k-nearest neighbor classifier. To predict Alice's diagnosis, rather than looking at just the one neighbor closest to her, we can look at the 3 points that are closest to her, and use the diagnosis for each of those 3 points to predict Alice's diagnosis. In particular, we'll use the majority value among those 3 diagnoses as our prediction for Alice's diagnosis. Of course, there's nothing special about the number 3: we could use 4, or 5, or more. (It's often convenient to pick an odd number, so that we don't have to deal with ties.) In general, we pick a number k, and our predicted diagnosis for Alice is based on the k patients in the training set who are closest to Alice. Intuitively, these are the k patients whose blood test results were most similar to Alice's, so it seems reasonable to use their diagnoses to predict Alice's diagnosis.

The k-nearest neighbor classifier will now behave just like our intuitive strategy above.
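The majority-vote step described above is tiny on its own. A minimal sketch, assuming the classes of the k nearest neighbors have already been collected into a list (Counter comes from the standard library; nothing here is specific to the CKD data):

```python
from collections import Counter

def majority_class(neighbor_classes):
    # Return the most common class among the neighbors' classes.
    return Counter(neighbor_classes).most_common(1)[0][0]

# With k = 3 and neighbor classes 1, 0, 1, the vote is 2-to-1 for class 1.
print(majority_class([1, 0, 1]))         # 1
print(majority_class([0, 0, 1, 0, 1]))   # 0
```

Picking an odd k, as the text suggests, guarantees one class always wins the vote in a two-class problem.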

Decision boundary

Sometimes a helpful way to visualize a classifier is to map the region of space where the classifier would predict 'CKD', and the region of space where it would predict 'not CKD'. We end up with some boundary between the two, where points on one side of the boundary will be classified 'CKD' and points on the other side will be classified 'not CKD'. This boundary is called the decision boundary. Each different classifier will have a different decision boundary; the decision boundary is just a way to visualize what criteria the classifier is using to classify points.

Banknote authentication

Let's do another example. This time we'll look at predicting whether a banknote (e.g., a $20 bill) is counterfeit or legitimate. Researchers have put together a dataset for us, based on photographs of many individual banknotes: some counterfeit, some legitimate. They computed a few numbers from each image, using techniques that we won't worry about for this course. So, for each banknote, we know a few numbers that were computed from a photograph of it as well as its class (whether it is counterfeit or not). Let's load it into a table and take a look.

banknotes = Table.read_table('banknote.csv')

banknotes


WaveletVar   WaveletSkew   WaveletCurt   Entropy    Class
3.6216       8.6661        -2.8073       -0.44699   0
4.5459       8.1674        -2.4586       -1.4621    0
3.866        -2.6383       1.9242        0.10645    0
3.4566       9.5228        -4.0112       -3.5944    0
0.32924      -4.4552       4.5718        -0.9888    0
4.3684       9.6718        -3.9606       -3.1625    0
3.5912       3.0129        0.72888       0.56421    0
2.0922       -6.81         8.4636        -0.60216   0
3.2032       5.7588        -0.75345      -0.61251   0
1.5356       9.1772        -2.2718       -0.73535   0
... (1362 rows omitted)

Let's look at whether the first two numbers tell us anything about whether the banknote is counterfeit or not. Here's a scatterplot:

banknotes.scatter('WaveletVar', 'WaveletCurt', c=banknotes.column('Class'))

Pretty interesting! Those two measurements do seem helpful for predicting whether the banknote is counterfeit or not. However, in this example you can now see that there is some overlap between the blue cluster and the red cluster. This indicates that there will be some


images where it's hard to tell whether the banknote is legitimate based on just these two numbers. Still, you could use a k-nearest neighbor classifier to predict the legitimacy of a banknote.

Take a minute and think it through: suppose we used a k-nearest neighbor classifier, for some small k. What parts of the plot would the classifier get right, and what parts would it make errors on? What would the decision boundary look like?

The patterns that show up in the data can get pretty wild. For instance, here's what we'd get if we used a different pair of measurements from the images:

banknotes.scatter('WaveletSkew', 'Entropy', c=banknotes.column('Class'))

There does seem to be a pattern, but it's a pretty complex one. Nonetheless, the k-nearest neighbors classifier can still be used and will effectively "discover" patterns out of this. This illustrates how powerful machine learning can be: it can effectively take advantage of patterns that we would not have anticipated, or that we would not have thought to "program into" the computer.

Multiple attributes

So far I've been assuming that we have exactly 2 attributes that we can use to help us make our prediction. What if we have more than 2? For instance, what if we have 3 attributes?


Here's the cool part: you can use the same ideas for this case, too. All you have to do is make a 3-dimensional scatterplot, instead of a 2-dimensional plot. You can still use the k-nearest neighbors classifier, but now computing distances in 3 dimensions instead of just 2. It just works. Very cool!

In fact, there's nothing special about 2 or 3. If you have 4 attributes, you can use the k-nearest neighbors classifier in 4 dimensions. 5 attributes? Work in 5-dimensional space. And no need to stop there! This all works for arbitrarily many attributes; you just work in a very high dimensional space. It gets wicked-impossible to visualize, but that's OK. The computer algorithm generalizes very nicely: all you need is the ability to compute the distance, and that's not hard. Mind-blowing stuff!

For instance, let's see what happens if we try to predict whether a banknote is counterfeit or not using 3 of the measurements, instead of just 2. Here's what you get:

ax = plt.figure(figsize=(8, 8)).add_subplot(111, projection='3d')

ax.scatter(banknotes.column('WaveletSkew'),
           banknotes.column('WaveletVar'),
           banknotes.column('WaveletCurt'),
           c=banknotes.column('Class'))

<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x1093bdef0>


Awesome! With just 2 attributes, there was some overlap between the two clusters (which means that the classifier was bound to make some mistakes for points in the overlap). But when we use these 3 attributes, the two clusters have almost no overlap. In other words, a classifier that uses these 3 attributes will be more accurate than one that only uses the 2 attributes.

This is a general phenomenon in classification. Each attribute can potentially give you new information, so more attributes sometimes help you build a better classifier. Of course, the cost is that now we have to gather more information to measure the value of each attribute, but this cost may be well worth it if it significantly improves the accuracy of our classifier.

To sum up: you now know how to use k-nearest neighbor classification to predict the answer to a yes/no question, based on the values of some attributes, assuming you have a training set with examples where the correct prediction is known. The general roadmap is this:

1. identify some attributes that you think might help you predict the answer to the question;
2. gather a training set of examples where you know the values of the attributes as well as the correct prediction;
3. to make predictions in the future, measure the value of the attributes and then use k-nearest neighbor classification to predict the answer to the question.


Breast cancer diagnosis

Now I want to do a more extended example based on diagnosing breast cancer. I was inspired by Brittany Wenger, who won the Google national science fair three years ago as a 17-year-old high school student. Here's Brittany:

Brittany's science fair project was to build a classification algorithm to diagnose breast cancer. She won grand prize for building an algorithm whose accuracy was almost 99%.

Let's see how well we can do, with the ideas we've learned in this course.

So, let me tell you a little bit about the dataset. Basically, if a woman has a lump in her breast, the doctors may want to take a biopsy to see if it is cancerous. There are several different procedures for doing that. Brittany focused on fine needle aspiration (FNA), because it is less invasive than the alternatives. The doctor gets a sample of the mass, puts it under a microscope, takes a picture, and a trained lab tech analyzes the picture to determine whether it is cancer or not. We get a picture like one of the following:

Unfortunately, distinguishing between benign vs malignant can be tricky. So, researchers have studied using machine learning to help with this task. The idea is that we'll ask the lab tech to analyze the image and compute various attributes: things like the typical size of a cell, how much variation there is among the cell sizes, and so on. Then, we'll try to use this information to predict (classify) whether the sample is malignant or not. We have a training set of past samples from women where the correct diagnosis is known, and we'll hope that our machine learning algorithm can use those to learn how to predict the diagnosis for future samples.

We end up with the following dataset. For the "Class" column, 1 means malignant (cancer); 0 means benign (not cancer).

patients = Table.read_table('breast-cancer.csv').drop('ID')

patients


Clump Thickness   Uniformity of Cell Size   Uniformity of Cell Shape   Marginal Adhesion   Single Epithelial Cell Size   Bare Nuclei   Bland Chromatin
5                 1                         1                          1                   2                             1             3
5                 4                         4                          5                   7                             10            3
3                 1                         1                          1                   2                             2             3
6                 8                         8                          1                   3                             4             3
4                 1                         1                          3                   2                             1             3
8                 10                        10                         8                   7                             10            9
1                 1                         1                          1                   2                             10            3
2                 1                         2                          1                   2                             1             3
2                 1                         1                          1                   2                             1             1
4                 2                         1                          1                   2                             1             2
... (673 rows omitted, remaining columns not shown)

So we have 9 different attributes. I don't know how to make a 9-dimensional scatterplot of all of them, so I'm going to pick two and plot them:

patients.scatter('Bland Chromatin', 'Single Epithelial Cell Size', c=patients.column('Class'))


Oops. That plot is utterly misleading, because there are a bunch of points that have identical values for both the x- and y-coordinates. To make it easier to see all the data points, I'm going to add a little bit of random jitter to the x- and y-values. Here's how that looks:

def randomize_column(a):
    return a + np.random.normal(0.0, 0.09, size=len(a))

Table().with_columns([
    'Bland Chromatin (jittered)',
    randomize_column(patients.column('Bland Chromatin')),
    'Single Epithelial Cell Size (jittered)',
    randomize_column(patients.column('Single Epithelial Cell Size'))
]).scatter(0, 1, c=patients.column('Class'))

For instance, you can see there are lots of samples with chromatin = 2 and epithelial cell size = 2; all non-cancerous.

Keep in mind that the jittering is just for visualization purposes, to make it easier to get a feeling for the data. When we want to work with the data, we'll use the original (unjittered) data.

Applying the k-nearest neighbor classifier to breast cancer diagnosis

We've got a dataset. Let's try out the k-nearest neighbor classifier and see how it does. This is going to be great.


We're going to need an implementation of the k-nearest neighbor classifier. In practice you would probably use an existing library, but it's simple enough that I'm going to implement it myself.

The first thing we need is a way to compute the distance between two points. How do we do this? In 2-dimensional space, it's pretty easy. If we have a point at coordinates (x0, y0) and another at (x1, y1), the distance between them is

distance = sqrt((x0 - x1)^2 + (y0 - y1)^2)

(Where did this come from? It comes from the Pythagorean theorem: we have a right triangle with side lengths x0 - x1 and y0 - y1, and we want to find the length of the diagonal.)

In 3-dimensional space, the formula is

distance = sqrt((x0 - x1)^2 + (y0 - y1)^2 + (z0 - z1)^2)

In n-dimensional space, things are a bit harder to visualize, but I think you can see how the formula generalizes: we sum up the squares of the differences between each individual coordinate, and then take the square root of that. Let's implement a function to compute this distance function for us:

def distance(pt1, pt2):
    total = 0
    for i in np.arange(len(pt1)):
        total = total + (pt1.item(i) - pt2.item(i)) ** 2
    return math.sqrt(total)
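The same formula can be sanity-checked on plain tuples. This variant takes any two equal-length sequences of numbers rather than table rows (the name distance_seq is just for illustration), but computes the identical quantity:

```python
import math

def distance_seq(pt1, pt2):
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pt1, pt2)))

print(distance_seq((0, 0), (3, 4)))        # 5.0 -- the classic 3-4-5 right triangle
print(distance_seq((1, 1, 1), (2, 2, 2)))  # sqrt(3), about 1.732
```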

Next, we're going to write some code to implement the classifier. The input is a patient p who we want to diagnose. The classifier works by finding the k nearest neighbors of p from the training set. So, our approach will go like this:

1. Find the closest k neighbors of p, i.e., the k patients from the training set that are most similar to p.
2. Look at the diagnoses of those k neighbors, and take the majority vote to find the most-common diagnosis. Use that as our predicted diagnosis for p.

So that will guide the structure of our Python code.


To implement the first step, we will compute the distance from each patient in the training set to p, sort them by distance, and take the k closest patients in the training set. The code will make a copy of the table, compute the distance from each patient to p, add a new column to the table with those distances, and then sort the table by distance and take the first k rows. That leads to the following Python code:

def closest(training, p, k):
    ...

def majority(topkclasses):
    ...

def classify(training, p, k):
    kclosest = closest(training, p, k)
    topkclasses = kclosest.select('Class')
    return majority(topkclasses)


def computetablewithdists(training, p):
    dists = np.zeros(training.num_rows)
    attributes = training.drop('Class')
    for i in np.arange(training.num_rows):
        dists[i] = distance(attributes.row(i), p)
    return training.with_column('Distance', dists)

def closest(training, p, k):
    withdists = computetablewithdists(training, p)
    sortedbydist = withdists.sort('Distance')
    topk = sortedbydist.take(np.arange(k))
    return topk

def majority(topkclasses):
    if topkclasses.where('Class', 1).num_rows > topkclasses.where('Class', 0).num_rows:
        return 1
    else:
        return 0

def classify(training, p, k):
    closestk = closest(training, p, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)

Let's see how this works, with our data set. We'll take patient 12 and imagine we're going to try to diagnose them:

patients.take(12)

Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin
5 | 3 | 3 | 3 | 2 | 3 | 4

We can pull out just their attributes (excluding the class), like this:

patients.drop('Class').row(12)


Row(Clump Thickness=5, Uniformity of Cell Size=3, Uniformity of Cell Shape=3, Marginal Adhesion=3, Single Epithelial Cell Size=2, Bare Nuclei=3, Bland Chromatin=4, Normal Nucleoli=4, Mitoses=1)

Let's take $k = 5$. We can find the 5 nearest neighbors:

closest(patients, patients.drop('Class').row(12), 5)

Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin
5 | 3 | 3 | 3 | 2 | 3 | 4
5 | 3 | 3 | 4 | 2 | 4 | 3
5 | 1 | 3 | 3 | 2 | 2 | 2
5 | 2 | 2 | 2 | 2 | 2 | 3
5 | 3 | 3 | 1 | 3 | 3 | 3

3 out of the 5 nearest neighbors have class 1, so the majority is 1 (has cancer) -- and that is the output of our classifier for this patient:

classify(patients, patients.drop('Class').row(12), 5)

1

Awesome! We now have a classification algorithm for diagnosing whether a patient has breast cancer or not, based on the measurements from the lab. Are we done? Shall we give this to doctors to use?

Hold on: we're not done yet. There's an obvious question to answer before we start using this in practice:

How accurate is this method at diagnosing breast cancer?

And that raises a more fundamental issue. How can we measure the accuracy of a classification algorithm?

Measuring accuracy of a classifier

We've got a classifier, and we'd like to determine how accurate it will be. How can we measure that?


Try it out. One natural idea is to just try it on patients for a year, keep records on it, and see how accurate it is. However, this has some disadvantages: (a) we're trying something on patients without knowing how accurate it is, which might be unethical; (b) we have to wait a year to find out whether our classifier is any good. If it's not good enough and we get an idea for an improvement, we'll have to wait another year to find out whether our improvement was better.

Get some more data. We could try to get some more data from other patients whose diagnosis is known, and measure how accurate our classifier's predictions are on those additional patients. We can compare what the classifier outputs against what we know to be true.

Use the data we already have. Another natural idea is to re-use the data we already have: we have a training set that we used to train our classifier, so we could just run our classifier on every patient in the data set and compare what it outputs to what we know to be true. This is sometimes known as testing the classifier on your training set.

How should we choose among these options? Are they all equally good?

It turns out that the third option, testing the classifier on our training set, is fundamentally flawed. It might sound attractive, but it gives misleading results: it will over-estimate the accuracy of the classifier (it will make us think the classifier is more accurate than it really is). Intuitively, the problem is that what we really want to know is how well the classifier has done at "generalizing" beyond the specific examples in the training set; but if we test it on patients from the training set, then we haven't learned anything about how well it would generalize to other patients.

This is subtle, so it might be helpful to try an example. Let's try a thought experiment. Let's focus on the 1-nearest neighbor classifier ($k = 1$). Suppose you trained the 1-nearest neighbor classifier on data from all 683 patients in the data set, and then you tested it on those same 683 patients. How many would it get right? Think it through and see if you can work out what will happen. That's right! The classifier will get the right answer for all 683 patients. Suppose we apply the classifier to a patient from the training set, say Alice. The classifier will look for the nearest neighbor (the most similar patient from the training set), and the nearest neighbor will turn out to be Alice herself (the distance from any point to itself is zero). Thus, the classifier will produce the right diagnosis for Alice. The same reasoning applies to every other patient in the training set.

So, if we test the 1-nearest neighbor classifier on the training set, the accuracy will always be 100%: absolutely perfect. This is true no matter whether there are actually any patterns in the data. But the 100% is a total lie. When you apply the classifier to other patients who were not in the training set, the accuracy could be far worse.
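To see this concretely, here is a minimal, self-contained sketch (a toy 1-nearest-neighbor classifier on made-up 2-D points, not the patients table from this chapter) showing that testing on the training set reports 100% accuracy even when the labels are pure random noise:

```python
import math
import random

def nn1_predict(train_pts, train_labels, p):
    # 1-nearest neighbor: return the label of the closest training point.
    dists = [math.dist(p, q) for q in train_pts]
    return train_labels[dists.index(min(dists))]

random.seed(0)
# Random points with completely random labels: there is no pattern to learn.
pts = [(random.random(), random.random()) for _ in range(50)]
labels = [random.randint(0, 1) for _ in range(50)]

# "Test on the training set": every point's nearest neighbor is itself,
# at distance zero, so every prediction is trivially correct.
correct = sum(nn1_predict(pts, labels, p) == y for p, y in zip(pts, labels))
print(correct / len(pts))  # 1.0, no matter what the labels are
```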


In other words, testing on the training set tells you nothing about how accurate the 1-nearest neighbor classifier will be. This illustrates why testing on the training set is so flawed. This flaw is pretty blatant when you use the 1-nearest neighbor classifier, but don't think that with some other classifier you'd be immune to this problem -- the problem is fundamental and applies no matter what classifier you use. Testing on the training set gives you a biased estimate of the classifier's accuracy. For these reasons, you should never test on the training set.

So what should you do, instead? Is there a more principled approach?

It turns out there is. The approach comes down to: get more data. More specifically, the right solution is to use one data set for training, and a different data set for testing, with no overlap between the two data sets. We call these a training set and a test set.

Where do we get these two data sets from? Typically, we'll start out with some data, e.g., the data set on 683 patients, and before we do anything else with it, we'll split it up into a training set and a test set. We might put 50% of the data into the training set and the other 50% into the test set. Basically, we are setting aside some data for later use, so we can use it to measure the accuracy of our classifier. Sometimes people will call the data that you set aside for testing a hold-out set, and they'll call this strategy for estimating accuracy the hold-out method.

Note that this approach requires great discipline. Before you start applying machine learning methods, you have to take some of your data and set it aside for testing. You must avoid using the test set for developing your classifier: you shouldn't use it to help train your classifier or tweak its settings or for brainstorming ways to improve your classifier. Instead, you should use it only once, at the very end, after you've finalized your classifier, when you want an unbiased estimate of its accuracy.

The effectiveness of our classifier, for breast cancer

OK, so let's apply the hold-out method to evaluate the effectiveness of the $k$-nearest neighbor classifier for breast cancer diagnosis. The data set has 683 patients, so we'll randomly permute the data set and put 342 of them in the training set and the remaining 341 in the test set.

patients = patients.sample(683)  # Randomly permute the rows
trainset = patients.take(range(342))
testset = patients.take(range(342, 683))


We'll train the classifier using the 342 patients in the training set, and evaluate how well it performs on the test set. To make our lives easier, we'll write a function to evaluate a classifier on every patient in the test set:

def evaluate_accuracy(training, test, k):
    testattrs = test.drop('Class')
    numcorrect = 0
    for i in range(test.num_rows):
        # Run the classifier on the ith patient in the test set
        c = classify(training, testattrs.rows[i], k)
        # Was the classifier's prediction correct?
        if c == test.column('Class').item(i):
            numcorrect = numcorrect + 1
    return numcorrect / test.num_rows

Now for the grand reveal -- let's see how we did. We'll arbitrarily use $k = 5$.

evaluate_accuracy(trainset, testset, 5)

0.967741935483871

About 96% accuracy. Not bad! Pretty darn good for such a simple technique.

As a footnote, you might have noticed that Brittany Wenger did even better. What techniques did she use? One key innovation is that she incorporated a confidence score into her results: her algorithm had a way to determine when it was not able to make a confident prediction, and for those patients, it didn't even try to predict their diagnosis. Her algorithm was 99% accurate on the patients where it made a prediction -- so that extension seemed to help quite a bit.

Important takeaways

Here are a few lessons we want you to learn from this.

First, machine learning is powerful. If you had to try to write code to make a diagnosis without knowing about machine learning, you might spend a lot of time by trial-and-error trying to come up with some complicated set of rules that seem to work, and the result might not be very accurate. The $k$-nearest neighbors algorithm automates the entire task for you. And machine learning often lets you make predictions far more accurately than anything you'd come up with by trial-and-error.


Second, you can do it. Yes, you. You can use machine learning in your own work to make predictions based on data. You now know enough to start applying these ideas to new data sets and help others make useful predictions. The techniques are very powerful, but you don't have to have a Ph.D. in statistics to use them.

Third, be careful about how to evaluate accuracy. Use a hold-out set.

There's lots more one can say about machine learning: how to choose attributes, how to choose $k$ or other parameters, what other classification methods are available, how to solve more complex prediction tasks, and lots more. In this course, we've barely even scratched the surface. If you enjoyed this material, you might enjoy continuing your studies in statistics and computer science; courses like Stats 132 and 154 and CS 188 and 189 go into a lot more depth.


Features

Section author: David Wagner


Building a model

So far, we have talked about prediction, where the purpose of learning is to be able to predict the class of new instances. I'm now going to switch to model building, where the goal is to learn a model of how the class depends upon the attributes.

One place where model building is useful is for science: e.g., which genes influence whether you become diabetic? This is interesting and useful in its own right (apart from any applications to predicting whether a particular individual will become diabetic), because it can potentially help us understand the workings of our body.

Another place where model building is useful is for control: e.g., what should I change about my advertisement to get more people to click on it? How should I change the profile picture I use on an online dating site, to get more people to "swipe right"? Which attributes make the biggest difference to whether people click/swipe? Our goal is to determine which attributes to change, to have the biggest possible effect on something we care about.

We already know how to build a classifier, given a training set. Let's see how to use that as a building block to help us solve these problems.

How do we figure out which attributes have the biggest influence on the output? Take a moment and see what you can come up with.

Feature selection

Background: attributes are also called features, in the machine learning literature.

Our goal is to find a subset of features that are most relevant to the output. The way we'll formalize this is to identify a subset of features that, when we train a classifier using just those features, gives the highest possible accuracy at prediction.

Intuitively, if we get 90% accuracy using all the features and 88% accuracy using just three of the features (for example), then it stands to reason that those three features are probably the most relevant, and they capture most of the information that affects or determines the output.

With this insight, our problem becomes:

Find the subset of features that gives the best possible accuracy (when we use only those features for prediction).

This is a feature selection problem. There are many possible approaches to feature selection. One simple one is to try all possible subsets of the features, and evaluate the accuracy of each. However, this can be very slow, because there are so many ways to choose a subset of the features.
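To get a feel for just how slow exhaustive search is: a data set with $n$ features has $2^n - 1$ non-empty subsets, and even fixing the subset size leaves $\binom{n}{k}$ candidates to evaluate. A quick sketch (the value 14 is the number of candidate features in the tree-cover example below):

```python
from math import comb

n = 14  # number of candidate features, as in the tree-cover data set below
print(2**n - 1)    # all non-empty subsets: 16383
print(comb(n, 4))  # subsets of exactly 4 features: 1001
```

Each of those subsets would require training and evaluating a classifier, so the cost adds up quickly as $n$ grows.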

Therefore, we'll consider a more efficient procedure that often works reasonably well in practice. It is known as greedy feature selection. Here's how it works.

1. Suppose there are $n$ features. Try each on its own, to see how much accuracy we can get using a classifier trained with just that one feature. Keep the best feature.

2. Now we have one feature. Try each of the remaining $n - 1$ features, to see which is the best one to add to it (i.e., we are now training a classifier with just 2 features: the best feature picked in step 1, plus one more). Keep the one that best improves accuracy. Now we have 2 features.

3. Repeat. At each stage, we try all possibilities for how to add one more feature to the feature subset we've already picked, and we keep the one that best improves accuracy.

Let's implement it and try it on some examples!

Code for k-NN

First, some code from last time, to implement $k$-nearest neighbors.

def distance(pt1, pt2):
    tot = 0
    for i in range(len(pt1)):
        tot = tot + (pt1[i] - pt2[i])**2
    return math.sqrt(tot)


def computetablewithdists(training, p):
    dists = np.zeros(training.num_rows)
    attributes = training.drop('Class').rows
    for i in range(training.num_rows):
        dists[i] = distance(attributes[i], p)
    withdists = training.copy()
    withdists.append_column('Distance', dists)
    return withdists

def closest(training, p, k):
    withdists = computetablewithdists(training, p)
    sortedbydist = withdists.sort('Distance')
    topk = sortedbydist.take(range(k))
    return topk

def majority(topkclasses):
    if topkclasses.where('Class', 1).num_rows > topkclasses.where('Class', 0).num_rows:
        return 1
    else:
        return 0

def classify(training, p, k):
    closestk = closest(training, p, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)

def evaluate_accuracy(training, valid, k):
    validattrs = valid.drop('Class')
    numcorrect = 0
    for i in range(valid.num_rows):
        # Run the classifier on the ith patient in the test set
        c = classify(training, validattrs.rows[i], k)
        # Was the classifier's prediction correct?
        if c == valid['Class'][i]:
            numcorrect = numcorrect + 1
    return numcorrect / valid.num_rows


Code for feature selection

Now we'll implement the feature selection algorithm. First, a subroutine to evaluate the accuracy when using a particular subset of features:

def evaluate_features(training, valid, features, k):
    tr = training.select(['Class'] + features)
    va = valid.select(['Class'] + features)
    return evaluate_accuracy(tr, va, k)

Next, we'll implement a subroutine that, given a current subset of features, tries all possible ways to add one more feature to the subset, and evaluates the accuracy of each candidate. This returns a table that summarizes the accuracy of each option it examined.

def try_one_more_feature(training, valid, baseattrs, k):
    results = Table(['Attribute', 'Accuracy'])
    for attr in training.drop(['Class'] + baseattrs).labels:
        acc = evaluate_features(training, valid, [attr] + baseattrs, k)
        results.append((attr, acc))
    return results.sort('Accuracy', descending=True)

Finally, we'll implement the greedy feature selection algorithm, using the above subroutines. For our own purposes of understanding what's going on, I'm going to have it print out, at each iteration, all features it considered and the accuracy it got with each.


def select_features(training, valid, k, maxfeatures=3):
    results = Table(['NumAttrs', 'Attributes', 'Accuracy'])
    curattrs = []
    iters = min(maxfeatures, len(training.labels) - 1)
    while len(curattrs) < iters:
        print('== Computing best feature to add to ' + str(curattrs))
        # Try all ways of adding just one more feature to curattrs
        r = try_one_more_feature(training, valid, curattrs, k)
        r.show()
        print()
        # Take the single best feature and add it to curattrs
        attr = r['Attribute'][0]
        acc = r['Accuracy'][0]
        curattrs.append(attr)
        results.append((len(curattrs), ','.join(curattrs), acc))
    return results

Example: Tree Cover

Now let's try it out on an example. I'm working with a data set gathered by the US Forestry Service. They visited thousands of wilderness locations and recorded various characteristics of the soil and land. They also recorded what kind of tree was growing predominantly on that land. Focusing only on areas where the tree cover was either Spruce or Lodgepole Pine, let's see if we can figure out which characteristics have the greatest effect on whether the predominant tree cover is Spruce or Lodgepole Pine.

There are 500,000 records in this data set -- more than I can analyze with the software we're using. So, I'll pick a random sample of just a fraction of these records, to let us do some experiments that will complete in a reasonable amount of time.

all_trees = Table.read_table('treecover2.csv.gz', sep=',')
training = all_trees.take(range(0, 1000))
validation = all_trees.take(range(1000, 1500))
test = all_trees.take(range(1500, 2000))

training.show(2)


Elevation | Aspect | Slope | HorizDistToWater | VertDistToWater | HorizDistToRoad
2804 | 139 | 9 | 268 | 65 | 3180
2785 | 155 | 18 | 242 | 118 | 3090
... (998 rows omitted)

Let's start by figuring out how accurate a classifier will be, if trained using this data. I'm going to arbitrarily use $k = 15$ for the $k$-nearest neighbor classifier.

evaluate_accuracy(training, validation, 15)

0.572

Now we'll apply feature selection. I wonder which characteristics have the biggest influence on whether Spruce vs Lodgepole Pine grows? We'll look for the best three features.

best_features = select_features(training, validation, 15)

== Computing best feature to add to []


Attribute Accuracy

Elevation 0.762

HorizDistToRoad 0.562

HorizDistToFire 0.534

Hillshade9am 0.518

Hillshade3pm 0.516

Slope 0.516

Area4 0.514

Area3 0.514

Area2 0.514

Area1 0.514

VertDistToWater 0.512

HillshadeNoon 0.51

Aspect 0.51

HorizDistToWater 0.502

== Computing best feature to add to ['Elevation']

Attribute Accuracy

HorizDistToWater 0.782

Hillshade9am 0.776

Slope 0.77

Hillshade3pm 0.768

Area4 0.762

Area3 0.762

Area2 0.762

Area1 0.762

VertDistToWater 0.762

Aspect 0.75

HillshadeNoon 0.748

HorizDistToRoad 0.73

HorizDistToFire 0.664


== Computing best feature to add to ['Elevation', 'HorizDistToWater']

Attribute Accuracy

Hillshade3pm 0.792

HillshadeNoon 0.792

Slope 0.784

Aspect 0.784

Area4 0.782

Area3 0.782

Area2 0.782

Area1 0.782

Hillshade9am 0.78

VertDistToWater 0.776

HorizDistToRoad 0.708

HorizDistToFire 0.64

best_features

NumAttrs Attributes Accuracy

1 Elevation 0.762

2 Elevation,HorizDistToWater 0.782

3 Elevation,HorizDistToWater,Hillshade3pm 0.792

As we can see, Elevation looks like far and away the most discriminative feature. This suggests that this characteristic might play a large role in the biology of which tree grows best, and thus might tell us something about the science.

What about the horizontal distance to water and the amount of shade at 3pm? It looks like they might also be predictive, and thus might play a role in the biology as well -- assuming the apparent improvement in accuracy is real, and we're not just fooling ourselves.

Hold-out sets: Training, Validation, Testing


Suppose we built a predictor using just the best two features, Elevation and HorizDistToWater. How accurate would we expect it to be, on the entire population of locations? 78.2% accurate? More? Less? Why?

Well, at this point, it's hard to tell. It's the same issue we mentioned earlier about fooling ourselves. We've tried multiple different approaches, and taken the best; if we then evaluate it on the same data set we used to select which is best, we will get a biased number -- it might not be an accurate estimate of the true accuracy.

Why? Our data set is noisy. We've looked for correlations, and kept the association that had the highest correlation in our training set. But was that a real relationship, or was it just noise? If you pick a single attribute and measure its accuracy on a sample, we'd expect this to be a reasonable approximation to its accuracy on the entire population, but with some random error. If we look at 100 possible combinations and choose the best, it's possible we found one whose accuracy on the entire population is indeed large -- or it's possible we were just selecting the one whose error term happened to be the largest of the 100.
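A small simulation illustrates this selection effect. Suppose (hypothetically) 100 candidate feature subsets all have the same true accuracy of 0.70, and each measured accuracy is the true value plus random noise. The winner's measured score is then systematically too high:

```python
import random

random.seed(42)
true_accuracy = 0.70

# 100 candidates, all equally good; each measurement adds noise of +/- 0.05.
measured = [true_accuracy + random.uniform(-0.05, 0.05) for _ in range(100)]

# Picking the best of 100 noisy measurements mostly picks the largest noise.
best = max(measured)
print(round(best, 3))  # close to 0.75, well above the true accuracy of 0.70
```

The chosen candidate isn't actually better than the others; its measured advantage is almost entirely noise, which is exactly why the selection step biases the accuracy estimate.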

For these reasons, we can't expect the accuracy numbers we've computed above to necessarily be a good, unbiased measure of the accuracy on the entire population.

The way to get an unbiased estimate of accuracy is the same as last lecture: get some more data; or set some aside in the beginning so we have more when we need it. In this case, I set aside two extra chunks of data, a validation data set and a test data set. I used the validation set to select a few best features. Now we're going to measure the performance of this on the test set, just to see what happens.

evaluate_features(training, validation, ['Elevation'], 15)

0.762

evaluate_features(training, test, ['Elevation'], 15)

0.712

evaluate_features(training, validation, ['Elevation', 'HorizDistToWater'], 15)

0.782


evaluate_features(training, test, ['Elevation', 'HorizDistToWater'], 15)

0.742

Why do you think we see this difference?

To get a better feeling for this, we can look at each of our features on its own, and compare its accuracy on the validation set vs its accuracy on another sample, to see how much random error there is in these accuracy figures. If there was no error, then both accuracy numbers would always be the same, but as we'll see, in practice there is some error, because computing the accuracy on a sample is only an approximation to the accuracy on the entire population.

accuracies = Table(['ValidationAccuracy', 'Accuracy on another sample'])
another_sample = all_trees.take(range(2000, 2500))
for feature in training.drop('Class').labels:
    x = evaluate_features(training, validation, [feature], 15)
    y = evaluate_features(training, another_sample, [feature], 15)
    accuracies.append((x, y))

accuracies.sort('ValidationAccuracy')

ValidationAccuracy | Accuracy on another sample

0.502 0.346

0.51 0.384

0.51 0.35

0.512 0.358

0.514 0.342

0.514 0.342

0.514 0.342

0.514 0.342

0.516 0.366

0.516 0.352

...(4rowsomitted)


accuracies.scatter('ValidationAccuracy')


Thought Questions

Suppose that the top two attributes had been Elevation and HorizDistToRoad. Interpret this for me. What might this mean for the biology of trees? One possible explanation is that the distance to the nearest road affects what kind of tree grows; can you give any other possible explanations?

Once we know the top two attributes are Elevation and HorizDistToWater, suppose we next wanted to know how they affect what kind of tree grows: e.g., does high elevation tend to favor spruce, or does it favor lodgepole pine? How would you go about answering these kinds of questions?

The scientists also gathered some more data that I left out, for simplicity: for each location, they also gathered what kind of soil it has, out of 40 different types. The original data set had a column for soil type, with numbers from 1-40 indicating which of the 40 types of soil was present. Suppose I wanted to include this among the other characteristics. What would go wrong, and how could I fix it up?


For this example we picked $k$ arbitrarily. Suppose we wanted to pick the best value of $k$ -- the one that gives the best accuracy. How could we go about doing that? What are the pitfalls, and how could they be addressed?

Suppose I wanted to use feature selection to help me adjust my online dating profile picture to get the most responses. There are some characteristics I can't change (such as how handsome I am), and some I can (such as whether I smile or not). How would I adjust the feature selection algorithm above to account for this?


Chapter 5

Inference


Confidence Intervals

Whenever we analyze a random sample of a population, our goal is to draw robust conclusions about the whole population, despite the fact that we can measure only part of it. We must guard against being misled by the inevitable random variations in a sample. It is possible that a sample has very different characteristics than the population itself, but such samples are unlikely. It is much more typical that a random sample, especially a large simple random sample, is representative of the population. More importantly, the chance that a sample resembles the population is something that we know in advance, based on how we selected our random sample. Statistical inference is the practice of using this information to make precise statements about a population based on a random sample.

Statistics and Parameters

The Bay Area Bike Share service published a data set describing every bicycle rental from September 2014 to August 2015 in their system. This set of all trips is a population, and fortunately we have the data for each and every trip rather than just a sample. We'll focus only on the free trips, which are trips that last less than 1800 seconds (half an hour).

free_trips = Table.read_table("trip.csv").where('Duration', are.below(1800))

free_trips


Start Station | End Station | Duration
Harry Bridges Plaza (Ferry Building) | San Francisco Caltrain (Townsend at 4th) | 765
San Antonio Shopping Center | Mountain View City Hall | 1036
Post at Kearny | 2nd at South Park | 307
San Jose City Hall | San Salvador at 1st | 409
Embarcadero at Folsom | Embarcadero at Sansome | 789
Yerba Buena Center of the Arts (3rd @ Howard) | San Francisco Caltrain (Townsend at 4th) | 293
Embarcadero at Folsom | Embarcadero at Sansome | 896
Embarcadero at Sansome | Steuart at Market | 255
Beale at Market | Temporary Transbay Terminal (Howard at Beale) | 126
Post at Kearny | South Van Ness at Market | 932
... (338333 rows omitted)

A quantity measured for a whole population is called a parameter. For example, the average duration of free trips from September 2014 to August 2015 can be computed exactly from these data: 550.00.

np.average(free_trips.column('Duration'))

550.00143345658103

Although we have information for the full population, we will investigate what we can learn from a simple random sample. Below, we sample 100 trips without replacement from free_trips not just once, but 40 different times. All 4000 trips are stored in a table called samples.

n = 100
samples = Table(['Sample#', 'Duration'])
for i in np.arange(40):
    for trip in free_trips.sample(n).rows:
        samples.append([i, trip.item('Duration')])

samples


Sample# Duration

0 557

0 491

0 877

0 279

0 339

0 542

0 715

0 255

0 1543

0 194

...(3990rowsomitted)

A quantity measured for a sample is called a statistic. For example, the average duration of a free trip in a sample is a statistic, and we find that there is some variation across samples.

for i in np.arange(3):
    sample_average = np.average(samples.where('Sample#', i).column('Duration'))
    print("Average duration in sample", i, "is", sample_average)

Average duration in sample 0 is 547.29
Average duration in sample 1 is 500.6
Average duration in sample 2 is 660.16

Often, the goal of analysis is to estimate parameters from statistics. Proper statistical inference involves more than just picking an estimate: we should also say something about how confident we should be in that estimate. A single guess is rarely sufficient to make a robust conclusion about a population. We also need to quantify in some way how precise we believe that guess to be.

Already we can see from the example above that the average duration of a sample is around 550, the parameter. Our goal is to quantify this notion of around, and one way to do so is to state a range of values.

Error in Estimates


We can learn a great deal about the behavior of a statistic by computing it for many samples. In the scatter diagram below, the order in which the statistic was computed appears on the horizontal axis, and the value of the average duration for a sample of n trips appears on the vertical axis.

samples.group(0, np.average).scatter(0, 1, s=50)

The sample number doesn't tell us anything beyond when the sample was selected, and we can see from this unstructured cloud of points that one sample does not affect the next in any consistent way. The sample averages are indeed all around 550. Some are above, and some are below.

samples.group(0, np.average).hist(1, bins=np.arange(480, 641, 10))


There are 40 averages for 40 samples. If we find an interval that contains the middle 38 out of 40, it will contain 95% of the sample averages. One such interval is bounded by the second smallest and the second largest sample averages among the 40 samples.

averages = np.sort(samples.group(0, np.average).column(1))
lower = averages.item(1)
upper = averages.item(38)
print('Empirically, 95% of sample averages fell within the interval', lower, 'to', upper)

Empirically, 95% of sample averages fell within the interval 502.38 to 620.04

This statement gives us our first notion of a margin of error. If this statement were to hold up not just for our 40 samples, but for all samples, then a reasonable strategy for estimating the true population average would be to draw a 100-trip sample, compute its sample average, and guess that the true average was within this margin of error of that quantity. We would be right at least 95% of the time.

The problem with this approach is that finding this margin of error required drawing many different samples, and we would like to be able to say something similar after having collected only one 100-trip sample.

Variability Within Samples

By contrast, the individual durations in the samples themselves are not closely clustered around 550. Instead, they span the whole range from 0 to 1800. The durations for the first three 100-item samples are visualized below. All have somewhat similar shapes; they were drawn from the same population.

for i in np.arange(3):
    samples.where('Sample#', i).hist(1, bins=np.arange(0, 1801, 200))


We have now observed two different ways in which the variation of a random sample can be observed: the statistics computed from those samples vary in magnitude, and the sampled quantities themselves vary in distribution. That is, there is variability across samples, which we see by comparing sample averages, and there is variability within each sample, which we see in the duration histograms above.

Percentiles

The 80th percentile of a collection of values is the smallest value in the collection that is at least as large as 80% of the values in the collection. Likewise, the 40th percentile is at least as large as 40% of the values.

For example, let's consider the sizes of the five largest continents: Africa, Antarctica, Asia, North America, and South America, rounded to the nearest million square miles.

sizes = [12, 17, 6, 9, 7]

The 20th percentile is the smallest value that is at least as large as 20%, or one fifth, of the elements.

percentile(20, sizes)

6

The 60th percentile is at least as large as 60%, or three fifths, of the elements.


percentile(60, sizes)

9

To find the element corresponding to a percentile $p$, sort the elements in increasing order and take the $k$th element, where $k = \frac{p}{100} \cdot n$ and $n$ is the size of the collection. In general, any percentile from 0 to 100 can be computed for a collection of values, and it is always an element of the collection. If $k$ is not an integer, round it up to find the percentile element. For example, the 90th percentile of a five-element set is the fifth element, because $\frac{90}{100} \cdot 5$ is 4.5, which rounds up to 5.

percentile(90, sizes)

17
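The rule above can be sketched in a few lines of plain Python (a hypothetical `percentile_of` helper that mimics the behavior described here, not the library's own implementation):

```python
import math

def percentile_of(p, values):
    # Sort the collection and take the k-th element (1-indexed),
    # where k = (p/100) * n, rounded up when it is not an integer.
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[k - 1]

sizes = [12, 17, 6, 9, 7]
print(percentile_of(20, sizes))  # 6
print(percentile_of(60, sizes))  # 9
print(percentile_of(90, sizes))  # 17
```

Rounding up with `math.ceil` is what guarantees the result is always an actual element of the collection.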

When the percentile function is called with a single argument, it returns a function that computes percentiles of collections.

tenth = percentile(10)

tenth(sizes)

6

tenth(np.arange(50))

4

tenth(np.arange(5000))

499


Confidence Intervals

Inference begins with a population: a collection that you can describe but often cannot measure directly. One thing we can often do with a population is to draw a simple random sample. By measuring a statistic of the sample, we may hope to estimate a parameter of the population. In this process, the population is not random, nor are its parameters. The sample is random, and so is its statistic. The whole process of estimation therefore gives a random result. It starts from a population (fixed), draws a sample (random), computes a statistic (random), and then estimates the parameter from the statistic (random). By calling the process random, we mean that the final estimate could have been different because the sample could have been different.

A confidence interval is one notion of a margin of error around an estimate. The margin of error acknowledges that the estimation process could have been misleading because of the variation in the sample. It gives a range that might contain the original parameter, along with a percentage of confidence. For example, a 95% confidence interval is one in which we would expect the interval to contain the true parameter for 95% of random samples. For any particular sample, any particular estimate, and any particular interval, the parameter is either contained or not; we never know for sure. The purpose of the confidence interval is to quantify how often our estimation process gives us a range containing the truth.

There are many different ways to compute a confidence interval. The challenge in computing a confidence interval from a single sample is that we don't know the population, and so we don't know how its samples will vary. One way to make progress despite this limited information is to assume that the variability within the sample is similar to the variability within the population.

Resampling from a Sample

Even though we happen to have the whole population of free_trips, let's imagine for a moment that we have only a single 100-trip sample from this population. That sample has 100 durations. Our goal now is to learn what we can about the population using just this sample.

sample = free_trips.sample(n)

sample


Start Station | End Station | Duration
Grant Avenue at Columbus Avenue | San Francisco Caltrain 2 (330 Townsend) | 877
Embarcadero at Folsom | San Francisco Caltrain (Townsend at 4th) | 597
Beale at Market | Beale at Market | 67
2nd at Folsom | Washington at Kearny | 714
Clay at Battery | Embarcadero at Sansome | 510
South Van Ness at Market | Powell Street BART | 494
San Jose Diridon Caltrain Station | MLK Library | 530
Mechanics Plaza (Market at Battery) | Clay at Battery | 286
South Van Ness at Market | Civic Center BART (7th at Market) | 172
Broadway St at Battery St | 5th at Howard | 551
... (90 rows omitted)

average = np.average(sample.column('Duration'))

average

518.55999999999995

A technique called bootstrap resampling uses the variability of the sample as a proxy for the variability of the population. Instead of drawing many samples from the population, we will draw many samples from the sample. Since the original sample and the re-sampled sample have the same size, we must draw with replacement in order to see any variability at all.

resamples = Table(['Resample #', 'Duration'])
for i in np.arange(40):
    resample = sample.sample(n, with_replacement=True)
    for trip in resample.rows:
        resamples.append([i, trip.item('Duration')])

resamples


Resample #  Duration
0           329
0           252
0           252
0           685
0           434
0           396
0           434
0           585
0           394
0           202
... (3990 rows omitted)

Using the resamples, we can investigate the variability of the corresponding sample averages.

resamples.group(0, np.average).scatter(0, 1, s=50)

These sample averages are not necessarily clustered around the true average 550. Instead, they will be clustered around the average of the sample from which they were re-sampled.


resamples.group(0, np.average).hist(1, bins=np.arange(480, 641, 10))

Although the center of this distribution is not the same as the true population average, its variability is similar.

Bootstrap Confidence Interval

A technique called bootstrap resampling estimates a confidence interval using resampling. To find a confidence interval, we must estimate how much the sample statistic deviates from the population parameter. The bootstrap assumes that a resampled statistic will deviate from the sample statistic in approximately the same way that the sample statistic deviates from the parameter.

To carry out the technique, we first compute the deviations from the sample statistic (in this case, the sample average) for many resampled samples. The number of resamples is arbitrary, and resampling doesn't require collecting any new data; 1000 or more is typical.

deviations = Table(['Resample #', 'Deviation'])
for i in np.arange(1000):
    resample = sample.sample(n, with_replacement=True)
    dev = average - np.average(resample.column(2))
    deviations.append([i, dev])

deviations


Resample #  Deviation
0           30.95
1           41.16
2           3.37
3           2.54
4           -22.87
5           27.63
6           49.93
7           11.91
8           26.32
9           -5.97
... (990 rows omitted)

A deviation of zero indicates that the resampled statistic is the same as the sample statistic. That's evidence that the sample statistic is perhaps the same as the population parameter. A positive deviation is evidence that the sample statistic may be greater than the parameter. As we can see, for averages, the distribution of deviations is roughly symmetric.

deviations.hist(1, bins=np.arange(-100, 101, 10))

A 95% confidence interval is computed based on the middle 95% of this distribution of deviations. The maximum and minimum deviations used are the 97.5th percentile and 2.5th percentile respectively, because the span between these contains 95% of the distribution.


lower = average - percentile(97.5, deviations.column(1))

lower

461.93000000000001

upper = average - percentile(2.5, deviations.column(1))

upper

576.29999999999995

Now, we have not only an estimate, but an interval around it that expresses our uncertainty about the value of the parameter.

print('A 95% confidence interval around', average, 'is', lower, 'to', upper)

A 95% confidence interval around 518.56 is 461.93 to 576.3

And behold, the interval happens to contain the true parameter 550. Normally, after generating a confidence interval, there is no way to check that it is correct, because to do so requires access to the entire population. In this case, we started with the whole population in a single table, and so we are able to check.

The following function encapsulates the entire process of producing a confidence interval for the population average using a single sample. Each time it is run, even for the same sample, the interval will be slightly different because of the random variation of resampling. However, the upper and lower bounds generated by 2000 samples for this problem are typically quite similar for multiple calls.


def average_ci(sample, label):
    deviations = Table(['Resample #', 'Deviation'])
    n = sample.num_rows
    average = np.average(sample.column(label))
    for i in np.arange(1000):
        resample = sample.sample(n, with_replacement=True)
        dev = np.average(resample.column(label)) - average
        deviations.append([i, dev])
    return (average - percentile(97.5, deviations.column(1)),
            average - percentile(2.5, deviations.column(1)))

average_ci(sample, 'Duration')

(457.43999999999994,572.63999999999987)

average_ci(sample, 'Duration')

(458.7399999999999,575.50999999999988)

However, if we draw an entirely new sample, then the confidence interval will change.

average_ci(free_trips.sample(n), 'Duration')

(530.18000000000006,661.21000000000004)

The interval is about the same width, and that's typical of multiple samples of the same size drawn from the same population. However, if a much larger sample is drawn, then the confidence interval that results is typically going to be much smaller. We're not losing confidence in this case; we're just using more evidence to perform inference, and so we have less uncertainty about the full population.

average_ci(free_trips.sample(16 * n), 'Duration')

(527.18937499999993,555.11437499999988)


Evaluating Confidence Intervals

When given only a sample, it is not possible to check whether the parameter was estimated correctly. However, in the case of the Bay Area Bike Share, we have not only the sample, but many samples and the whole population. Therefore, we can check how often the confidence intervals we compute in this way do in fact contain the true parameter.

Before measuring the accuracy of our technique, we can visualize the result of resampling each sample. The following scatter diagram has 6 vertical lines for 6 samples, and each line consists of 100 points that represent the statistics from resamples for that sample.

samples = Table(['Sample #', 'Resample average', 'Sample Average'])
for i in np.arange(6):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    for j in np.arange(100):
        resample = sample.sample(n, with_replacement=True)
        resample_average = np.average(resample.column('Duration'))
        samples.append([i, resample_average, average])

samples.scatter(0, s=80)

We can see from the plot above that for almost every sample, there are some resampled averages above 550 and some below. When we use these resampled averages to compute confidence intervals, many of those intervals contain 550 as a result.


Below, we draw 100 more samples and show the upper and lower bounds of the confidence intervals computed from each one.

intervals = Table(['Sample #', 'Lower', 'Estimate', 'Upper'])
for i in np.arange(100):
    sample = free_trips.sample(n)
    average = np.average(sample.column('Duration'))
    lower, upper = average_ci(sample, 'Duration')
    intervals.append([i, lower, average, upper])

We can compute precisely which intervals contain the truth, and we should find that about 95 of them do.

correct = np.logical_and(intervals.column('Lower') <= 550, intervals.column('Upper') >= 550)

np.count_nonzero(correct)

97

Each confidence interval is generated from only a single sample of 100 trips. As a result, the width of the interval varies quite a bit as well. Compared to the interval width we computed originally (which required multiple samples), we see that some of these intervals are much more narrow and some are much wider.

Table().with_column('Width', intervals.column('Upper') - intervals.column('Lower')).hist(0)


Given a single sample, it is impossible to know whether you have correctly estimated the parameter or not. However, you can infer a precise notion of confidence by following a process based on random sampling.
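The whole deviations-based procedure can also be sketched with plain NumPy arrays, mirroring the average_ci function above. The exponential durations below are a made-up stand-in for a 100-trip sample, since the free_trips table is not available outside the chapter.

```python
import numpy as np

def bootstrap_average_ci(sample, resamples=1000, level=95):
    """Percentile-of-deviations bootstrap interval for a population average,
    following the approach in this section."""
    sample = np.asarray(sample, dtype=float)
    average = np.mean(sample)
    # Deviations of resampled averages from the sample average.
    devs = np.array([np.mean(np.random.choice(sample, size=len(sample))) - average
                     for _ in range(resamples)])
    tail = (100 - level) / 2
    # Shift the sample average by the extreme deviations to bracket the parameter.
    return (average - np.percentile(devs, 100 - tail),
            average - np.percentile(devs, tail))

# Hypothetical stand-in for a sample of 100 trip durations (an assumption for
# illustration only).
durations = np.random.RandomState(0).exponential(550, size=100)
lower, upper = bootstrap_average_ci(durations)
```

As in the chapter, running the function again on the same sample gives a slightly different interval because the resampling is random.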


Distance Between Distributions

By now we have seen many examples of situations in which one distribution appears to be quite close to another, such as a population and its sample. The goal of this section is to quantify the difference between two distributions. This will allow our analyses to be based on more than the assessments that we are able to make by eye.

For this, we need a measure of the distance between two distributions. We will develop such a measure in the context of a study conducted in 2010 by the American Civil Liberties Union (ACLU) of Northern California.

The focus of the study was the racial composition of jury panels in Alameda County. A jury panel is a group of people chosen to be prospective jurors; the final trial jury is selected from among them. Jury panels can consist of a few dozen people or several thousand, depending on the trial. By law, a jury panel is supposed to be representative of the community in which the trial is taking place. Section 197 of California's Code of Civil Procedure says, "All persons selected for jury service shall be selected at random, from a source or sources inclusive of a representative cross section of the population of the area served by the court."

The final jury is selected from the panel by deliberate inclusion or exclusion: the law allows potential jurors to be excused for medical reasons; lawyers on both sides may strike a certain number of potential jurors from the list in what are called "peremptory challenges"; the trial judge might make a selection based on questionnaires filled out by the panel; and so on. But the initial panel is supposed to resemble a random sample of the population of eligible jurors.

The ACLU of Northern California compiled data on the racial composition of the jury panels in 11 felony trials in Alameda County in the years 2009 and 2010. In those panels, the total number of people who reported for jury service was 1453. The ACLU gathered demographic data on all of these prospective jurors, and compared those data with the composition of all eligible jurors in the county.

The data are tabulated below in a table called jury. For each race, the first value is the percentage of all eligible juror candidates of that race. The second value is the percentage of people of that race among those who appeared for the process of selection into the jury.


jury = Table(["Race", "Eligible", "Panel"]).with_rows([
    ["Asian",  0.15, 0.26],
    ["Black",  0.18, 0.08],
    ["Latino", 0.12, 0.08],
    ["White",  0.54, 0.54],
    ["Other",  0.01, 0.04],
])
jury.set_format([1, 2], PercentFormatter(0))

Race    Eligible  Panel
Asian   15%       26%
Black   18%       8%
Latino  12%       8%
White   54%       54%
Other   1%        4%

Some races are overrepresented and some are underrepresented on the jury panels in the study. A bar graph is helpful for visualizing the differences.

jury.barh('Race')

Total Variation Distance


To measure the difference between the two distributions, we will compute a quantity called the total variation distance between them. To compute the total variation distance, take the difference between the two proportions in each category, add up the absolute values of all the differences, and then divide the sum by 2.

Here are the differences and the absolute differences:

jury.append_column("Difference", jury.column("Panel") - jury.column("Eligible"))

jury.append_column("Abs. Difference", np.abs(jury.column("Difference")))

jury.set_format([3, 4], PercentFormatter(0))

Race    Eligible  Panel  Difference  Abs. Difference
Asian   15%       26%    11%         11%
Black   18%       8%     -10%        10%
Latino  12%       8%     -4%         4%
White   54%       54%    0%          0%
Other   1%        4%     3%          3%

And here is the sum of the signed differences, divided by 2; it is essentially zero, because the positive and negative entries cancel:

sum(jury.column(3)) / 2

1.3877787807814457e-17

The total variation distance between the distribution of eligible jurors and the distribution of the panels is 0.14. Before we examine the numerical value, let us examine the reasons behind the steps in the calculation.

It is quite natural to compute the difference between the proportions in each category. Now take a look at the column Difference and notice that the sum of its entries is 0: the positive entries add up to 0.14, exactly canceling the total of the negative entries, which is -0.14. The proportions in each of the two columns Panel and Eligible add up to 1, and so the give-and-take between their entries must add up to 0.

To avoid the cancellation, we drop the negative signs and then add all the entries. But this gives us two times the total of the positive entries (equivalently, two times the total of the negative entries, with the sign removed). So we divide the sum by 2.


We would have obtained the same result by just adding the positive differences. But our method of including all the absolute differences eliminates the need to keep track of which differences are positive and which are not.
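As a quick arithmetic check with plain NumPy, using the jury proportions from the table above, both ways of computing the distance give the same 0.14:

```python
import numpy as np

# Jury proportions from the table above.
eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])
panel = np.array([0.26, 0.08, 0.08, 0.54, 0.04])

diff = panel - eligible
# Half the sum of the absolute differences...
tvd_abs = np.abs(diff).sum() / 2
# ...equals the sum of just the positive differences.
tvd_pos = diff[diff > 0].sum()
```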

The following function computes the total variation distance between two columns of a table.

def total_variation_distance(column, other):
    return sum(np.abs(column - other)) / 2

def table_tvd(table, label, other):
    return total_variation_distance(table.column(label), table.column(other))

table_tvd(jury, 'Eligible', 'Panel')

0.14000000000000001

Are the panels representative of the population?

We will now turn to the numerical value of the total variation distance between the eligible jurors and the panel. How can we interpret the distance of 0.14? To answer this, let us recall that the panels are supposed to be selected at random. It will therefore be informative to compare the value of 0.14 with the total variation distance between the eligible jurors and a randomly selected panel.

To do this, we will employ our skills at simulation. There were 1453 prospective jurors in the panels in the study. So let us take a random sample of size 1453 from the population of eligible jurors.

[Technical note. Random samples of prospective jurors would be selected without replacement. However, when the size of a sample is small relative to the size of the population, sampling without replacement resembles sampling with replacement; the proportions in the population don't change much between draws. The population of eligible jurors in Alameda County is over a million, and compared to that, a sample size of about 1500 is quite small. We will therefore sample with replacement.]

The np.random.multinomial function draws a random sample uniformly with replacement from a population whose distribution is categorical. Specifically, np.random.multinomial takes as its input a sample size and an array consisting of the probabilities of choosing different categories. It returns the count of the number of times each category was chosen.

sample_size = 1453

random_panel = np.random.multinomial(sample_size, jury["Eligible"])

random_panel

array([208,265,166,805,9])

sum(random_panel)

1453

We can compare this distribution with the distribution of eligible jurors, by converting the counts to proportions. To do this, we will divide the counts by the sample size. It is clear from the results that the two distributions are quite similar.

sampled = jury.select(['Race', 'Eligible', 'Panel']).with_column(
    'Random Panel', random_panel / sample_size)
sampled.set_format(3, PercentFormatter(0))

Race    Eligible  Panel  Random Panel
Asian   15%       26%    14%
Black   18%       8%     18%
Latino  12%       8%     11%
White   54%       54%    55%
Other   1%        4%     1%

As always, it helps to visualize. The population and sample are quite similar.

sampled.barh('Race', [1, 3])


When we also compare to the panel, the difference is stark.

sampled.barh('Race', [1, 3, 2])

The total variation distance between the population distribution and the sample distribution is quite small.

table_tvd(sampled, 'Eligible', 'Random Panel')

0.016407432897453545

Comparing this to the distance of 0.14 for the panel quantifies our observation that the random sample is close to the distribution of eligible jurors, and the panel is relatively far from the distribution of eligible jurors.

However, the distance between the random sample and the eligible jurors depends on the sample; sampling again might give a different result.


How do random samples differ from the population?

The total variation distance between the distribution of the random sample and the distribution of the eligible jurors is the statistic that we are using to measure the distance between the two distributions. By repeating the process of sampling, we can measure how much the statistic varies across different random samples. The code below computes the empirical distribution of the statistic based on a large number of replications of the sampling process.

# Compute empirical distribution of TVDs
sample_size = 1453
repetitions = 1000
eligible = jury.column('Eligible')

tvds = Table(["TVD from a random sample"])
for i in np.arange(repetitions):
    sample = np.random.multinomial(sample_size, eligible) / sample_size
    tvd = sum(abs(sample - eligible)) / 2
    tvds.append([tvd])

tvds

TVD from a random sample

0.0207571

0.0286098

0.0201721

0.0308672

0.0185134

0.0102546

0.0119133

0.0346731

0.0271576

0.0216862

... (990 rows omitted)


Each row of the column above contains the total variation distance between a random sample and the population of eligible jurors.

The histogram of this column shows that drawing 1453 jurors at random from the pool of eligible candidates results in a distribution that rarely deviates from the eligible jurors' race distribution by more than 0.05.

tvds.hist(bins=np.arange(0, 0.2, 0.005))

How do the panels compare to random samples?

The panels in the study, however, were not quite so similar to the eligible population. The total variation distance between the panels and the population was 0.14, which is far out in the tail of the histogram above. It does not look like a typical distance between a random sample and the eligible population.

Our analysis supports the ACLU's conclusion that the panels were not representative of the population. The ACLU report discusses several reasons for the discrepancies. For example, some minority groups were underrepresented on the records of voter registration and of the Department of Motor Vehicles, the two main sources from which jurors are selected; at the time of the study, the county did not have an effective process for following up on prospective jurors who had been called but had failed to appear; and so on. Whatever the reasons, it seems clear that the composition of the jury panels was different from what we would have expected in a random sample.

A Classical Interlude: the Chi-Squared Statistic


"Do the data look like a random sample?" is a question that has been asked in many contexts for many years. In classical data analysis, the statistic most commonly used in answering this question is called the χ² ("chi-squared") statistic, and is calculated as follows:

Step 1. For each category, define the "expected count" as follows:

expected count = sample size × proportion of the category in the population distribution

This is the count that you would expect in that category, in a randomly chosen sample.

Step 2. For each category, compute

(observed count - expected count)² / expected count

Step 3. Add up all the numbers computed in Step 2.

A little algebra shows that this is equivalent to:

Alternative Method, Step 1. For each category, compute

(sample proportion - population proportion)² / population proportion

Alternative Method, Step 2. Add up all the numbers you computed in the step above, and multiply the sum by the sample size.

It makes sense that the statistic should be based on the difference between proportions in each category, just as the total variation distance is. The remaining steps of the method are not very easy to explain without getting deeper into mathematics.
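As a sanity check, the two methods can be carried out side by side in plain NumPy with the jury proportions from this section; the arrays and names here are just for illustration:

```python
import numpy as np

sample_size = 1453
eligible = np.array([0.15, 0.18, 0.12, 0.54, 0.01])  # population proportions
panel = np.array([0.26, 0.08, 0.08, 0.54, 0.04])     # observed proportions

# Original method: compare observed and expected counts.
observed = panel * sample_size
expected = eligible * sample_size
chisq_counts = np.sum((observed - expected) ** 2 / expected)

# Alternative method: compare proportions, then scale by the sample size.
chisq_props = np.sum((panel - eligible) ** 2 / eligible) * sample_size
```

Both computations give the same number, roughly 348, matching the value for the panels below.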

The reason for choosing this statistic over any other was that statisticians were able to come up with a formula for its approximate probability distribution when the sample size is large. The distribution has an impressive name: it is called the χ² distribution with degrees of freedom equal to 4 (one fewer than the number of categories). Data analysts would compute the χ² statistic for their sample, and then compare it with tabulated numerical values of the χ² distributions.

Simulating the χ² statistic.

The χ² statistic is just a statistic like any other. We can use the computer to simulate its behavior even if we have not studied all the underlying mathematics.

For the panels in the ACLU study, the χ² statistic is about 348:


sum(((jury["Panel"] - jury["Eligible"]) ** 2) / jury["Eligible"]) * sample_size

348.07422222222226

To generate the empirical distribution of the χ² statistic, all we need is a small modification of the code we used for simulating total variation distance:

# Compute empirical distribution of chi-squared statistic
classical_from_sample = Table(["'Chi-squared' statistic, from a random sample"])
for i in np.arange(repetitions):
    sample = np.random.multinomial(sample_size, eligible) / sample_size
    chisq = sum(((sample - eligible) ** 2) / eligible) * sample_size
    classical_from_sample.append([chisq])

classical_from_sample

'Chi-squared' statistic, from a random sample

5.29816

1.83597

5.03969

4.3795

10.738

4.74311

4.15047

5.83918

9.6604

0.127756

... (990 rows omitted)

Here is a histogram of the empirical distribution of the χ² statistic. The simulated values of χ² based on random samples are considerably smaller than the value of 348 that we got from the jury panels.


classical_from_sample.hist(bins=np.arange(0, 18, 0.5), normed=True)

In fact, the long-run average value of χ² statistics is equal to the degrees of freedom. And indeed, the average of the simulated values of χ² is close to 4, the degrees of freedom in this case.

np.mean(classical_from_sample.column(0))

3.978631566873136

In this situation, random samples are expected to produce a χ² statistic of about 4. The panels, on the other hand, produced a χ² statistic of 348. This is yet more confirmation of the conclusion that the panels do not resemble a random sample from the population of eligible jurors.


Hypothesis Testing

Our investigation of jury panels is an example of hypothesis testing. We wished to know the answer to a yes-or-no question about the world, and we reasoned about random samples in order to answer the question. In this section, we formalize the process.

Sampling from a Distribution

The rows of a table describe individual elements of a collection of data. In the previous section, we worked with a table that described a distribution over categories. Tables that describe the proportions or counts of different categories in a sample or population arise often in data analysis as a summary of a data collection process. The sample_from_distribution method draws a sample of counts for the categories. Its implementation uses the same np.random.multinomial method from the previous section.

The table below describes the proportion of the eligible potential jurors in Alameda County (estimated from 2000 census data and other sources) for broad categories of race and ethnic background. This table was compiled by Professor Weeks, a demographer at San Diego State University, for the Alameda County trial People v. Stuart Alexander in 2002.

population = Table(["Race", "Eligible"]).with_rows([
    ["Asian",  0.15],
    ["Black",  0.18],
    ["Latino", 0.12],
    ["White",  0.54],
    ["Other",  0.01],
])
population

Race    Eligible
Asian   0.15
Black   0.18
Latino  0.12
White   0.54
Other   0.01


The sample_from_distribution method takes a number of samples and a column index or label. It returns a table with an additional column containing category counts for a sample drawn from that distribution.

sample_size = 1453

population.sample_from_distribution("Eligible", sample_size)

Race    Eligible  Eligible sample
Asian   0.15      233
Black   0.18      245
Latino  0.12      177
White   0.54      787
Other   0.01      11

Each count is selected randomly in such a way that the chance of each category is the corresponding proportion in the original "Eligible" column. Sampling again will give different counts, but again the "White" category is much more common than "Other" because of its much higher proportion.

population.sample_from_distribution("Eligible", sample_size)

Race    Eligible  Eligible sample
Asian   0.15      227
Black   0.18      247
Latino  0.12      155
White   0.54      819
Other   0.01      5

The option proportions=True draws sample counts and then divides each count by the total number of samples. The result is another distribution, one based on a sample.

sample = population.sample_from_distribution("Eligible", sample_size, proportions=True)

sample


Race    Eligible  Eligible sample
Asian   0.15      0.140399
Black   0.18      0.189264
Latino  0.12      0.125258
White   0.54      0.532691
Other   0.01      0.0123882

This sample can also be used to generate a sample. The sample of a sample is called a resampled sample or a bootstrap sample. It is not a random sample of the original distribution, but shares many characteristics with such a sample.

sample.sample_from_distribution('Eligible sample', sample_size)

Race    Eligible  Eligible sample  Eligible sample sample
Asian   0.15      0.140399         204
Black   0.18      0.189264         258
Latino  0.12      0.125258         163
White   0.54      0.532691         810
Other   0.01      0.0123882        18

Empirical Distributions

An empirical distribution of a statistic is a table generated by computing the statistic for many samples. The previous section used empirical distributions to reason about whether or not a sample had a statistic that was typical of a random sample. The following function computes the empirical distribution of a statistic, which is expressed as a function that takes a sample.

def empirical_distribution(table, label, sample_size, k, f):
    stats = Table(['Sample #', 'Statistic'])
    for i in np.arange(k):
        sample = table.sample_from_distribution(label, sample_size)
        statistic = f(sample)
        stats.append([i, statistic])
    return stats


This function can be used to compute an empirical distribution of the total variation distance from the population distribution.

def tvd_from_eligible(sample):
    counts = sample.column('Eligible sample')
    distribution = counts / sum(counts)
    return total_variation_distance(population.column('Eligible'), distribution)

empirical_distribution(population, 'Eligible', sample_size, 1000, tvd_from_eligible)

Sample #  Statistic
0         0.0248796
1         0.0195664
2         0.00401927
3         0.0165864
4         0.0207502
5         0.013042
6         0.0297041
7         0.0153544
8         0.00830695
9         0.014384
... (990 rows omitted)

The empirical distribution can be visualized using a histogram.

empirical_distribution(population, 'Eligible', sample_size, 1000, tvd_from_eligible).hist(1)


U.S. Supreme Court, 1965: Swain vs. Alabama

In the early 1960's, in Talladega County in Alabama, a black man called Robert Swain was convicted of raping a white woman and was sentenced to death. He appealed his sentence, citing among other factors the all-white jury. At the time, only men aged 21 or older were allowed to serve on juries in Talladega County. In the county, 26% of the eligible jurors were black, but there were only 8 black men among the 100 selected for the jury panel in Swain's trial. No black man was selected for the trial jury.

In 1965, the Supreme Court of the United States denied Swain's appeal. In its ruling, the Court wrote "... the overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of [black men]."

Let us use the methods we have developed to examine the disparity between 8 out of 100 and 26 out of 100 black men in a panel of 100 drawn at random from among the eligible jurors.

alabama = Table(['Race', 'Eligible']).with_rows([
    ["Black", 0.26],
    ["Other", 0.74]
])
alabama.set_format(1, PercentFormatter(0))

Race    Eligible
Black   26%
Other   74%


As our test statistic, we will use the number of black men in a random sample of size 100. Here is an empirical distribution of this statistic.

def value_for(category, category_label, value_label):
    def in_table(sample):
        return sample.where(category_label, category).column(value_label).item(0)
    return in_table

num_black = value_for('Black', 'Race', 'Eligible sample')

black_in_sample = empirical_distribution(alabama, 'Eligible', 100, 10000, num_black)

black_in_sample

Sample #  Statistic
0         27
1         31
2         28
3         27
4         24
5         23
6         29
7         28
8         16
9         25
... (9990 rows omitted)

The numbers of black men in the first 10 repetitions were quite a bit larger than 8. The empirical histogram below shows the distribution of the test statistic over all the repetitions.

black_in_sample.hist(1, bins=np.arange(0, 50, 1))


If the 100 men in Swain's jury panel had been chosen at random, it would have been extremely unlikely for the number of black men on the panel to be as small as 8. We must conclude that the percentage disparity was larger than the disparity expected due to chance variation.
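The same conclusion can be reached directly with NumPy, without the Table machinery used above: each panel member is black with chance 26%, so the count in a random panel of 100 follows a binomial distribution. This is a sketch; the seed and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed for reproducibility
# Simulate 100,000 random panels of 100 drawn from a 26%-black population.
counts = rng.binomial(n=100, p=0.26, size=100000)
# Proportion of simulated panels with 8 or fewer black men.
p_low = np.mean(counts <= 8)
```

The proportion is a tiny fraction of a percent, matching the judgment that a count as small as 8 is extremely unlikely under random selection.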

Method and Terminology of Statistical Tests of Hypotheses

We have developed some of the fundamental concepts of statistical tests of hypotheses, in the context of examples about jury selection. Using statistical tests as a way of making decisions is standard in many fields and has a standard terminology. Here is the sequence of the steps in most statistical tests, along with some terminology and examples.

STEP 1: THE HYPOTHESES

All statistical tests attempt to choose between two views of how the data were generated. These two views are called hypotheses.

The null hypothesis. This says that the data were generated at random under clearly specified assumptions that make it possible to compute chances. The word "null" reinforces the idea that if the data look different from what the null hypothesis predicts, the difference is due to nothing but chance.

In both of our examples about jury selection, the null hypothesis is that the panels were selected at random from the population of eligible jurors. Though the racial composition of the panels was different from that of the populations of eligible jurors, there was no reason for the difference other than chance variation.


The alternative hypothesis. This says that some reason other than chance made the data differ from what was predicted by the null hypothesis. Informally, the alternative hypothesis says that the observed difference is "real."

In both of our examples about jury selection, the alternative hypothesis is that the panels were not selected at random. Something other than chance led to the differences between the racial composition of the panels and the racial composition of the populations of eligible jurors.

STEP 2: THE TEST STATISTIC

In order to decide between the two hypotheses, we must choose a statistic upon which we will base our decision. This is called the test statistic.

In the example about jury panels in Alameda County, the test statistic we used was the total variation distance between the racial distributions in the panels and in the population of eligible jurors. In the example about Swain versus the State of Alabama, the test statistic was the number of black men on the jury panel.

Calculating the observed value of the test statistic is often the first computational step in a statistical test. In the case of jury panels in Alameda County, the observed value of the total variation distance between the distributions in the panels and the population was 0.14. In the example about Swain, the number of black men on his jury panel was 8.

STEP 3: THE PROBABILITY DISTRIBUTION OF THE TEST STATISTIC, UNDER THE NULL HYPOTHESIS

This step sets aside the observed value of the test statistic, and instead focuses on what the value might be if the null hypothesis were true. Under the null hypothesis, the sample could have come out differently due to chance. So the test statistic could have come out differently. This step consists of figuring out all possible values of the test statistic and all their probabilities, under the null hypothesis of randomness.

In other words, in this step we calculate the probability distribution of the test statistic pretending that the null hypothesis is true. For many test statistics, this can be a daunting task both mathematically and computationally. Therefore, we approximate the probability distribution of the test statistic by the empirical distribution of the statistic based on a large number of repetitions of the sampling procedure.

This was the empirical distribution of the test statistic – the number of black men on the jury panel – in the example about Swain and the Supreme Court:

black_in_sample.hist(1, bins=np.arange(0, 50, 1))


STEP 4. THE CONCLUSION OF THE TEST

The choice between the null and alternative hypotheses depends on the comparison between the results of Steps 2 and 3: the observed test statistic and its distribution as predicted by the null hypothesis.

If the two are consistent with each other, then the observed test statistic is in line with what the null hypothesis predicts. In other words, the test does not point towards the alternative hypothesis; the null hypothesis is better supported by the data.

But if the two are not consistent with each other, as is the case in both of our examples about jury panels, then the data do not support the null hypothesis. In both our examples we concluded that the jury panels were not selected at random. Something other than chance affected their composition.

How is "consistent" defined? Whether the observed test statistic is consistent with its predicted distribution under the null hypothesis is a matter of judgment. We recommend that you provide your judgment along with the value of the test statistic and a graph of its predicted distribution. That will allow your reader to make his or her own judgment about whether the two are consistent.

If you do not want to make your own judgment, there are conventions that you can follow. These conventions are based on what is called the observed significance level or P-value for short. The P-value is a chance computed using the probability distribution of the test statistic, and can be approximated by using the empirical distribution in Step 3.

Practical note on P-values and conventions. Place the observed test statistic on the horizontal axis of the histogram, and find the proportion in the tail starting at that point. That's the P-value.

If the P-value is small, the data support the alternative hypothesis. The conventions for what is "small":


If the P-value is less than 5%, the result is called "statistically significant."

If the P-value is even smaller – less than 1% – the result is called "highly statistically significant."

More formal definition of P-value. The P-value is the chance, under the null hypothesis, that the test statistic is equal to the value that was observed or is even further in the direction of the alternative.

The P-value is based on comparing the observed test statistic and what the null hypothesis predicts. A small P-value implies that under the null hypothesis, the value of the test statistic is unlikely to be near the one we observed. When a hypothesis and the data are not in accord, the hypothesis has to go. That is why we reject the null hypothesis if the P-value is small.
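This definition can be sketched as a small helper function (the name and interface here are ours, not from the text): an empirical P-value is just the proportion of simulated statistics at least as extreme as the observed one, in the direction of the alternative.

```python
import numpy as np

def empirical_p_value(simulated, observed, tail='left'):
    """Proportion of simulated statistics at least as extreme as the
    observed value, in the direction of the alternative."""
    simulated = np.asarray(simulated)
    if tail == 'left':
        return np.mean(simulated <= observed)
    return np.mean(simulated >= observed)

# Toy example: 5 simulated statistics under the null, observed value 2.
p = empirical_p_value([1, 2, 3, 4, 5], 2, tail='left')  # 0.4
```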

The P-value for the null hypothesis that Swain's jury panel was selected at random from the population of Talladega is approximated using the empirical distribution as follows:

np.count_nonzero(black_in_sample.column(1) <= 8) / black_in_sample.num_rows

0.0

HISTORICAL NOTE ON THE CONVENTIONS

The determination of statistical significance, as defined above, has become standard in statistical analyses in all fields of application. When a convention is so universally followed, it is interesting to examine how it arose.

The method of statistical testing – choosing between hypotheses based on data in random samples – was developed by Sir Ronald Fisher in the early 20th century. Sir Ronald might have set the convention for statistical significance somewhat unwittingly, in the following statement in his 1925 book Statistical Methods for Research Workers. About the 5% level, he wrote, "It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not."

What was "convenient" for Sir Ronald became a cutoff that has acquired the status of a universal constant. No matter that Sir Ronald himself made the point that the value was his personal choice from among many: in an article in 1926, he wrote, "If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the author prefers to set a low standard of significance at the 5 per cent point ..."


Fisher knew that "low" is a matter of judgment and has no unique definition. We suggest that you follow his excellent example. Provide your data, make your judgment, and explain why you made it.


Permutation

Comparing Two Groups

In the examples above, we investigated whether a sample appears to be chosen randomly from an underlying population. We did this by comparing the distribution of the sample with the distribution of the population. A similar line of reasoning can be used to compare the distributions of two samples. In particular, we can investigate whether or not two samples appear to be drawn from the same underlying distribution.

Example: Married Couples and Unmarried Partners

Our next example is based on a study conducted in 2010 under the auspices of the National Center for Family and Marriage Research.

In the United States, the proportion of couples who live together but are not married has been rising in recent decades. The study involved a national random sample of over 1,000 heterosexual couples who were either married or "cohabiting partners" – living together but unmarried. One of the goals of the study was to compare the attitudes and experiences of the married and unmarried couples.

The table below shows a subset of the data collected in the study. Each row corresponds to one person. The variables that we will examine in this section are:

MaritalStatus: married or unmarried
EmploymentStatus: one of several categories described below
Gender
Age: age in years

columns = ['MaritalStatus', 'EmploymentStatus', 'Gender', 'Age']

couples = Table.read_table('couples.csv').select(columns)

couples


MaritalStatus  EmploymentStatus           Gender  Age
married        working as paid employee   male    51
married        working as paid employee   female  53
married        working as paid employee   male    57
married        working as paid employee   female  57
married        working as paid employee   male    60
married        working as paid employee   female  57
married        working, self-employed     male    62
married        working as paid employee   female  59
married        not working - other        male    53
married        not working - retired      female  61
... (2056 rows omitted)

Let us consider just the males first. There are 742 married couples and 292 unmarried couples, and all couples in this study had one male and one female, making 1,034 males in all.

# Separate tables for married and cohabiting unmarried couples:

married_men = couples.where('Gender', 'male').where('MaritalStatus', 'married')

partnered_men = couples.where('Gender', 'male').where('MaritalStatus', 'partner')

# Let's see how many married and unmarried people there are:

couples.where('Gender', 'male').group('MaritalStatus').barh("MaritalStatus")


Societal norms have changed over the decades, and there has been a gradual acceptance of couples living together without being married. Thus it is natural to expect that unmarried couples will in general consist of younger people than married couples.

The histograms of the ages of the married and unmarried men show that this is indeed the case. We will draw these histograms and compare them. In order to compare two histograms, both should be drawn to the same scale. Let us write a function that does this for us.

def plot_age(table, subject):
    """
    Draws a histogram of the Age column in the given table.

    table should be a Table with a column of people's ages called Age.
    subject should be a string -- the name of the group we're displaying,
    like "married men".
    """
    # Draw a histogram of ages running from 15 years to 70 years
    table.hist('Age', bins=np.arange(15, 71, 5), unit='year')
    # Set the lower and upper bounds of the vertical axis so that
    # the plots we make are all comparable.
    plt.ylim(0, 0.045)
    plt.title("Ages of " + subject)


# Ages of men:

plot_age(married_men, "married men")

plot_age(partnered_men, "cohabiting unmarried men")

The difference is even more marked when we compare the married and unmarried women.

married_women = couples.where('Gender', 'female').where('MaritalStatus', 'married')

partnered_women = couples.where('Gender', 'female').where('MaritalStatus', 'partner')

# Ages of women:

plot_age(married_women, "married women")

plot_age(partnered_women, "cohabiting unmarried women")


The histograms show that the married men in the sample are in general older than the unmarried cohabiting men. Married women are in general older than unmarried women. These observations are consistent with what we had predicted based on changing social norms.

If married couples are in general older, they might differ from unmarried couples in other ways as well. Let us compare the employment status of the married and unmarried men in the sample.

The table below shows the marital status and employment status of each man in the sample.

males = couples.where('Gender', 'male').select(['MaritalStatus', 'EmploymentStatus'])

males.show(5)


MaritalStatus  EmploymentStatus
married        working as paid employee
married        working as paid employee
married        working as paid employee
married        working, self-employed
married        not working - other
... (1028 rows omitted)

Contingency Tables

To investigate the association between employment and marriage, we would like to be able to ask questions like, "How many married men are retired?"

Recall that the method pivot lets us do exactly that. It cross-classifies each man according to the two variables – marital status and employment status. Its output is a contingency table that contains the counts in each pair of categories.

employment = males.pivot('MaritalStatus', 'EmploymentStatus')

employment

EmploymentStatus                                married  partner
not working - disabled                               44       20
not working - looking for work                       28       33
not working - on a temporary layoff from a job       15        8
not working - other                                  16        9
not working - retired                                44        4
working as paid employee                            513      170
working, self-employed                               82       47

The arguments of pivot are the labels of the two columns corresponding to the variables we are studying. Categories of the first argument appear as columns; categories of the second argument are the rows. Each cell of the table contains the number of men in a pair of categories – a particular employment status and a particular marital status.

The table shows that regardless of marital status, the men in the sample are most likely to be working as paid employees. But it is quite hard to compare the entire distributions based on this table, because the numbers of married and unmarried men in the sample are not the same. There are 742 married men but only 291 unmarried ones.

employment.drop(0).sum()

married  partner
    742      291

To adjust for this difference in total numbers, we will convert the counts into proportions, by dividing all the married counts by 742 and all the partner counts by 291.

proportions = Table().with_columns([
    "EmploymentStatus", employment.column("EmploymentStatus"),
    "married", employment.column('married') / sum(employment.column('married')),
    "partner", employment.column('partner') / sum(employment.column('partner'))
])

proportions.show()

EmploymentStatus                                married    partner
not working - disabled                          0.0592992  0.0687285
not working - looking for work                  0.0377358  0.113402
not working - on a temporary layoff from a job  0.0202156  0.0274914
not working - other                             0.0215633  0.0309278
not working - retired                           0.0592992  0.0137457
working as paid employee                        0.691375   0.584192
working, self-employed                          0.110512   0.161512

The married column of this table shows the distribution of employment status of the married men in the sample. For example, among married men, the proportion who are retired is about 0.059. The partner column shows the distribution of the employment status of the unmarried men in the sample. Among unmarried men, the proportion who are retired is about 0.014.

The two distributions look different from each other in other ways too, as can be seen more clearly in the bar graphs below. It appears that a larger proportion of the married men in the sample work as paid employees, whereas a larger proportion of the unmarried men are not working but are looking for work.


proportions.barh('EmploymentStatus')

The distributions of employment status of the men in the two groups – married and unmarried – are clearly different in the sample.

Are the two distributions different in the population?

This raises the question of whether the difference is due to randomness in the sampling, or whether the distributions of employment status are indeed different for married and unmarried cohabiting men in the U.S. Remember that the data that we have are from a sample of just 1,033 couples; we do not know the distribution of employment status of married or unmarried cohabiting men in the entire country.

We can answer the question by performing a statistical test of hypotheses. Let us use the terminology that we developed for this in the previous section.

Null hypothesis. In the United States, the distribution of employment status among married men is the same as among unmarried men who live with their partners.

Another way of saying this is that employment status and marital status are independent, or not associated.

If the null hypothesis were true, then the difference that we have observed in the sample would be just due to chance.

Alternative hypothesis. In the United States, the distributions of the employment status of the two groups of men are different. In other words, employment status and marital status are associated in some way.

As our test statistic, we will use the total variation distance between the two distributions.
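As a reminder of how the total variation distance works, here is a tiny standalone illustration with made-up proportions (not the survey data): the TVD of two distributions is half the sum of the absolute differences of their category proportions.

```python
import numpy as np

# Toy example: two distributions over the same three categories,
# each summing to 1.  (Made-up numbers, not the couples data.)
dist_a = np.array([0.50, 0.30, 0.20])
dist_b = np.array([0.40, 0.30, 0.30])

# Total variation distance: half the sum of absolute differences.
toy_tvd = 0.5 * sum(abs(dist_a - dist_b))
# 0.5 * (0.10 + 0.00 + 0.10) = 0.10
```

The halving makes the statistic interpretable: it is the largest possible difference between the probabilities that the two distributions assign to any collection of categories.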

The observed value of the test statistic is about 0.15:


# TVD between the two distributions in the sample

married = proportions.column('married')

partner = proportions.column('partner')

observed_tvd = 0.5 * sum(abs(married - partner))

observed_tvd

0.15273571011754242

Random Permutations

In order to compare this observed value of the total variation distance with what is predicted by the null hypothesis, we need to know how the total variation distance would vary across all possible random samples if employment status and marital status were not related.

This is quite daunting to derive by mathematics, but let us see if we can get a good approximation by simulation.

With just one sample at hand, and no further knowledge of the distribution of employment status among men in the United States, how can we go about replicating the sampling procedure? The key is to note that if marital status and employment status were not connected in any way, then we could replicate the sampling process by replacing each man's employment status by a randomly picked employment status from among all the men, married and unmarried.

Doing this for all the men is equivalent to randomly rearranging the entire column containing employment status, while leaving the marital status column unchanged. Such a rearrangement is called a random permutation.

Thus, under the null hypothesis, we can replicate the sampling process by assigning to each man an employment status chosen at random without replacement from the entries in the column EmploymentStatus. We can do the replication by simply permuting the entire EmploymentStatus column and leaving everything else unchanged.
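The shuffling idea can be sketched at the array level before we apply it to the table. This is a minimal standalone illustration with four made-up rows (the labels below are hypothetical, not the survey data): one column is permuted while the other stays fixed, which breaks any association between them.

```python
import numpy as np

# Made-up toy columns, not the couples data.
marital = np.array(['married', 'married', 'partner', 'partner'])
employment = np.array(['employee', 'retired', 'employee', 'looking'])

# Randomly rearrange the employment column; the marital column is
# left untouched, so the pairing between the two is now random.
rng = np.random.default_rng(0)
shuffled_employment = rng.permutation(employment)

# The shuffled column contains exactly the same values, just in a
# random order -- a random permutation, i.e. sampling without
# replacement from the column's own entries.
```

Under the null hypothesis of no association, every such pairing is equally likely, which is why the shuffled tables below are fair stand-ins for new samples.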

Let's implement this plan. First, we will shuffle the EmploymentStatus column using the sample method, which just shuffles all the rows when provided with no arguments.

# Randomly permute the employment status of all men

shuffled = males.select('EmploymentStatus').sample()

shuffled


EmploymentStatus
working, self-employed
working, self-employed
working as paid employee
working as paid employee
working, self-employed
working as paid employee
not working - disabled
working as paid employee
working as paid employee
working as paid employee
... (1023 rows omitted)

The first two columns of the table below are taken from the original sample. The third has been created by randomly permuting the original EmploymentStatus column.

# Construct a table in which employment status has been shuffled

males_with_shuffled_empl = Table().with_columns([
    "MaritalStatus", males.column('MaritalStatus'),
    "EmploymentStatus", males.column('EmploymentStatus'),
    "EmploymentStatus (shuffled)", shuffled.column('EmploymentStatus')
])

males_with_shuffled_empl


MaritalStatus  EmploymentStatus                                EmploymentStatus (shuffled)
married        working as paid employee                        working, self-employed
married        working as paid employee                        working, self-employed
married        working as paid employee                        working as paid employee
married        working, self-employed                          working as paid employee
married        not working - other                             working, self-employed
married        not working - on a temporary layoff from a job  working as paid employee
married        not working - disabled                          not working - disabled
married        working as paid employee                        working as paid employee
married        working as paid employee                        working as paid employee
married        not working - retired                           working as paid employee
... (1023 rows omitted)

Once again, the pivot method computes the contingency table, which allows us to calculate the total variation distance between the distributions of the two groups of men after their employment status has been shuffled.

employment_shuffled = males_with_shuffled_empl.pivot('MaritalStatus', 'EmploymentStatus (shuffled)')

employment_shuffled

EmploymentStatus (shuffled)                     married  partner
not working - disabled                               48       16
not working - looking for work                       44       17
not working - on a temporary layoff from a job       16        7
not working - other                                  18        7
not working - retired                                39        9
working as paid employee                            489      194
working, self-employed                               88       41


# TVD between the two distributions in the contingency table above

e_s = employment_shuffled

married = e_s.column('married') / sum(e_s.column('married'))

partner = e_s.column('partner') / sum(e_s.column('partner'))

0.5 * sum(abs(married - partner))

0.032423745611841297

This total variation distance was computed based on the null hypothesis that the distributions of employment status for the two groups of men are the same. You can see that it is noticeably smaller than the observed value of the total variation distance (0.15) between the two groups in our original sample.

A Permutation Test

Could this just be due to chance variation? We will only know if we run many more replications, by randomly permuting the EmploymentStatus column repeatedly. This method of testing is known as a permutation test.


# Put it all together in a for loop to perform a permutation test

repetitions = 500

tvds = Table().with_column("TVD between married and partnered men", [])

for i in np.arange(repetitions):
    # Construct a permuted table
    shuffled = males.select('EmploymentStatus').sample()
    combined = Table().with_columns([
        "MaritalStatus", males.column('MaritalStatus'),
        "EmploymentStatus", shuffled.column('EmploymentStatus')
    ])
    employment_shuffled = combined.pivot('MaritalStatus', 'EmploymentStatus')
    # Compute TVD
    e_s = employment_shuffled
    married = e_s.column('married') / sum(e_s.column('married'))
    partner = e_s.column('partner') / sum(e_s.column('partner'))
    permutation_tvd = 0.5 * sum(abs(married - partner))
    tvds.append([permutation_tvd])

tvds.hist(bins=np.arange(0, 0.2, 0.01))

The figure above is the empirical distribution of the total variation distance between the distributions of the employment status of married and unmarried men, under the null hypothesis.


The observed test statistic of 0.15 is quite far in the tail, and so the chance of observing such an extreme value under the null hypothesis is close to 0.

As before, this chance is called an empirical P-value. The P-value is the chance that our test statistic (TVD) would come out at least as extreme as the observed value (in this case 0.15 or greater) under the null hypothesis.

Conclusion of the test

Our empirical estimate based on repeated sampling gives us all the information we need for drawing conclusions from the data: the observed statistic is very unlikely under the null hypothesis.

The low P-value constitutes evidence in favor of the alternative hypothesis. The data support the hypothesis that in the United States, the distribution of the employment status of married men is not the same as that of unmarried men who live with their partners.

We have just completed our first permutation test. Permutation tests are quite common in practice because they make very few assumptions about the underlying population and are straightforward to perform and interpret.

Note about the approximate P-value

Our simulation gives us an approximate empirical P-value, because it is based on just 500 random permutations instead of all the possible random permutations. We can compute this empirical P-value directly, without drawing the histogram:

empirical_p_value = np.count_nonzero(tvds.column(0) >= observed_tvd) / repetitions

empirical_p_value

0.0

Computing the exact P-value would require us to consider all possible outcomes of shuffling (a very large number) instead of 500 random shuffles. If we had performed all the possible shuffles, there would have been a few with more extreme TVDs. The true P-value is greater than zero, but not by much.
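To see why enumerating every shuffle is hopeless, we can count them. This is a quick standalone sketch: the number of distinct orderings of n rows is n! (n factorial), which grows explosively.

```python
import math

# Even 15 rows -- the number of footballs inspected in a later example
# in this chapter -- already admit over a trillion orderings:
math.factorial(15)  # 1307674368000

# For a table of about 1,000 rows, n! has over 2,500 digits, so random
# sampling of shuffles is the only practical option.
len(str(math.factorial(1000)))
```

This is why permutation tests settle for a few hundred or thousand random shuffles and report an approximate empirical P-value.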

Generalizing Our Hypothesis Test


The example above includes a substantial amount of code in order to investigate the relationship between two characteristics (marital status and employment status) for a particular subset of the surveyed population (males). Suppose we would like to investigate different characteristics or a different population. How can we reuse the code we have written so far in order to explore more relationships?

When you are about to copy your code, you should think, "Maybe I should write some functions."

What functions to write? A good way to make this decision is to think about what you have to compute repeatedly.

In our example, the total variation distance is computed over and over again. So we will begin with a generalized computation of the total variation distance between the distributions of any column of values (such as employment status) when separated into any two conditions (such as marital status) for a collection of data described by any table. Our implementation includes the same statements as we used above, but uses generic names that are specified by the final function call.

# TVD between the distributions of values under any two conditions

def tvd(t, conditions, values):
    """Compute the total variation distance
    between proportions of values under two conditions.

    t (Table) -- a table
    conditions (str) -- a column label in t; should have only two categories
    values (str) -- a column label in t
    """
    e = t.pivot(conditions, values)
    a = e.column(1)
    b = e.column(2)
    return 0.5 * sum(abs(a / sum(a) - b / sum(b)))

tvd(males, 'MaritalStatus', 'EmploymentStatus')

0.15273571011754242


Next, we can write a function that performs a permutation test using this tvd function to compute the same statistic on shuffled variants of any table. It's worth reading through this implementation to understand its details.

def permutation_tvd(original, conditions, values):
    """
    Perform a permutation test of whether
    the distribution of values for two conditions
    is the same in the population,
    using the total variation distance between two distributions
    as the test statistic.

    original is a Table with two columns. The value of the argument
    conditions is the name of one column, and the value of the argument
    values is the name of the other column. The conditions column should
    have only 2 possible values corresponding to 2 categories in the
    data.

    The values column is shuffled many times, and the data are grouped
    according to the conditions column. The total variation distance
    between the proportions of values in the 2 categories is computed.

    Then we draw a histogram of all those TV distances. This shows us
    what the TVD between the values of the two distributions would typically
    look like if the values were independent of the conditions.
    """
    repetitions = 500
    stats = []
    for i in np.arange(repetitions):
        shuffled = original.sample()
        combined = Table().with_columns([
            conditions, original.column(conditions),
            values, shuffled.column(values)
        ])
        stats.append(tvd(combined, conditions, values))
    observation = tvd(original, conditions, values)
    p_value = np.count_nonzero(np.array(stats) >= observation) / repetitions
    print("Observation:", observation)
    print("Empirical P-value:", p_value)
    Table([stats], ['Empirical distribution']).hist()

permutation_tvd(males, 'MaritalStatus', 'EmploymentStatus')

Observation: 0.152735710118

Empirical P-value: 0.0

Now that we have generalized our permutation test, we can apply it to other hypotheses. For example, we can compare the distribution of employment status among women, grouping them by their marital status. In the case of men we found a difference, but what about with women? First, we can visualize the two distributions.

def compare_bar(t, conditions, values):
    """Bar graphs of distributions of values for each of two conditions."""
    e = t.pivot(conditions, values)
    for label in e.drop(0).labels:
        # Convert each column of counts into proportions
        e.append_column(label, e.column(label) / sum(e.column(label)))
    e.barh(values)

compare_bar(couples.where('Gender', 'female'), 'MaritalStatus', 'EmploymentStatus')


A glance at the figure shows that the two distributions are different in the sample. The difference in the category "not working – other" is particularly striking: about 22% of the married women are in this category, compared to only about 8% of the unmarried women. There are several reasons for this difference. For example, the percent of homemakers is greater among married women than among unmarried women, possibly because married women are more likely to be "stay-at-home" mothers of young children. The difference could also be generational: as we saw earlier, the married couples are older than the unmarried partners, and older women are less likely to be in the workforce than younger women.

While we can see that the distributions are different in the sample, we are not really interested in the sample for its own sake. We are examining the sample because it is likely to reflect the population. So, as before, we will use the sample to try to answer a question about something unknown: the distributions of employment status of married and unmarried cohabiting women in the United States. That is the population from which the sample was drawn.

We have to consider the possibility that the observed difference in the sample could simply be the result of chance variation. Remember that our data are only from a random sample of couples. We do not have data for all the couples in the United States.

Null hypothesis: In the U.S., the distribution of employment status is the same for married women as for unmarried women living with their partners. The difference in the sample is due to chance.

Alternative hypothesis: In the U.S., the distributions of employment status among married and unmarried cohabiting women are different.

Another permutation test to compare distributions

We can test these hypotheses just as we did for men, by using the function permutation_tvd that we defined for this purpose.

permutation_tvd(couples.where('Gender', 'female'), 'MaritalStatus', 'EmploymentStatus')


Observation: 0.194755513565

Empirical P-value: 0.0

As for the males, the empirical P-value is 0 based on a large number of repetitions. So the exact P-value is close to 0, which is evidence in favor of the alternative hypothesis. The data support the hypothesis that for women in the United States, employment status is associated with whether they are married or unmarried and living with their partners.

Another example

Are gender and employment status independent in the population? We are now in a position to test this quite swiftly:

Null hypothesis. Among married and unmarried cohabiting individuals in the United States, gender is independent of employment status.

Alternative hypothesis. Among married and unmarried cohabiting people in the United States, gender and employment status are related.

permutation_tvd(couples, 'Gender', 'EmploymentStatus')

Observation: 0.185866408519

Empirical P-value: 0.0


The conclusion of the test is that gender and employment status are not independent in the population. This is no surprise; for example, because of societal norms, older women were less likely to have gone into the workforce than men.

Deflategate: Permutation Tests and Quantitative Variables

On January 18, 2015, the Indianapolis Colts and the New England Patriots played the American Football Conference (AFC) championship game to determine which of those teams would play in the Super Bowl. After the game, there were allegations that the Patriots' footballs had not been inflated as much as the regulations required; they were softer. This could be an advantage, as softer balls might be easier to catch.

For several weeks, the world of American football was consumed by accusations, denials, theories, and suspicions: the press labeled the topic Deflategate, after the Watergate political scandal of the 1970's. The National Football League (NFL) commissioned an independent analysis. In this example, we will perform our own analysis of the data.

Pressure is often measured in pounds per square inch (psi). NFL rules stipulate that game balls must be inflated to have pressures in the range 12.5 psi to 13.5 psi. Each team plays with 12 balls. Teams have the responsibility of maintaining the pressure in their own footballs, but game officials inspect the balls. Before the start of the AFC game, all the Patriots' balls were at about 12.5 psi. Most of the Colts' balls were at about 13.0 psi. However, these pre-game data were not recorded.

During the second quarter, the Colts intercepted a Patriots ball. On the sidelines, they measured the pressure of the ball and determined that it was below the 12.5 psi threshold. Promptly, they informed officials.


At half-time, all the game balls were collected for inspection. Two officials, Clete Blakeman and Dyrol Prioleau, measured the pressure in each of the balls. Here are the data; pressure is measured in psi. The Patriots ball that had been intercepted by the Colts was not inspected at half-time. Nor were most of the Colts' balls – the officials simply ran out of time and had to relinquish the balls for the start of second half play.

football = Table.read_table('football.csv')

football.show()

Team  Ball         Blakeman  Prioleau
0     Patriots 1      11.5      11.8
0     Patriots 2      10.85     11.2
0     Patriots 3      11.15     11.5
0     Patriots 4      10.7      11
0     Patriots 5      11.1      11.45
0     Patriots 6      11.6      11.95
0     Patriots 7      11.85     12.3
0     Patriots 8      11.1      11.55
0     Patriots 9      10.95     11.35
0     Patriots 10     10.5      10.9
0     Patriots 11     10.9      11.35
1     Colts 1         12.7      12.35
1     Colts 2         12.75     12.3
1     Colts 3         12.5      12.95
1     Colts 4         12.55     12.15

For each of the 15 balls that were inspected, the two officials got different results. It is not uncommon that repeated measurements on the same object yield different results, especially when the measurements are performed by different people. So we will assign to each ball the average of the two measurements made on that ball.


football = football.with_column(
    'Combined', (football.column('Blakeman') + football.column('Prioleau')) / 2
)

football.show()

Team  Ball         Blakeman  Prioleau  Combined
0     Patriots 1      11.5      11.8     11.65
0     Patriots 2      10.85     11.2     11.025
0     Patriots 3      11.15     11.5     11.325
0     Patriots 4      10.7      11       10.85
0     Patriots 5      11.1      11.45    11.275
0     Patriots 6      11.6      11.95    11.775
0     Patriots 7      11.85     12.3     12.075
0     Patriots 8      11.1      11.55    11.325
0     Patriots 9      10.95     11.35    11.15
0     Patriots 10     10.5      10.9     10.7
0     Patriots 11     10.9      11.35    11.125
1     Colts 1         12.7      12.35    12.525
1     Colts 2         12.75     12.3     12.525
1     Colts 3         12.5      12.95    12.725
1     Colts 4         12.55     12.15    12.35

At a glance, it seems apparent that the Patriots' footballs were at a lower pressure than the Colts' balls. Because some deflation is normal during the course of a game, the independent analysts decided to calculate the drop in pressure from the start of the game. Recall that the Patriots' balls had all started out at about 12.5 psi, and the Colts' balls at about 13.0 psi. Therefore the drop in pressure for the Patriots' balls was computed as 12.5 minus the pressure at half-time, and the drop in pressure for the Colts' balls was 13.0 minus the pressure at half-time.


football = football.with_column(
    'Drop', np.array([12.5] * 11 + [13.0] * 4) - football.column('Combined')
)

football.show()

Team  Ball         Blakeman  Prioleau  Combined  Drop
0     Patriots 1      11.5      11.8     11.65    0.85
0     Patriots 2      10.85     11.2     11.025   1.475
0     Patriots 3      11.15     11.5     11.325   1.175
0     Patriots 4      10.7      11       10.85    1.65
0     Patriots 5      11.1      11.45    11.275   1.225
0     Patriots 6      11.6      11.95    11.775   0.725
0     Patriots 7      11.85     12.3     12.075   0.425
0     Patriots 8      11.1      11.55    11.325   1.175
0     Patriots 9      10.95     11.35    11.15    1.35
0     Patriots 10     10.5      10.9     10.7     1.8
0     Patriots 11     10.9      11.35    11.125   1.375
1     Colts 1         12.7      12.35    12.525   0.475
1     Colts 2         12.75     12.3     12.525   0.475
1     Colts 3         12.5      12.95    12.725   0.275
1     Colts 4         12.55     12.15    12.35    0.65

It is apparent that the drop was larger, on average, for the Patriots' footballs. But could the difference be just due to chance?

To answer this, we must first examine how chance might enter the analysis. This is not a situation in which there is a random sample of data from a large population. It is also not clear how to create a justifiable abstract chance model, as the balls were all different, inflated by different people, and maintained under different conditions.

One way to introduce chance is to ask whether the drops in pressures of the 11 Patriots balls and the 4 Colts balls resemble a random permutation of the 15 drops. Then the 4 Colts drops would be a simple random sample of all 15 drops. This gives us a null hypothesis that we can test using random permutations.


Null hypothesis. The drops in the pressures of the 4 Colts balls are like a random sample (without replacement) from all 15 drops.

A new test statistic

The data are quantitative, so we cannot compare the two distributions category by category using the total variation distance. If we try to bin the data in order to use the TVD, the choice of bins can have a noticeable effect on the statistic. So instead, we will work with a simple statistic based on means. We will just compare the average drops in the two groups.

The observed difference between the average drops in the two groups was about 0.7335 psi.

patriots = football.where('Team', 0).column('Drop')

colts = football.where('Team', 1).column('Drop')

observed_difference = np.mean(patriots) - np.mean(colts)

observed_difference

0.73352272727272805

Now the question becomes: If we took a random permutation of the 15 drops, how likely is it that the difference in the means of the first 11 and the last 4 would be at least as large as the difference observed by the officials?

To answer this, we will randomly permute the 15 drops, assigning the first 11 permuted values to the Patriots and the last 4 to the Colts. Then we will find the difference in the means of the two permuted groups.

drops_shuffled = football.sample().column('Drop')

patriots_shuffled = drops_shuffled[:11]

colts_shuffled = drops_shuffled[11:]

shuffled_difference = np.mean(patriots_shuffled) - np.mean(colts_shuffled)

shuffled_difference

0.010000000000000675


This is different from the observed value we calculated earlier. But to get a better sense of the variability under random sampling we must repeat the process many times. Let us try making 5,000 repetitions and drawing a histogram of the 5,000 differences between means.

repetitions = 5000

test_stats = []

for i in range(repetitions):
    drops_shuffled = football.sample().column('Drop')
    patriots_shuffled = drops_shuffled[:11]
    colts_shuffled = drops_shuffled[11:]
    shuffled_difference = np.mean(patriots_shuffled) - np.mean(colts_shuffled)
    test_stats.append(shuffled_difference)

observation = np.mean(patriots) - np.mean(colts)

emp_p_value = np.count_nonzero(np.array(test_stats) >= observation) / repetitions

differences = Table().with_column('Difference in Means', test_stats)

differences.hist(bins=20)

print("Observation:", observation)

print("Empirical P-value:", emp_p_value)

Observation: 0.733522727273

Empirical P-value: 0.0028


The observed difference was roughly 0.7335 psi. According to the empirical distribution above, there is a very small chance that a random permutation would yield a difference that large. So the data support the conclusion that the two groups of pressures were not like a random permutation of all 15 pressures.

The independent investigative team analyzed the data in several different ways, taking into account the laws of physics. The final report said,

"[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi, depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls and 13.0 for the Colts balls."

-- Investigative report commissioned by the NFL regarding the AFC Championship game on January 18, 2015

Our analysis shows an average pressure drop of about 0.73 psi, which is consistent with the official analysis.

The all-important question in the football world was whether the excess drop of pressure in the Patriots' footballs was deliberate. To that question, the data have no answer. If you are curious about the answer given by the investigators, here is the full report.


A/BTestingInteract

UsingtheBootstrapMethodtoTestHypotheses¶Wehaveusedrandompermutationstoseewhethertwosamplesaredrawnfromthesameunderlyingdistribution.Thebootstrapmethodcanalsobeusedinsuchstatisticaltestsofhypotheses.

The examples in this section are based on data on a random sample of 1,174 pairs of mothers and their newborn infants. The table baby contains the data. Each row represents a mother-baby pair. The variables are:

- the baby's birth weight in ounces
- the number of gestational days
- the mother's age in completed years
- the mother's height in inches
- the mother's pregnancy weight in pounds
- whether or not the mother was a smoker

baby = Table.read_table('baby.csv')
baby


Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker
120          | 284              | 27           | 62              | 100                       | False
113          | 282              | 33           | 64              | 135                       | False
128          | 279              | 28           | 64              | 115                       | True
108          | 282              | 23           | 67              | 125                       | True
136          | 286              | 25           | 62              | 93                        | False
138          | 244              | 33           | 62              | 178                       | False
132          | 245              | 23           | 65              | 140                       | False
120          | 289              | 25           | 62              | 125                       | False
143          | 299              | 30           | 66              | 136                       | True
140          | 351              | 27           | 68              | 120                       | False

... (1164 rows omitted)

Suppose we want to compare the mothers who smoke and the mothers who are non-smokers. Do they differ in any way other than smoking? A key variable of interest is the birth weight of their babies. To study this variable, we begin by noting that there are 715 non-smokers among the women in the sample, and 459 smokers.

baby.group('Maternal Smoker')

Maternal Smoker | count
False           | 715
True            | 459

The first histogram below displays the distribution of birth weights of the babies of the non-smokers in the sample. The second displays the birth weights of the babies of the smokers.

baby.where('Maternal Smoker', False).hist('Birth Weight', bins=np.arange(40, 181, 5))  # bin choice assumed; original bins truncated


baby.where('Maternal Smoker', True).hist('Birth Weight', bins=np.arange(40, 181, 5))  # bin choice assumed; original bins truncated

Both distributions are approximately bell shaped and centered near 120 ounces. The distributions are not identical, of course, which raises the question of whether the difference reflects just chance variation or a difference in the distributions in the population.

This question can be answered by a test of the null and alternative hypotheses below. Because we are testing for the equality of two distributions, one for Category A (mothers who don't smoke) and the other for Category B (mothers who smoke), the method is known rather unimaginatively as A/B testing.

Null hypothesis. In the population, the distribution of birth weights of babies is the same for mothers who don't smoke as for mothers who do. The difference in the sample is due to chance.

Alternative hypothesis. The two distributions are different in the population.


Test statistic. Birth weight is a quantitative variable, so it is reasonable to use the difference between means as the test statistic. There are other reasonable test statistics as well; the difference between means is just one simple choice.

The observed difference between the means of the two groups in the sample is about 9.27 ounces.

nonsmokers_mean = np.mean(baby.where('Maternal Smoker', False).column('Birth Weight'))
smokers_mean = np.mean(baby.where('Maternal Smoker', True).column('Birth Weight'))
nonsmokers_mean - smokers_mean

9.2661425720249184

Implementing the bootstrap method

To see whether such a difference could have arisen due to chance under the null hypothesis, we could use a permutation test just as we did in the previous section. An alternative is to use the bootstrap, which we will do here.

Under the null hypothesis, the distributions of birth weights are the same for babies of women who smoke and for babies of those who don't. To see how much the distributions could vary due to chance alone, we will use the bootstrap method and draw new samples at random with replacement from the entire set of birth weights. The only difference between this and the permutation test of the previous section is that random permutations are based on sampling without replacement.
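The distinction between the two resampling schemes can be sketched with plain NumPy, using hypothetical stand-in values rather than the birth-weight data:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(10)  # hypothetical data values

# Permutation test: shuffle, i.e. sample WITHOUT replacement;
# every original value appears exactly once.
permuted = rng.permutation(values)

# Bootstrap: sample WITH replacement, same size as the original;
# some values may repeat and others may be missing.
resampled = rng.choice(values, size=len(values), replace=True)

print(sorted(permuted) == sorted(values))   # True: a permutation keeps all the values
```

A bootstrap resample, by contrast, generally loses some values and duplicates others, which is exactly the difference described above.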

We will perform 10,000 repetitions of the bootstrap process. There are 1,174 babies, so in each repetition of the bootstrap process we draw 1,174 times at random with replacement. Of these 1,174 draws, 715 are assigned to the non-smoking mothers and the remaining 459 to the mothers who smoke.

The code starts by sorting the rows so that the rows corresponding to the 715 non-smokers appear first. This eliminates the need to use conditional expressions to identify the rows corresponding to smokers and non-smokers in each replication of the sampling process.


"""Bootstrap test for the difference in mean birth weight

Category A: non-smoker; Category B: smoker"""

# Sort so that the 715 non-smokers (False) appear first
sorted_by_smoking = baby.select(['Maternal Smoker', 'Birth Weight']).sort('Maternal Smoker')

# Calculate the observed difference in means
meanA = np.mean(sorted_by_smoking.column('Birth Weight')[:715])
meanB = np.mean(sorted_by_smoking.column('Birth Weight')[715:])
observed_difference = meanA - meanB

repetitions = 10000
diffs = []

for i in range(repetitions):
    # Sample WITH replacement, same number as original sample size
    resample = sorted_by_smoking.sample(with_replacement=True)
    # Compute the difference of the means of the resampled values,
    # between Categories A and B
    diff_means = np.mean(resample.column('Birth Weight')[:715]) - \
                 np.mean(resample.column('Birth Weight')[715:])
    diffs.append(diff_means)

# Display results
differences = Table().with_column('diff_in_means', diffs)
differences.hist(bins=20)
plots.xlabel('Approx null distribution of difference in means')
plots.title('Bootstrap A/B Test')
plots.xlim(-4, 4)
plots.ylim(0, 0.4)
print('Observed difference in means:', observed_difference)

Observed difference in means: 9.26614257202


The figure shows that under the null hypothesis of equal distributions in the population, the bootstrap empirical distribution of the difference between the sample means of the two groups is roughly bell shaped, centered at 0, stretching from about -4 ounces to 4 ounces. The observed difference in the original sample is about 9.27 ounces, which is inconsistent with this distribution. So the conclusion of the test is that in the population, the distributions of birth weights of the babies of non-smokers and smokers are different.

Bootstrap A/B testing

Let us define a function bootstrap_AB_test that performs an A/B test using the bootstrap method and the difference in sample means as the test statistic. The null hypothesis is that the two underlying distributions in the population are equal; the alternative is that they are not.

The arguments of the function are:

- the name of the table that contains the data in the original sample
- the label of the column containing the code 0 for Category A and 1 for Category B
- the label of the column containing the values of the response variable (that is, the variable whose distribution is of interest)
- the number of repetitions of the resampling procedure

The function returns the observed difference in means, the bootstrap empirical distribution of the difference in means, and the bootstrap empirical P-value. Because the alternative simply says that the two underlying distributions are different, the P-value is computed as the proportion of sampled differences that are at least as large in absolute value as the absolute value of the observed difference.
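That two-sided P-value computation can be sketched on its own, with hypothetical resampled differences standing in for the bootstrap output:

```python
import numpy as np

# Hypothetical resampled differences and an observed difference
diffs = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.5, 3.0])
observed_difference = 1.5

# Two-sided empirical P-value: the proportion of resampled differences
# at least as extreme (in absolute value) as the observed difference
empirical_p = np.count_nonzero(np.abs(diffs) >= abs(observed_difference)) / len(diffs)
print(empirical_p)   # 3 of the 7 values qualify, so about 0.429
```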


"""Bootstrap A/B test for the difference in the mean response

Assumes A = 0, B = 1"""

def bootstrap_AB_test(table, categories, values, repetitions):

    # Sort the sample table according to the A/B column;
    # then select only the column of effects.
    response = table.sort(categories).select(values)

    # Find the number of entries in Category A.
    n_A = table.where(categories, 0).num_rows

    # Calculate the observed value of the test statistic.
    meanA = np.mean(response.column(values)[:n_A])
    meanB = np.mean(response.column(values)[n_A:])
    observed_difference = meanA - meanB

    # Run the bootstrap procedure and get a list of resampled differences in means
    diffs = []
    for i in range(repetitions):
        resample = response.sample(with_replacement=True)
        d = np.mean(resample.column(values)[:n_A]) - \
            np.mean(resample.column(values)[n_A:])
        diffs.append(d)

    # Compute the bootstrap empirical P-value
    diff_array = np.array(diffs)
    empirical_p = np.count_nonzero(abs(diff_array) >= abs(observed_difference)) / repetitions

    # Display results
    differences = Table().with_column('diff_in_means', diffs)
    differences.hist(bins=20, normed=True)
    plots.xlabel('Approx null distribution of difference in means')
    plots.title('Bootstrap A/B Test')
    print("Observed difference in means:", observed_difference)
    print("Bootstrap empirical P-value:", empirical_p)

We can now use the function bootstrap_AB_test to compare the smokers and non-smokers with respect to several different response variables. The tests show a statistically significant difference between the two groups in birth weight (as shown earlier), gestational days, maternal age, and maternal pregnancy weight. It comes as no surprise that the two groups do not differ significantly in their mean heights.

bootstrap_AB_test(baby, 'Maternal Smoker', 'Birth Weight', 10000)

Observed difference in means: 9.26614257202
Bootstrap empirical P-value: 0.0

bootstrap_AB_test(baby, 'Maternal Smoker', 'Gestational Days', 10000)

Observed difference in means: 1.97652238829
Bootstrap empirical P-value: 0.0367


bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Age', 10000)

Observed difference in means: 0.80767250179
Bootstrap empirical P-value: 0.0192

bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Pregnancy Weight', 10000)

Observed difference in means: 2.56033030151
Bootstrap empirical P-value: 0.0374


bootstrap_AB_test(baby, 'Maternal Smoker', 'Maternal Height', 10000)

Observed difference in means: -0.0905891494127
Bootstrap empirical P-value: 0.5411


Regression Inference

Regression: a Review

Earlier in the course, we developed the concepts of correlation and regression as ways to describe relations between quantitative variables. We will now see how these concepts can become powerful tools for inference, when used appropriately.

First, let us briefly review the main ideas of correlation and regression, along with code for calculations.

It is often convenient to measure variables in standard units. When you convert a value to standard units, you are measuring how many standard deviations above average the value is.

The function standard_units takes an array as its argument and returns the array converted to standard units.

def standard_units(x):
    return (x - np.mean(x)) / np.std(x)

The correlation between two variables measures the degree to which a scatter plot of the two variables is clustered around a straight line. It has no units and is a number between -1 and 1. It is calculated as the average of the product of the two variables, when both variables are measured in standard units.

The function correlation takes three arguments – the name of a table and the labels of two columns of the table – and returns the correlation between the two columns.

def correlation(table, x, y):
    x_in_standard_units = standard_units(table.column(x))
    y_in_standard_units = standard_units(table.column(y))
    return np.mean(x_in_standard_units * y_in_standard_units)
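As a sanity check, the same definition applied to hypothetical plain arrays agrees with NumPy's built-in correlation coefficient:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

def standard_units(a):
    return (a - np.mean(a)) / np.std(a)

# Correlation as defined above: average product of standard units
r = np.mean(standard_units(x) * standard_units(y))

print(np.isclose(r, np.corrcoef(x, y)[0, 1]))   # True
```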

Here is the scatter plot of birth weight (the y-variable) versus gestational days (the x-variable) of the babies in the table baby. The plot shows a positive association, and correlation returns a value of just over 0.4.


correlation(baby, 'Gestational Days', 'Birth Weight')

0.40754279338885108

baby.scatter('Gestational Days', 'Birth Weight')

When scatter is called with the option fit_line=True, the regression line is drawn on the scatter plot.

baby.scatter('Gestational Days', 'Birth Weight', fit_line=True)


The regression line is the best among all straight lines that could be used to estimate y based on x, in the sense that it minimizes the mean squared error of estimation.

The slope of the regression line is given by:

    slope = r * (SD of y) / (SD of x)

where r is the correlation.

The intercept of the regression line is given by:

    intercept = (mean of y) - slope * (mean of x)

The functions slope and intercept calculate these two quantities. Each takes the same three arguments as correlation: the name of the table and the labels of columns x and y, in that order.

def slope(table, x, y):
    r = correlation(table, x, y)
    return r * np.std(table.column(y)) / np.std(table.column(x))

def intercept(table, x, y):
    a = slope(table, x, y)
    return np.mean(table.column(y)) - a * np.mean(table.column(x))
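These formulas give the same line as a direct least-squares fit. A quick check with hypothetical arrays:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

# Slope and intercept from the formulas above
r = np.mean(((x - x.mean()) / x.std()) * ((y - y.mean()) / y.std()))
a = r * y.std() / x.std()
b = y.mean() - a * x.mean()

# Least-squares line of degree 1, fit directly by NumPy
a_np, b_np = np.polyfit(x, y, 1)
print(np.allclose([a, b], [a_np, b_np]))   # True
```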


We can now calculate the slope and the intercept of the regression line drawn above. The slope is about 0.47 ounces per gestational day.

slope(baby, 'Gestational Days', 'Birth Weight')

0.46655687694921522

The intercept is about -10.75 ounces.

intercept(baby, 'Gestational Days', 'Birth Weight')

-10.754138914450252

Thus the equation of the regression line for estimating birth weight based on gestational days is approximately:

    estimated birth weight = 0.47 * (gestational days) - 10.75 ounces

The fitted value at a given value of x is the estimate of y based on that value of x. In other words, the fitted value at a given value of x is the height of the regression line at that x.

The function fitted_value computes this height. Like the functions correlation, slope, and intercept, its arguments include the name of the table and the labels of the x and y columns. But it also requires a fourth argument, which is the value of x at which the estimate will be made.

def fitted_value(table, x, y, given_x):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * given_x + b

The fitted value at 300 gestational days is about 129.2 ounces. For a pregnancy that has length 300 gestational days, our estimate for the baby's weight is about 129.2 ounces.

fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)


129.2129241703143

The function fit returns an array consisting of the fitted values at all of the values of x in the table.

def fit(table, x, y):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * table.column(x) + b

As we know, the Table method scatter can be used to draw a scatter plot with a regression line through it. Here is another way, using fit. The function scatter_fit takes the same three arguments as fit and returns a scatter diagram along with the straight line of fitted values.

def scatter_fit(table, x, y):
    plots.scatter(table.column(x), table.column(y), s=15)
    plots.plot(table.column(x), fit(table, x, y), lw=2, color='darkblue')
    plots.xlabel(x)
    plots.ylabel(y)

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

# The next two lines just make the plot easier to read
plots.ylim([0, 200])
None


Assumptions of randomness: a "regression model"

Thus far in this section, our analysis has been purely descriptive. But recall that the data are a random sample of all the births in a system of hospitals. What do the data say about the whole population of births? What could they say about a new birth at one of the hospitals?

Questions of inference may arise if we believe that a scatter plot reflects the underlying relation between the two variables being plotted but does not specify the relation completely. For example, a scatter plot of birth weight versus gestational days shows us the precise relation between the two variables in our sample; but we might wonder whether that relation holds true, or almost true, for all babies in the population from which the sample was drawn, or indeed among babies in general.

As always, inferential thinking begins with a careful examination of the assumptions about the data. Sets of assumptions are known as models. Sets of assumptions about randomness in roughly linear scatter plots are called regression models.

In brief, such models say that the underlying relation between the two variables is perfectly linear; this straight line is the signal that we would like to identify. However, we are not able to see the line clearly. What we see are points that are scattered around the line. In each of the points, the signal has been contaminated by random noise. Our inferential goal, therefore, is to separate the signal from the noise.

In greater detail, the regression model specifies that the points in the scatter plot are generated at random as follows.

The relation between x and y is perfectly linear. We cannot see this "true line" but Tyche can. She is the Goddess of Chance. Tyche creates the scatter plot by taking points on the line and pushing them off the line vertically, either above or below, as follows:

For each x, Tyche finds the corresponding point on the true line, and then adds an error. The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0. Tyche creates a point whose horizontal coordinate is x and whose vertical coordinate is "the height of the true line at x, plus the error".

Finally, Tyche erases the true line from the scatter, and shows us just the scatter plot of her points.

Based on this scatter plot, how should we estimate the true line? The best line that we can put through a scatter plot is the regression line. So the regression line is a natural estimate of the true line.

The simulation below shows how close the regression line is to the true line. The first panel shows how Tyche generates the scatter plot from the true line; the second shows the scatter plot that we see; the third shows the regression line through the plot; and the fourth shows both the regression line and the true line.

To run the simulation, call the function draw_and_compare with three arguments: the slope of the true line, the intercept of the true line, and the sample size.

Run the simulation a few times, with different values for the slope and intercept of the true line, and varying sample sizes. Because all the points are generated according to the model, you will see that the regression line is a good estimate of the true line if the sample size is moderately large.


def draw_and_compare(true_slope, true_int, sample_size):
    x = np.random.normal(50, 5, sample_size)
    xlims = np.array([np.min(x), np.max(x)])
    eps = np.random.normal(0, 6, sample_size)
    y = (true_slope * x + true_int) + eps
    tyche = Table([x, y], ['x', 'y'])

    plots.figure(figsize=(6, 16))
    plots.subplot(4, 1, 1)
    plots.scatter(tyche['x'], tyche['y'], s=15)
    plots.plot(xlims, true_slope * xlims + true_int, lw=2, color='green')
    plots.title('What Tyche draws')

    plots.subplot(4, 1, 2)
    plots.scatter(tyche['x'], tyche['y'], s=15)
    plots.title('What we get to see')

    plots.subplot(4, 1, 3)
    scatter_fit(tyche, 'x', 'y')
    plots.xlabel("")
    plots.ylabel("")
    plots.title('Regression line: our estimate of true line')

    plots.subplot(4, 1, 4)
    scatter_fit(tyche, 'x', 'y')
    xlims = np.array([np.min(tyche['x']), np.max(tyche['x'])])
    plots.plot(xlims, true_slope * xlims + true_int, lw=2, color='green')
    plots.title("Regression line and true line")

# Tyche's true line,
# the points she creates,
# and our estimate of the true line.
# Arguments: true slope, true intercept, number of points
draw_and_compare(3, -5, 25)


Prediction using Regression

In reality, of course, we are not Tyche, and we will never see the true line. What the simulation shows is that if the regression model looks plausible, and if we have a large sample, then the regression line is a good approximation to the true line.

The scatter diagram of birth weights versus gestational days looks roughly linear. Let us assume that the regression model holds. Suppose now that at the hospital there is a new baby who has 300 gestational days. We can use the regression line to predict the birth weight of this baby.

As we saw earlier, the fitted value at 300 gestational days was about 129.2 ounces. That is our prediction for the birth weight of the new baby.

fit_300 = fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)
fit_300

129.2129241703143

The figure below shows where the prediction lies on the regression line. The red line is at x = 300.

scatter_fit(baby, 'Gestational Days', 'Birth Weight')
plots.scatter(300, fit_300, color='red', s=20)
plots.plot([300, 300], [0, fit_300], color='red', lw=1)
plots.ylim([0, 200])
None


The Variability of the Prediction

We have developed a method for predicting a new baby's birth weight based on the number of gestational days. But as data scientists, we know that the sample might have been different. Had the sample been different, the regression line would have been different too, and so would our prediction. To see how good our prediction is, we must get a sense of how variable the prediction can be.

One way to do this would be by generating new random samples of points and making a prediction based on each new sample. To generate new samples, we can bootstrap the scatter plot.

Specifically, we can simulate new samples by random sampling with replacement from the original scatter plot, as many times as there are points in the scatter.

Here is the original scatter diagram from the sample, and four replications of the bootstrap resampling procedure. Notice how the resampled scatter plots are in general a little more sparse than the original. That is because some of the original points do not get selected in the samples.
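How much sparser? In a resample of size n drawn with replacement, the expected fraction of distinct original points is 1 - (1 - 1/n)^n, which is close to 1 - 1/e, or about 63%, for large n. A quick simulation with plain NumPy, using the sample size of the baby table:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1174                                         # number of points in the scatter
resample = rng.choice(n, size=n, replace=True)   # indices drawn with replacement
distinct_fraction = len(np.unique(resample)) / n
print(round(distinct_fraction, 2))               # typically close to 0.63
```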


plots.figure(figsize=(8, 18))
plots.subplot(5, 1, 1)
plots.scatter(baby[1], baby[0], s=10)
plots.xlim([150, 400])
plots.title('Original sample')

for i in np.arange(1, 5, 1):
    plots.subplot(5, 1, i + 1)
    rep = baby.sample(with_replacement=True)
    plots.scatter(rep[1], rep[0], s=10)
    plots.xlim([150, 400])
    plots.title('Bootstrap sample ' + str(i))


The next step is to fit the regression line to the scatter plot in each replication, and make a prediction based on each line. The figure below shows 10 such lines, and the corresponding predicted birth weight at 300 gestational days.

x = 300
lines = Table(['slope', 'intercept'])

for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

lines['prediction at x=' + str(x)] = lines.column('slope') * x + lines.column('intercept')

xlims = np.array([291, 309])
left = xlims[0] * lines[0] + lines[1]
right = xlims[1] * lines[0] + lines[1]
fit_x = x * lines['slope'] + lines['intercept']

for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x, fit_x[i], s=20)

The predictions vary from one line to the next. The table below shows the slope and intercept of each of the 10 lines, along with the prediction.


lines

slope    | intercept | prediction at x=300
0.446731 | -5.08136  | 128.938
0.417925 | 3.65604   | 129.034
0.395636 | 8.95328   | 127.644
0.512511 | -22.9981  | 130.755
0.477034 | -13.4303  | 129.68
0.515773 | -24.5881  | 130.144
0.48213  | -15.7025  | 128.936
0.492106 | -18.7225  | 128.909
0.434989 | -1.72409  | 128.773
0.486076 | -16.5506  | 129.272

Bootstrap Prediction Interval

If we increase the number of repetitions of the resampling process, we can generate an empirical histogram of the predictions. This will allow us to create a prediction interval, using methods like those we used earlier to create bootstrap confidence intervals for numerical parameters.

Let us define a function called bootstrap_prediction to do this. The function takes five arguments:

- the name of the table
- the column labels of the predictor and response variables, in that order
- the value of x at which to make the prediction
- the desired number of bootstrap repetitions

In each repetition, the function bootstraps the original scatter plot and calculates the equation of the regression line through the bootstrapped plot. It then uses this line to find the predicted value of y based on the specified value of x. Specifically, it calls the function fitted_value that we defined earlier in this section, to calculate the slope and intercept of the regression line and then calculate the fitted value at the specified x.


Finally, it collects all of the predicted values into a column of a table, draws the empirical histogram of these values, and prints the interval consisting of the "middle 95%" of the predicted values. It also prints the predicted value based on the regression line through the original scatter plot.

# Bootstrap prediction of variable y at new_x
# Data contained in table; prediction by regression of y based on x
# repetitions = number of bootstrap replications of the original scatter plot

def bootstrap_prediction(table, x, y, new_x, repetitions):

    # For each repetition:
    # Bootstrap the scatter;
    # get the regression prediction at new_x;
    # augment the predictions list
    predictions = []
    for i in range(repetitions):
        bootstrap_sample = table.sample(with_replacement=True)
        predicted_value = fitted_value(bootstrap_sample, x, y, new_x)
        predictions.append(predicted_value)

    # Prediction based on original sample
    original = fitted_value(table, x, y, new_x)

    # Display results
    pred = Table().with_column('Prediction', predictions)
    pred.hist(bins=20)
    plots.xlabel('predictions at x=' + str(new_x))
    print('Height of regression line at x=' + str(new_x) + ':', original)
    print('Approximate 95%-confidence interval:')
    print((pred.percentile(2.5).rows[0][0], pred.percentile(97.5).rows[0][0]))

bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 300, 5000)

Height of regression line at x=300: 129.21292417
Approximate 95%-confidence interval:
(127.25683835787581, 131.33042912205815)


The figure above shows a bootstrap empirical histogram of the predicted birth weight of a baby at 300 gestational days, based on 5,000 repetitions of the bootstrap process. The empirical distribution is roughly normal.

An approximate 95% prediction interval has been constructed by taking the "middle 95%" of the predictions, that is, the interval from the 2.5th percentile to the 97.5th percentile of the predictions. The interval ranges from about 127 to about 131. The prediction based on the original sample was about 129, which is close to the center of the interval.

The figure below shows the histogram of 5,000 bootstrap predictions at 285 gestational days. The prediction based on the original sample is about 122 ounces, and the interval ranges from about 121 ounces to about 123 ounces.

bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 285, 5000)

Height of regression line at x=285: 122.214571016
Approximate 95%-confidence interval:
(121.17047827189757, 123.29506281889765)


Notice that this interval is narrower than the prediction interval at 300 gestational days. Let us investigate the reason for this.

The mean number of gestational days is about 279 days:

np.mean(baby['Gestational Days'])

279.10136286201021

So 285 is nearer to the center of the distribution than 300 is. Typically, the regression lines based on the bootstrap samples are closer to each other near the center of the distribution of the predictor variable. Therefore all of the predicted values are closer together as well. This explains the narrower width of the prediction interval.

You can see this in the figure below, which shows predictions at x = 300 and x = 285 for each of ten bootstrap replications. Typically, the lines are farther apart at x = 300 than at x = 285, and therefore the predictions at x = 300 are more variable.


x1 = 300
x2 = 285
lines = Table(['slope', 'intercept'])

for i in range(10):
    rep = baby.sample(with_replacement=True)
    a = slope(rep, 'Gestational Days', 'Birth Weight')
    b = intercept(rep, 'Gestational Days', 'Birth Weight')
    lines.append([a, b])

xlims = np.array([260, 310])
left = xlims[0] * lines[0] + lines[1]
right = xlims[1] * lines[0] + lines[1]
fit_x1 = x1 * lines['slope'] + lines['intercept']
fit_x2 = x2 * lines['slope'] + lines['intercept']

plots.xlim(xlims)
for i in range(10):
    plots.plot(xlims, np.array([left[i], right[i]]), lw=1)
    plots.scatter(x1, fit_x1[i], s=20)
    plots.scatter(x2, fit_x2[i], s=20)

Is the observed slope real?


We have developed a method of predicting the birth weight of a new baby in the hospital system, based on the baby's number of gestational days. Our prediction makes use of the positive association that we observed between birth weight and gestational days.

But what if that observation is spurious? In other words, what if Tyche's true line was flat, and the positive association that we observed was just due to randomness in the points that she generated?

Here is a simulation that illustrates why this question arises. We will once again call the function draw_and_compare, this time requiring Tyche's true line to have slope 0. Our goal is to see whether our regression line shows a slope that is not 0.

Remember that the arguments to the function draw_and_compare are the slope and the intercept of Tyche's true line, and the number of points that she generates.

draw_and_compare(0, 10, 25)


Run the simulation a few times, keeping the slope of Tyche's line 0 each time. You will notice that while the slope of the true line is 0, the slope of the regression line typically is not 0. The regression line sometimes slopes upwards, and sometimes downwards, each time giving us a false impression that the two variables are correlated.

Testing whether the slope of the true line is 0

These simulations show that before we put too much faith in our regression predictions, it is worth checking whether the true line does indeed have a slope that is not 0.

We are in a good position to do this, because we know how to bootstrap the scatter diagram and draw a regression line through each bootstrapped plot. Each of those lines has a slope, so we can simply collect all the slopes and draw their empirical histogram. We can then construct an approximate 95% confidence interval for the slope of the true line, using the bootstrap percentile method that is now so familiar.

If the confidence interval for the true slope does not contain 0, then we can be about 95% confident that the true slope is not 0. If the interval does contain 0, then we cannot conclude that there is a genuine linear association between the variable that we are trying to predict and the variable that we have chosen to use as a predictor. We might therefore want to look for a different predictor variable on which to base our predictions.

The function bootstrap_slope executes our plan. Its arguments are the name of the table and the labels of the predictor and response variables, and the desired number of bootstrap replications. In each replication, the function bootstraps the original scatter plot and calculates the slope of the resulting regression line. It then draws the histogram of all the generated slopes and prints the interval consisting of the "middle 95%" of the slopes.

Notice that the code for bootstrap_slope is almost identical to that for bootstrap_prediction, except that now all we need from each replication is the slope, not a predicted value.


def bootstrap_slope(table, x, y, repetitions):

    # For each repetition:
    # Bootstrap the scatter, get the slope of the regression line,
    # augment the list of generated slopes
    slopes = []
    for i in range(repetitions):
        bootstrap_scatter = table.sample(with_replacement=True)
        slopes.append(slope(bootstrap_scatter, x, y))

    # Slope of the regression line from the original sample
    observed_slope = slope(table, x, y)

    # Display results
    slopes = Table().with_column('slopes', slopes)
    slopes.hist(bins=20)
    print('Slope of regression line:', observed_slope)
    print('Approximate 95%-confidence interval for the true slope:')
    print((slopes.percentile(2.5).rows[0][0], slopes.percentile(97.5).rows[0][0]))

Let us call bootstrap_slope to see if the true linear relation between birth weight and gestational days has slope 0; in other words, if Tyche's true line is flat.

bootstrap_slope(baby, 'Gestational Days', 'Birth Weight', 5000)

Slope of regression line: 0.466556876949
Approximate 95%-confidence interval for the true slope:
(0.37981209488027268, 0.55753912103512426)


The approximate 95% confidence interval for the true slope runs from about 0.38 ounces per day to about 0.56 ounces per day. This interval does not contain 0. Therefore we can conclude, with about 95% confidence, that the slope of the true line is not 0.

This conclusion supports our choice of using gestational days to predict birth weight.

Suppose now we try to estimate the birth weight of the baby based on the mother's age. Based on the sample, the slope of the regression line for estimating birth weight based on maternal age is positive, about 0.08 ounces per year:

slope(baby, 'Maternal Age', 'Birth Weight')

0.085007669415825132

But an approximate 95% bootstrap confidence interval for the true slope has a negative left endpoint and a positive right endpoint – in other words, the interval contains 0. Therefore we cannot reject the hypothesis that the slope of the true linear relation between maternal age and baby's birth weight is 0. Based on this observation, it would be unwise to predict birth weight based on the regression model with maternal age as the predictor.

bootstrap_slope(baby, 'Maternal Age', 'Birth Weight', 5000)

Slope of regression line: 0.0850076694158
Approximate 95%-confidence interval for the true slope:
(-0.10015674488514391, 0.26974979944992733)


Using confidence intervals to test hypotheses

Notice how we have used confidence intervals for the true slope to test the hypothesis that the true slope is 0. Constructing a confidence interval for a parameter is one way to test the null hypothesis that the parameter has a specified value. If that value is not in the confidence interval, we can reject the null hypothesis that the parameter has that value. If the value is in the confidence interval, we do not have enough evidence to reject the null hypothesis.

If the sample size is large and if the empirical distribution of the estimate of the parameter is roughly normal, then this method of testing via confidence intervals is justifiable. Using a 95% confidence interval corresponds to using a 5% cutoff for the P-value. The justifications of these statements are rather technical, and we will leave them to a more advanced course. For now, it is enough to note that confidence intervals can be used to test hypotheses in this way.
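The decision rule can be sketched in a few lines, using hypothetical bootstrapped slopes in place of real resampling output:

```python
import numpy as np

# Hypothetical bootstrapped slopes, clustered away from 0
boot_slopes = np.array([0.38, 0.41, 0.44, 0.47, 0.50, 0.53, 0.56])

left = np.percentile(boot_slopes, 2.5)
right = np.percentile(boot_slopes, 97.5)

# Test at the 5% level: reject "true slope is 0" iff 0 lies outside the interval
reject_zero_slope = not (left <= 0 <= right)
print(reject_zero_slope)   # True: the interval is entirely above 0
```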

Indeed, the method is more constructive than other methods of testing hypotheses, because not only does it result in a decision about whether or not to reject the null hypothesis, it also provides an estimate of the parameter. For example, in the example of using gestational days to estimate birth weights, we didn't just conclude that the true slope was not 0 – we also estimated that the true slope is between 0.38 and 0.56 ounces per day.

Words of caution

All of the predictions and tests that we have performed in this section assume that the regression model holds. Specifically, the methods assume that the scatter plot resembles points generated by starting with points that are on a straight line and then pushing them off the line by adding random noise.


If the scatter plot does not look like that, then perhaps the model does not hold for the data. If the model does not hold, then calculations that assume the model to be true are not valid.

Therefore, we must first decide whether the regression model holds for our data, before we start making predictions based on the model or testing hypotheses about parameters of the model. A simple way is to do what we did in this section, which is to draw the scatter diagram of the two variables and see whether it looks roughly linear and evenly spread out around a line. In the next section, we will study some more tools for assessing the model.


Regression Diagnostics

The methods that we have developed for inference in regression are valid if the regression model holds for our data. If the data do not satisfy the assumptions of the model, then it can be hard to justify calculations that assume that the model is good.

In this section we will develop some simple descriptive methods that can help us decide whether our data satisfy the regression model.

We will start with a review of the details of the model, which specifies that the points in the scatter plot are generated at random as follows.

Tyche creates the scatter plot by starting with points that lie perfectly on a straight line and then pushing them off the line vertically, either above or below, as follows:

For each $x$, Tyche finds the corresponding point on the true line, and then adds an error. The errors are drawn at random with replacement from a population of errors that has a normal distribution with mean 0 and an unknown standard deviation. Tyche creates a point whose horizontal coordinate is $x$ and whose vertical coordinate is "the height of the true line at $x$, plus the error". If the error is positive, the point is above the line. If the error is negative, the point is below the line.

We get to see the resulting points, but not the true line on which the points started out, nor the random errors that Tyche added.
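The model is easy to simulate; a minimal sketch in numpy, where the slope, intercept, and error SD are made-up values:

```python
import numpy as np

rng = np.random.default_rng(7)
true_slope, true_intercept, error_sd = 0.5, 10.0, 2.0  # hidden in practice

x = rng.uniform(0, 20, 300)
errors = rng.normal(0, error_sd, 300)             # mean-0 normal errors
y = true_intercept + true_slope * x + errors      # true line height + error

# We observe only the pairs (x, y); the true line and the individual
# errors stay hidden, exactly as in the model described above.
```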

First Diagnostic: The Scatter Plot

It is always helpful to start with a visualization. Continuing the example of estimating birth weight based on gestational days, let us look at a scatter plot of the two variables.

baby.scatter('Gestational Days', 'Birth Weight')


The points do appear to be roughly clustered around a straight line. There is no strikingly noticeable curve in the scatter.

A plot of the regression line through the scatter plot shows that about half the points are above the line and half are below the line, almost symmetrically, supporting the model's assumption that Tyche is adding random errors that are normally distributed with mean 0.

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

Second Diagnostic: A Residual Plot


We cannot see the true line that is the basis of the regression model, but the regression line that we calculate acts as a substitute for it. So also the vertical distances of the points from the regression line act as substitutes for the random errors that Tyche uses to push the points vertically off the true line.

As we have seen earlier, the vertical distances of the points from the regression line are called residuals. There is one residual for each point in the scatter plot. The residual is the difference between the observed value of $y$ and the fitted value of $y$:

For the point $(x, y)$,

$$\text{residual} = y - \hat{y}$$

where $\hat{y}$ is the fitted value at $x$.

The table baby_regression contains the data, the fitted values, and the residuals:

baby2 = baby.select(['Gestational Days', 'Birth Weight'])

fitted = fit(baby, 'Gestational Days', 'Birth Weight')

residuals = baby.column('Birth Weight') - fitted

baby_regression = baby2.with_columns([
    'Fitted Value', fitted,
    'Residual', residuals
])

baby_regression

Gestational Days  Birth Weight  Fitted Value  Residual
284  120  121.748  -1.74801
282  113  120.815  -7.8149
279  128  119.415  8.58477
282  108  120.815  -12.8149
286  136  122.681  13.3189
244  138  103.086  34.9143
245  132  103.552  28.4477
289  120  124.081  -4.0808
299  143  128.746  14.2536
351  140  153.007  -13.0073
... (1164 rows omitted)
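The same fitted values and residuals can be computed without the book's fit helper; a sketch in plain numpy, on a few made-up rows standing in for the gestational days and birth weights above:

```python
import numpy as np

def fitted_values(x, y):
    """Heights of the least-squares regression line at each x."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope * x + intercept

# Made-up values in the spirit of the table above.
x = np.array([284., 282., 279., 282., 286., 244., 245., 289., 299., 351.])
y = np.array([120., 113., 128., 108., 136., 138., 132., 120., 143., 140.])

fitted = fitted_values(x, y)
residuals = y - fitted        # one residual per point
```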


Once again, it helps to visualize. The figure below shows four of the residuals; for each of the four points, the length of the red segment is the residual for that point. There is one such residual for each of the other 1,170 points as well.

points = Table().with_columns([
    'Gestational Days', [148, 244, 338, 351],
    'Birth Weight', [116, 138, 102, 140]
])

scatter_fit(baby, 'Gestational Days', 'Birth Weight')

for i in range(4):
    f = fitted_value(baby, 'Gestational Days', 'Birth Weight', points.column(0)[i])
    plots.plot([points.column(0)[i]]*2, [f, points.column(1)[i]], color='r')

According to the regression model, Tyche generates errors by drawing at random with replacement from a population of errors that is normally distributed with mean 0. To see whether this looks plausible for our data, we will examine the residuals, which are our substitutes for Tyche's errors.

A residual plot can be drawn by plotting the residuals against $x$. The function residual_plot does just that. The function regression_diagnostic_plots draws the original scatter plot as well as the residual plot for ease of comparison.


def residual_plot(table, x, y):
    fitted = fit(table, x, y)
    residuals = table[y] - fitted
    t = Table().with_columns([
        x, table[x],
        'residuals', residuals
    ])
    t.scatter(x, 'residuals', color='r')
    xlims = [min(table[x]), max(table[x])]
    plots.plot(xlims, [0, 0], color='darkblue', lw=2)
    plots.title('Residual Plot')

def regression_diagnostic_plots(table, x, y):
    scatter_fit(table, x, y)
    residual_plot(table, x, y)

regression_diagnostic_plots(baby, 'Gestational Days', 'Birth Weight')


The height of each red point in the plot above is a residual. Notice how the residuals are distributed fairly symmetrically above and below the horizontal line at 0. Notice also that the vertical spread of the plot is fairly even across the most common values of $x$, in about the 200-300 range of gestational days. In other words, apart from a few outlying points, the plot isn't narrower in some places and wider in others.

All of this is consistent with the model's assumptions that the errors are drawn at random with replacement from a normally distributed population with mean 0. Indeed, a histogram of the residuals is centered at 0 and looks roughly bell shaped.

baby_regression.select('Residual').hist()


These observations show us that the data are fairly consistent with the assumptions of the regression model. Therefore it makes sense to do inference based on the model, as we did in the previous section.

Using the Residual Plot to Detect Nonlinearity

Residual plots can be used to detect whether the regression model does not hold. The most serious departure from the model is non-linearity.

If two variables have a non-linear relation, the non-linearity is sometimes visible in the scatter plot. Often, however, it is easier to spot non-linearity in a residual plot than in the original scatter plot. This is usually because of the scales of the two plots: the residual plot allows us to zoom in on the errors and hence makes it easier to spot patterns.
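The effect is easy to reproduce on synthetic data (all values here are made up): fit a straight line to points generated from a curve, and the residuals show a systematic sign pattern.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 200))
y = (x - 5) ** 2 + rng.normal(0, 1, 200)   # quadratic relation plus noise

slope, intercept = np.polyfit(x, y, 1)     # force a straight-line fit
residuals = y - (slope * x + intercept)

# A line through a U-shaped cloud underestimates at the ends and
# overestimates in the middle, so the average residual is positive
# at the extremes of x and negative in the middle.
low  = residuals[x < 2].mean()
mid  = residuals[(x > 4) & (x < 6)].mean()
high = residuals[x > 8].mean()
```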

As an example, we will study a data set on the age and length of dugongs, which are marine mammals related to manatees and sea cows. The data are in a table called dugong. Age is measured in years and length in meters. The ages are estimated; most dugongs don't keep track of their birthdays.

dugong = Table.read_table('http://www.statsci.org/data/oz/dugongs.txt')

dugong


Age   Length
1     1.8
1.5   1.85
1.5   1.87
1.5   1.77
2.5   2.02
4     2.27
5     2.15
5     2.26
7     2.35
8     2.47
... (17 rows omitted)

Here is a regression of length (the response) on age (the predictor). The correlation between the two variables is about 0.83, which is quite high.

correlation(dugong, 'Age', 'Length')

0.82964745549057139

scatter_fit(dugong, 'Age', 'Length')


High correlation notwithstanding, the plot shows a curved pattern that is much more visible in the residual plot.

regression_diagnostic_plots(dugong, 'Age', 'Length')


At the low end of ages, the residuals are almost all negative; then they are almost all positive; then negative again at the high end of ages. Such a discernible pattern would have been unlikely had Tyche been picking errors at random with replacement and using them to push points off a straight line. Therefore the regression model does not hold for these data.

As another example, the data set us_women contains the average weights of U.S. women of different heights in the 1970s. Weights are measured in pounds and heights in inches. The correlation is spectacularly high – more than 0.99.

us_women = Table.read_table('us_women.csv')

correlation(us_women, 'height', 'ave weight')

0.99549476778421608

The points seem to hug the regression line pretty closely, but the residual plot flashes a warning sign: it is U-shaped, indicating that the relation between the two variables is not linear.

regression_diagnostic_plots(us_women, 'height', 'ave weight')


These examples demonstrate that even when correlation is high, fitting a straight line might not be the right thing to do. Residual plots help us decide whether or not to use linear regression.

Using the Residual Plot to Detect Heteroscedasticity

Heteroscedasticity is a word that will surely be of interest to those who are preparing for Spelling Bees. For data scientists, its interest lies in its meaning, which is "uneven spread".

Recall the table hybrid that contains data on hybrid cars in the U.S. Here is a regression of fuel efficiency on the rate of acceleration. The association is negative: cars that accelerate quickly tend to be less efficient.


regression_diagnostic_plots(hybrid, 'acceleration', 'mpg')

Notice how the residual plot flares out towards the low end of the accelerations. In other words, the variability in the size of the errors is greater for low values of acceleration than for high values. Such a pattern would be unlikely under the regression model, because the model assumes that Tyche picks the errors at random with replacement from the same normally distributed population.

Uneven vertical variation in a residual plot can indicate that the regression model does not hold. Uneven variation is often more easily noticed in a residual plot than in the original scatter plot.
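A rough numerical check for uneven spread, sketched on made-up data (not a formal test): split the points at the median of $x$ and compare the SD of the residuals in the two halves.

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.sort(rng.uniform(0, 10, 400))
noise_sd = 0.5 + 0.5 * x                 # error spread grows with x
y = 2 * x + rng.normal(0, noise_sd)      # heteroscedastic errors

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

left_sd  = residuals[x < np.median(x)].std()
right_sd = residuals[x >= np.median(x)].std()
ratio = right_sd / left_sd   # well above 1 here: the spread is uneven
```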


The Accuracy of the Regression Estimate

We will end our discussion of regression by pointing out some interesting mathematical facts that lead to a measure of the accuracy of regression. We will leave the proofs to another course, but we will demonstrate the facts by examples. Note that all of the facts hold for all scatter plots, whether or not the regression model is good.

Fact 1. No matter what the shape of the scatter diagram, the residuals have an average of 0.

In all the residual plots above, you have seen the horizontal line at 0 going through the center of the plot. That is a visualization of this fact.

As a numerical example, here is the average of the residuals in the regression of birth weight on gestational days of babies:

np.mean(baby_regression.column('Residual'))

-6.3912532279613784e-15

That doesn't look like 0, but in fact it is a very small number that is 0 apart from rounding error.

Here is the average of the residuals in the regression of the length of dugongs on their age:

dugong_residuals = dugong.column('Length') - fit(dugong, 'Age', 'Length')

np.mean(dugong_residuals)

-2.8783559897689243e-16

Once again the mean of the residuals is 0 apart from rounding error.

This is analogous to the fact that if you take any list of numbers and calculate the list of deviations from average, the average of the deviations is 0.

Fact 2. No matter what the shape of the scatter diagram, the SD of the fitted values is $|r|$ times the SD of the observed values of $y$.

To understand this result, it is worth noting that the fitted values are all on the regression line, whereas the observed values of $y$ are the heights of all the points in the scatter plot and are more variable.


scatter_fit(baby, 'Gestational Days', 'Birth Weight')

The fitted values of birth weight are, therefore, less variable than the observed values of birth weight. But how much less variable?

The scatter plot shows a correlation of about 0.4:

correlation(baby, 'Gestational Days', 'Birth Weight')

0.40754279338885108

Here is the ratio of the SD of the fitted values and the SD of the observed values of birth weight:

fitted_birth_weight = baby_regression.column('Fitted Value')

observed_birth_weight = baby_regression.column('Birth Weight')

np.std(fitted_birth_weight) / np.std(observed_birth_weight)

0.40754279338885108

The ratio is equal to $r$. Thus the SD of the fitted values is $r$ times the SD of the observed values of $y$.

The example of fuel efficiency and acceleration is one where the correlation is negative:


correlation(hybrid, 'acceleration', 'mpg')

-0.5060703843771186

fitted_mpg = fit(hybrid, 'acceleration', 'mpg')

observed_mpg = hybrid.column('mpg')

np.std(fitted_mpg) / np.std(observed_mpg)

0.5060703843771186

The ratio of the two SDs is $|r|$. Thus the SD of the fitted values is $|r|$ times the SD of the observed values of fuel efficiency.

The relation says that the closer the scatter diagram is to a straight line, the closer the fitted values will be to the observed values of $y$, and therefore the closer the variance of the fitted values will be to the variance of the observed values.

Therefore, the closer the scatter diagram is to a straight line, the closer $|r|$ will be to 1, and the closer $r$ will be to 1 or -1.

Fact 3. No matter what the shape of the scatter diagram, the SD of the residuals is $\sqrt{1-r^2}$ times the SD of the observed values of $y$. In other words, the root mean squared error of regression is $\sqrt{1-r^2}$ times the SD of $y$.

This fact gives a measure of the accuracy of the regression estimate.
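All three facts can be checked numerically on any data set; here is a sketch on made-up data, with np.polyfit standing in for the book's fit function:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 3 * x + rng.normal(0, 4, 500)     # any scatter of points would do

slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted
r = np.corrcoef(x, y)[0, 1]

fact1 = np.mean(residuals)              # Fact 1: equals 0
fact2 = np.std(fitted) / np.std(y)      # Fact 2: equals |r|
fact3 = np.std(residuals) / np.std(y)   # Fact 3: equals sqrt(1 - r^2)
```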

Again, we will demonstrate this by example. In the case of gestational days and birth weights, $r$ is about 0.4 and $\sqrt{1-r^2}$ is about 0.91:

r = correlation(baby, 'Gestational Days', 'Birth Weight')

np.sqrt(1 - r**2)

0.91318611003278638


residuals = baby_regression.column('Residual')

observed_birth_weight = baby_regression.column('Birth Weight')

np.std(residuals) / np.std(observed_birth_weight)

0.9131861100327866

Thus the SD of the residuals is $\sqrt{1-r^2}$ times the SD of the observed birth weights.

It will come as no surprise that the same relation is true for the regression of fuel efficiency on acceleration:

r = correlation(hybrid, 'acceleration', 'mpg')

np.sqrt(1 - r**2)

0.862492183185677

residuals = hybrid.column('mpg') - fitted_mpg

observed_mpg = hybrid.column('mpg')

np.std(residuals) / np.std(observed_mpg)

0.862492183185677

Let us examine the extreme cases of the general fact that the SD of the residuals is $\sqrt{1-r^2}$ times the SD of the observed values of $y$. Remember that the residuals always average out to 0.

If $r = 1$ or $r = -1$, then $\sqrt{1-r^2} = 0$. Therefore the residuals have an average of 0 and an SD of 0 as well; therefore, the residuals are all equal to 0. This is consistent with the observation that if $r = \pm 1$, the scatter plot is a perfect straight line and there is no error in the regression estimate.

If $r = 0$, then $\sqrt{1-r^2} = 1$ and the SD of the regression estimate is equal to the SD of $y$. This is consistent with the observation that if $r = 0$ then the regression line is a flat line at the average of $y$, and the root mean squared error of regression is the root mean squared deviation from the average of $y$, which is the SD of $y$.


For every other value of $r$, the rough size of the error of the regression estimate is somewhere between 0 and the SD of $y$.


Chapter 5

Probability


Conditional Probability

Inspired by Peter Norvig's A Concrete Introduction to Probability

Probability is the study of the observations that are generated by sampling from a known distribution. The statistical methods we have studied so far allow us to reason about the world from data. Probability allows us to reason in the opposite direction: what observations result from known facts about the world.

In practice, we rarely know precisely how random processes in the world work. Even the simplest random process, such as flipping a coin, is not as simple as one might assume. Nonetheless, our ability to reason about what data will result from a known random process is useful in many aspects of data science. Fundamentally, the rules of probability allow us to reason about the consequences of assumptions we make.

Probability Distributions

Probability uses the following vocabulary to describe the relationship between known distributions and their observed outcomes.

- Experiment: An occurrence with an uncertain outcome that we can observe. For example, the result of rolling two dice in sequence.
- Outcome: The result of an experiment; one particular state of the world. For example: the first die comes up 4 and the second comes up 2. This outcome could be summarized as (4, 2).
- Sample Space: The set of all possible outcomes for the experiment. For example, (1, 1), (1, 2), (1, 3), ..., (1, 6), (2, 1), (2, 2), ... (6, 4), (6, 5), (6, 6) are all possible outcomes of rolling two dice in sequence. There are 6 * 6 = 36 different outcomes.
- Event: A subset of possible outcomes that together have some property we are interested in. For example, the event "the two dice sum to 5" is the set of outcomes (1, 4), (2, 3), (3, 2), (4, 1).
- Probability: The proportion of experiments for which the event occurs. For example, the probability that the two dice sum to 5 is 4/36, or 1/9.
- Distribution: The probability of all events.
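These definitions translate directly into code; a sketch in plain Python, enumerating the 36-outcome sample space for two dice rolled in sequence:

```python
from fractions import Fraction

# Sample space: all 36 outcomes of rolling two dice in sequence.
sample_space = [(first, second) for first in range(1, 7)
                                for second in range(1, 7)]

# Event: the subset of outcomes where the two dice sum to 5.
sums_to_5 = [o for o in sample_space if o[0] + o[1] == 5]

# Probability: the proportion of the sample space in the event.
chance = Fraction(len(sums_to_5), len(sample_space))   # 4/36 = 1/9
```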


The important part of this terminology is that the sample space is the set of all outcomes, and an event is a subset of the sample space. For an outcome that is an element of an event, we say that it is an outcome for which that event occurs.

Multinomial Distributions

There are many kinds of distributions, but we will focus on the case where the sample space is a fixed, finite set of mutually exclusive outcomes, called a multinomial distribution.

For example, either the two dice come up (2, 1) or they come up (4, 5) in a single roll (but both cannot occur), and there are exactly 36 outcome possibilities that each occur with equal chance.

two_dice = Table(['First', 'Second', 'Chance'])

for first in np.arange(1, 7):
    for second in np.arange(1, 7):
        two_dice.append([first, second, 1/36])

two_dice.set_format('Chance', PercentFormatter(1))

First  Second  Chance
1  1  2.8%
1  2  2.8%
1  3  2.8%
1  4  2.8%
1  5  2.8%
1  6  2.8%
2  1  2.8%
2  2  2.8%
2  3  2.8%
2  4  2.8%
... (26 rows omitted)

The outcomes are not always equally likely. In the example of two dice rolled in sequence, there are 36 different outcomes that each have a 1/36 chance of occurring. If two dice are rolled together and only the sum of the result is observed, instead there are 11 outcomes that have various different chances.


two_dice_sums = Table(['Sum', 'Chance']).with_rows([
    [2, 1/36], [3, 2/36], [4, 3/36], [5, 4/36], [6, 5/36], [7, 6/36],
    [12, 1/36], [11, 2/36], [10, 3/36], [9, 4/36], [8, 5/36],
]).sort(0)

two_dice_sums.set_format('Chance', PercentFormatter(1))

Sum  Chance
2  2.8%
3  5.6%
4  8.3%
5  11.1%
6  13.9%
7  16.7%
8  13.9%
9  11.1%
10  8.3%
11  5.6%
... (1 row omitted)

Outcomes and Events

These two different dice distributions are closely related to each other. Let's see how.

In a multinomial distribution, each outcome alone is an event with its own chance (or probability). The chances of all outcomes sum to 1. The probability of any event is the sum of the probabilities of the outcomes in that event. Therefore, by writing down the chance of each outcome, we implicitly describe the probability of every event for the whole distribution.

Let's consider the event that the sum of the dice in two sequential rolls is 5. We can represent that event as the matching rows in the two_dice table.

dice_sums = two_dice.column('First') + two_dice.column('Second')

sum_of_5 = two_dice.where(dice_sums == 5)

sum_of_5


First  Second  Chance
1  4  2.8%
2  3  2.8%
3  2  2.8%
4  1  2.8%

The probability of the event is the sum of the resulting chances.

sum(sum_of_5.column('Chance'))

0.1111111111111111

If we consistently store the probabilities of individual outcomes in a column called Chance, then we can define the probability of any event using the P function.

def P(event):
    return sum(event.column('Chance'))

P(sum_of_5)

0.1111111111111111

One distribution's events can be another distribution's outcomes, as is the case with the two_dice and two_dice_sums distributions. There are 11 different possible sums in the two_dice table, and these are 11 different mutually exclusive events.

with_sums = two_dice.with_column('Sum', dice_sums)

with_sums


First  Second  Chance  Sum
1  1  2.8%  2
1  2  2.8%  3
1  3  2.8%  4
1  4  2.8%  5
1  5  2.8%  6
1  6  2.8%  7
2  1  2.8%  3
2  2  2.8%  4
2  3  2.8%  5
2  4  2.8%  6
... (26 rows omitted)

Grouping these outcomes by their sum shows how many different outcomes appear in each event.

with_sums.group('Sum')

Sum  count
2  1
3  2
4  3
5  4
6  5
7  6
8  5
9  4
10  3
11  2
... (1 row omitted)

Finally, the chance of each sum event is the total chance of all outcomes that match the sum event. In this way, we can generate the two_dice_sums distribution via addition.


grouped = with_sums.select(['Sum', 'Chance']).group('Sum', sum)

grouped.relabeled(1, 'Chance').set_format('Chance', PercentFormatter(1))

Sum  Chance
2  2.8%
3  5.6%
4  8.3%
5  11.1%
6  13.9%
7  16.7%
8  13.9%
9  11.1%
10  8.3%
11  5.6%
... (1 row omitted)

Both distributions will assign the same probability to certain events. For example, the probability of the event that the sum of the dice is 8 can be computed from either distribution.

P(with_sums.where('Sum', 8))

0.1388888888888889

P(two_dice_sums.where('Sum', 8))

0.1388888888888889

U.S. Birth Times

Here's an example based on data collected in the world.


More babies in the United States are born in the morning than at night. In 2013, the Centers for Disease Control measured the following proportions of babies for each hour of the day. The Time column is a description of each hour, and the Hour column is its numeric equivalent.

birth = Table.read_table('birth_time.csv').select(['Time', 'Hour', 'Chance'])

birth.set_format('Chance', PercentFormatter(1)).show()


Time  Hour  Chance
6 a.m.  6  2.9%
7 a.m.  7  4.5%
8 a.m.  8  6.3%
9 a.m.  9  5.0%
10 a.m.  10  5.0%
11 a.m.  11  5.0%
Noon  12  6.0%
1 p.m.  13  5.7%
2 p.m.  14  5.1%
3 p.m.  15  4.8%
4 p.m.  16  4.9%
5 p.m.  17  5.0%
6 p.m.  18  4.5%
7 p.m.  19  4.0%
8 p.m.  20  4.0%
9 p.m.  21  3.7%
10 p.m.  22  3.5%
11 p.m.  23  3.3%
Midnight  0  2.9%
1 a.m.  1  2.9%
2 a.m.  2  2.8%
3 a.m.  3  2.7%
4 a.m.  4  2.7%
5 a.m.  5  2.8%

If we assume that this distribution describes the world correctly, what is the chance that a baby will be born between 8 a.m. and 6 p.m.? It turns out that more than half of all babies are born during "business hours", even though this time interval only covers 5/12 of the day.

business_hours = birth.where('Hour', are.between(8, 18))

business_hours


Time  Hour  Chance
8 a.m.  8  6.3%
9 a.m.  9  5.0%
10 a.m.  10  5.0%
11 a.m.  11  5.0%
Noon  12  6.0%
1 p.m.  13  5.7%
2 p.m.  14  5.1%
3 p.m.  15  4.8%
4 p.m.  16  4.9%
5 p.m.  17  5.0%

P(business_hours)

0.52800000000000002

On the other hand, births late at night are uncommon. The chance of having a baby between midnight and 6 a.m. is much less than 25%, even though this time interval covers 1/4 of the day.

P(birth.where('Hour', are.between(0, 6)))

0.16799999999999998

Conditional Distributions

A common way to transform one distribution into another is to condition on an event. Conditioning means assuming that the event occurs. The conditional distribution given an event takes the chances of all outcomes for which the event occurs and scales up their chances so that they sum to 1.

To scale up the chances, we divide the chance of each outcome in the event by the probability of the event. For example, if we condition on the event that the sum of two dice is above 8, we arrive at the following conditional distribution.
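In plain Python the same scaling looks like this (a sketch, with the distribution stored in a dict rather than a Table):

```python
# Distribution of the sum of two dice: chance of each outcome.
sums = {s: sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == s) / 36
        for s in range(2, 13)}

# Condition on the event "sum is above 8": keep those outcomes and
# divide each chance by the probability of the event, so they sum to 1.
event = {s: c for s, c in sums.items() if s > 8}
p_event = sum(event.values())                  # 10/36
given_above_8 = {s: c / p_event for s, c in event.items()}
# given_above_8 assigns 40%, 30%, 20%, 10% to sums 9, 10, 11, 12.
```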


above_8 = two_dice_sums.where('Sum', are.above(8))

given_8 = above_8.with_column('Chance', above_8.column('Chance') / P(above_8))

given_8

Sum  Chance
9  40.0%
10  30.0%
11  20.0%
12  10.0%

A conditional distribution describes the chances under the original distribution, updated to reflect a new piece of information. In this case, we see the chances under the original two_dice_sums distribution, given that the sum is above 8.

The conditioning operation can be expressed by the given function, which takes an event table and returns the corresponding conditional distribution.

def given(event):
    return event.with_column('Chance', event.column('Chance') / P(event))

given(two_dice_sums.where('Sum', are.above(8)))

Sum  Chance
9  40.0%
10  30.0%
11  20.0%
12  10.0%

Given that a baby is born in the US during business hours (between 8 a.m. and 6 p.m.), what are the chances that it is born before noon? To answer this question, we first form the conditional distribution given business hours, then compute the probability of the event that the baby is born before noon.

given(business_hours)


Time  Hour  Chance
8 a.m.  8  11.9%
9 a.m.  9  9.5%
10 a.m.  10  9.5%
11 a.m.  11  9.5%
Noon  12  11.4%
1 p.m.  13  10.8%
2 p.m.  14  9.7%
3 p.m.  15  9.1%
4 p.m.  16  9.3%
5 p.m.  17  9.5%

P(given(business_hours).where('Hour', are.below(12)))

0.40340909090909094

This result is not the same as the chance that a baby is born between 8 a.m. and noon in general.

morning = birth.where('Hour', are.between(8, 12))

P(morning)

0.21300000000000002

Nor is it the chance that a baby is born before noon in general.

P(birth.where('Hour', are.below(12)))

0.45500000000000007

Instead, it is the probability that both the event and the conditioning event are true (i.e., that the birth is in the morning), divided by the probability of the conditioning event (i.e., that the birth is during business hours).


P(morning) / P(business_hours)

0.40340909090909094

The morning event is a joint event that can be described as both during business hours and before noon. A joint event is just an event, but one that is described by the intersection of two other events.

business_hours.where('Hour', are.below(12))

Time  Hour  Chance
8 a.m.  8  6.3%
9 a.m.  9  5.0%
10 a.m.  10  5.0%
11 a.m.  11  5.0%

In general, the probability of an event B (e.g., before noon) conditioned on an event A (e.g., during business hours) is the probability of the joint event of A and B divided by the probability of the conditioning event A.

P(business_hours.where('Hour', are.below(12))) / P(business_hours)

0.40340909090909094

The standard notation for this relationship uses a vertical bar for conditioning and a comma for a joint event:

$$P(B \mid A) = \frac{P(A, B)}{P(A)}$$

Joint Distributions

A joint event is one in which two different events both occur. Similarly, a joint outcome is an outcome that is composed of two different outcomes. For example, the outcomes of the two_dice distribution are each joint outcomes of the first and second dice rolls.


two_dice

First  Second  Chance
1  1  2.8%
1  2  2.8%
1  3  2.8%
1  4  2.8%
1  5  2.8%
1  6  2.8%
2  1  2.8%
2  2  2.8%
2  3  2.8%
2  4  2.8%
... (26 rows omitted)

Another common way to transform distributions is to form a joint distribution from conditional distributions. The definition of a conditional probability provides a formula for computing a joint probability, simply by multiplying both sides of the equation above by $P(A)$:

$$P(A, B) = P(B \mid A) \, P(A)$$

The chance of each outcome of a joint distribution can be computed from the probability of the first outcome and the conditional probability of the second outcome given the first outcome. In the two_dice table above, the chance of any first outcome is 1/6, and the chance of any second outcome given any first outcome is 1/6, so the chance of any joint outcome is 1/36, or 2.8%.
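As a sketch in plain Python, multiplying the chance of the first roll by the conditional chance of the second roll given the first reproduces the two_dice chances:

```python
from fractions import Fraction

p_first = {d: Fraction(1, 6) for d in range(1, 7)}
# For fair dice, the second roll does not depend on the first,
# so every conditional chance P(second | first) is also 1/6.
p_second_given_first = {d: Fraction(1, 6) for d in range(1, 7)}

# Joint chance of each outcome: P(first) * P(second | first).
joint = {(a, b): p_first[a] * p_second_given_first[b]
         for a in p_first for b in p_second_given_first}
```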

U.S. Birth Days

The CDC also published that in 2013, 78.25% of babies were born on weekdays (Monday through Friday) and 21.75% of babies were born on weekends (Saturday and Sunday). Notably, this distribution is not 5/7 (71.4%) to 2/7 (28.6%), but instead favors weekdays.

Moreover, the CDC published the conditional distributions of 2013 birth times, given a weekday birth (green) and given a weekend birth (gray). The black line is the distribution of all births.


All of the proportions used to generate this chart appear in the birth_day table.

birth_day = Table.read_table('birth_time.csv').drop('Chance')

birth_day.set_format([2, 3], PercentFormatter(1))

Time  Hour  Weekday  Weekend
6 a.m.  6  2.7%  3.6%
7 a.m.  7  4.7%  3.8%
8 a.m.  8  6.7%  4.6%
9 a.m.  9  5.1%  5.0%
10 a.m.  10  5.0%  5.0%
11 a.m.  11  5.0%  4.9%
Noon  12  6.3%  5.0%
1 p.m.  13  5.9%  4.7%
2 p.m.  14  5.3%  4.6%
3 p.m.  15  4.9%  4.6%
... (14 rows omitted)

The Weekday and Weekend columns are the chances of two different conditional distributions.


weekday = birth_day.select(['Hour', 'Weekday']).relabeled(1, 'Chance')

weekend = birth_day.select(['Hour', 'Weekend']).relabeled(1, 'Chance')

The joint distribution is created by multiplying each conditional chance column by the corresponding probability of the conditioning event. In this case, we multiply weekday chances by the chance that the baby was born on a weekday (78.25%); likewise, we multiply weekend chances by the chance that the baby was born on a weekend (21.75%).

birth_joint = Table(['Day', 'Hour', 'Chance'])

for row in weekday.rows:
    birth_joint.append(['Weekday', row.item('Hour'), row.item('Chance') * 0.7825])

for row in weekend.rows:
    birth_joint.append(['Weekend', row.item('Hour'), row.item('Chance') * 0.2175])

birth_joint.set_format('Chance', PercentFormatter(1))

Day  Hour  Chance
Weekday  6  2.1%
Weekday  7  3.7%
Weekday  8  5.2%
Weekday  9  4.0%
Weekday  10  3.9%
Weekday  11  3.9%
Weekday  12  4.9%
Weekday  13  4.6%
Weekday  14  4.1%
Weekday  15  3.8%
... (38 rows omitted)

This joint distribution has 48 joint outcomes, and the total chance sums to 1.

P(birth_joint)

1.0


Now, we can compute the probability of any event that involves days and hours. For example, the chance of a weekday morning birth is quite high.

P(birth_joint.where('Day', 'Weekday').where('Hour', are.between(8, 12)))

0.17058499999999999

We have seen that less than 22% of babies are born on weekends. However, if we know that a baby was born between 5 a.m. and 6 a.m., then the chances that it is a weekend baby are quite a bit higher than 22%.

early_morning = birth_joint.where('Hour', 5)

early_morning

Day  Hour  Chance
Weekday  5  2.0%
Weekend  5  0.8%

P(given(early_morning).where('Day', 'Weekend'))

0.28584466551063248

Bayes' Rule

In the final example above, we began with a distribution over weekday and weekend births, $P(\text{day})$, along with conditional distributions over hours, $P(\text{hour} \mid \text{day})$. We were able to compute the probability of a weekend birth conditioned on an hour, an outcome in the distribution $P(\text{day} \mid \text{hour})$. Bayes' rule is the formula that expresses the relationship between all of these quantities:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$


The numerator on the right-hand side is just the joint probability of A and B. Bayes' rule writes this probability in its expanded form because so often we are given these two components and must form the joint distribution through multiplication.

The denominator on the right-hand side is the probability of an event that can be computed from the joint distribution. For example, the early_morning event in the example above is an event just about hours, but it includes all joint outcomes where the correct hour occurs.

Each probability in this equation has a name. The names are derived from the most typical application of Bayes' rule, which is to update one's beliefs based on new evidence.

- $P(A)$ is called the prior; it is the probability of event A before any evidence is observed.
- $P(B \mid A)$ is called the likelihood; it is the conditional probability of the evidence event B given the event A.
- $P(B)$ is called the evidence; it is the probability of the evidence event B for any outcome.
- $P(A \mid B)$ is called the posterior; it is the probability of event A after evidence event B is observed.

Depending on the joint distribution of A and B, observing some B can make A more or less likely.

Diagnostic Example

In a population, there is a rare disease. Researchers have developed a medical test for the disease. Mostly, the test correctly identifies whether or not the tested person has the disease. But sometimes, the test is wrong. Here are the relevant proportions.

- 1% of the population has the disease.
- If a person has the disease, the test returns the correct result with chance 99%.
- If a person does not have the disease, the test returns the correct result with chance 99.5%.

One person is picked at random from the population. Given that the person tests positive, what is the chance that the person has the disease?

We begin by partitioning the population into four categories in the tree diagram below.


By Bayes' Rule, the chance that the person has the disease given that he or she has tested positive is the chance of the top "Test Positive" branch relative to the total chance of the two "Test Positive" branches. The answer is:

# The person is picked at random from the population.
# By Bayes' Rule:
# Chance that the person has the disease, given that test was +
(0.01 * 0.99) / (0.01 * 0.99 + 0.99 * 0.005)

0.6666666666666666

This is 2/3, and it seems rather small. The test has very high accuracy, 99% or higher. Yet is our answer saying that if a patient gets tested and the test result is positive, there is only a 2/3 chance that he or she has the disease?

To understand our answer, it is important to recall the chance model: our calculation is valid for a person picked at random from the population. Among all the people in the population, the people who test positive split into two groups: those who have the disease and test positive, and those who don't have the disease and test positive. The latter group is called the group of false positives. The proportion of true positives is twice as high as that of the false positives – 0.0099 compared to 0.00495 – and hence the chance of a true positive given a positive test result is 2/3. The chance is affected both by the accuracy of the test and also by the probability that the sampled person has the disease.
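The same update can be packaged as a small plain-Python function (a sketch, not the book's code), using the proportions listed above:

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses.

    prior:      {hypothesis: P(A)}
    likelihood: {hypothesis: P(evidence | A)} for the observed evidence
    Returns     {hypothesis: P(A | evidence)}.
    """
    joint = {a: prior[a] * likelihood[a] for a in prior}   # P(A, B)
    evidence = sum(joint.values())                         # P(B)
    return {a: joint[a] / evidence for a in joint}

prior = {'Diseased': 0.01, 'Not Diseased': 0.99}
test_positive = {'Diseased': 0.99, 'Not Diseased': 0.005}  # P(+ | A)

after_positive = posterior(prior, test_positive)
# after_positive['Diseased'] is 2/3, matching the tree-diagram answer.
```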

The same result can be computed using a table. Below, we begin with the joint distribution over true health and test results.

rare = Table(['Health', 'Test', 'Chance']).with_rows([
    ['Diseased', 'Positive', 0.01 * 0.99],
    ['Diseased', 'Negative', 0.01 * 0.01],
    ['Not Diseased', 'Positive', 0.99 * 0.005],
    ['Not Diseased', 'Negative', 0.99 * 0.995]
])

rare

Health        Test      Chance
Diseased      Positive  0.0099
Diseased      Negative  0.0001
Not Diseased  Positive  0.00495
Not Diseased  Negative  0.98505

The chance that a person selected at random is diseased, given that they tested positive, is computed from the following expression.

positive = rare.where('Test', 'Positive')

P(given(positive).where('Health', 'Diseased'))

0.66666666666666663
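The same conditioning can be sketched without the table helpers, using a plain Python dictionary for the joint distribution: restrict to the positive outcomes, then take the diseased chance relative to the total positive chance. Names here are illustrative only.

```python
# Joint distribution over (health, test result), as in the table above.
joint = {
    ('Diseased', 'Positive'): 0.01 * 0.99,
    ('Diseased', 'Negative'): 0.01 * 0.01,
    ('Not Diseased', 'Positive'): 0.99 * 0.005,
    ('Not Diseased', 'Negative'): 0.99 * 0.995,
}

# Keep only the outcomes where the test is positive ...
positive = {k: v for k, v in joint.items() if k[1] == 'Positive'}

# ... and take the diseased chance relative to the total positive chance.
p_disease_given_positive = (
    positive[('Diseased', 'Positive')] / sum(positive.values())
)
print(p_disease_given_positive)
```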

But a patient who goes to get tested for a disease is not well modeled as a random member of the population. People get tested because they think they might have the disease, or because their doctor thinks so. In such a case, saying that their chance of having the disease is 1% is not justified; they are not picked at random from the population.

So, while our answer is correct for a random member of the population, it does not answer the question for a person who has walked into a doctor's office to be tested.


To answer the question for such a person, we must first ask ourselves what is the probability that the person has the disease. It is natural to think that it is larger than 1%, as the person has some reason to believe that he or she might have the disease. But how much larger?

This cannot be decided based on relative frequencies. The probability that a particular individual has the disease has to be based on a subjective opinion, and is therefore called a subjective probability. Some researchers insist that all probabilities must be relative frequencies, but subjective probabilities abound. The chance that a candidate wins the next election, the chance that a big earthquake will hit the Bay Area in the next decade, the chance that a particular country wins the next soccer World Cup: none of these are based on relative frequencies or long-run frequencies. Each one contains a subjective element.

It is fine to work with subjective probabilities as long as you keep in mind that there will be a subjective element in your answer. Be aware also that different people can have different subjective probabilities of the same event. For example, the patient's subjective probability that he or she has the disease could be quite different from the doctor's subjective probability of the same event. Here we will work from the patient's point of view.

Suppose the patient assigned a number to his/her degree of uncertainty about whether he/she had the disease, before seeing the test result. This number is the patient's subjective prior probability of having the disease.

If that probability were 10%, then the probabilities on the left side of the tree diagram would change accordingly, with the 0.1 and 0.9 now interpreted as subjective probabilities:

The change has a noticeable effect on the answer, as you can see by running the cell below.


# Subjective prior probability of 10% that the person has the disease

# By Bayes' Rule:
# Chance that the person has the disease, given that the test was +
(0.1 * 0.99) / (0.1 * 0.99 + 0.9 * 0.005)

0.9565217391304347

If the patient's prior probability of having the disease is 10%, then after a positive test result the patient must update that probability to over 95%. This updated probability is called a posterior probability. It is calculated after learning the test result.

If the patient's prior probability of having the disease is 50%, then the result changes yet again.


# Subjective prior probability of 50% that the person has the disease

# By Bayes' Rule:
# Chance that the person has the disease, given that the test was +
(0.5 * 0.99) / (0.5 * 0.99 + 0.5 * 0.005)

0.9949748743718593

Starting out with a 50-50 subjective chance of having the disease, the patient must update that chance to about 99.5% after getting a positive test result.

Computational Note. In the calculation above, the factor of 0.5 is common to all the terms and cancels out. Hence arithmetically it is the same as a calculation where the prior probabilities are apparently missing:

0.99 / (0.99 + 0.005)

0.9949748743718593

But in fact, they are not missing. They have just canceled out. When the prior probabilities are not all equal, then they are all visible in the calculation, as we have seen earlier.
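The dependence of the answer on the prior is easy to see if the Bayes' Rule calculation is written as a function of the prior; the sketch below is a plain Python function (not part of the book's toolkit) evaluated at the three priors used in this section.

```python
def posterior(prior):
    """Chance of disease given a positive test, by Bayes' Rule,
    for a given prior chance of having the disease."""
    true_positive = prior * 0.99            # diseased, test correct
    false_positive = (1 - prior) * 0.005    # not diseased, test errs
    return true_positive / (true_positive + false_positive)

# Random member of the population, then the 10% and 50-50 priors.
for prior in [0.01, 0.1, 0.5]:
    print(prior, posterior(prior))
```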

In machine learning applications such as spam detection, Bayes' Rule is used to update probabilities of messages being spam, based on new messages being labeled Spam or Not Spam. You will need more advanced mathematics to carry out all the calculations. But the fundamental method is the same as what you have seen in this section.
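As an illustration only, the same one-step update can be sketched for spam: all of the numbers below are invented for the example, not taken from any real data. Suppose 20% of messages are spam, a particular word appears in 10% of spam messages, and in 0.1% of non-spam messages.

```python
# Invented numbers, for illustration only.
prior_spam = 0.20            # assumed chance a message is spam
p_word_given_spam = 0.10     # assumed chance the word appears in spam
p_word_given_not_spam = 0.001  # assumed chance it appears otherwise

# By Bayes' Rule: posterior chance that a message containing
# the word is spam.
posterior_spam = (prior_spam * p_word_given_spam) / (
    prior_spam * p_word_given_spam
    + (1 - prior_spam) * p_word_given_not_spam
)
print(posterior_spam)
```

Just as with the disease test, a modest prior (20%) combined with the evidence of the word pushes the posterior above 96%.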

