coursera chapter 1

Chapter 1 Statistics: the science of data -- collecting, classifying, summarizing, organizing, analyzing, presenting, and interpreting. coursera

Post on 03-Feb-2022

Section 1.1

The Structure of Data

Cases and Variables

We obtain information about cases or units.

A variable is any characteristic that is recorded for each case.

Generally, each case makes up a row in a dataset, and each variable makes up a column.

Countries of the World

Country LandArea Population Density GDP Rural CO2 PumpPrice Military

Afghanistan 652.86 30.552 46.8 665 74.1 35.3 1.28 8.65
Albania 27.40 2.897 105.7 4460 44.6 12.8 1.81
Algeria 2381.74 39.208 16.5 5361 30.5 24.6 0.29
American Samoa 0.20 0.055 275.0 12.7
Andorra 0.47 0.079 168.1 13.8 9.5 1.67
Angola 1246.70 21.472 17.2 5783 57.5 44.8 0.63 13.81
Antigua and Barbuda 0.4 0.090 204.5 13342 75.4 16.6
Argentina 2736.690 41.446 15.1 14715 8.5 16.9 1.46
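As a rough illustration of the row-and-column structure, the sketch below stores the first three rows of the table as Python dictionaries: each dictionary is a case and each key is a variable. The units are an assumption, since the table does not state them (land area in thousands of square kilometers, population in millions). Under that assumption, the Density column can be recomputed from the other two.

```python
# Each dictionary is one case (a country); each key is one variable.
# Values are copied from the first three rows of the table above.
countries = [
    {"Country": "Afghanistan", "LandArea": 652.86, "Population": 30.552},
    {"Country": "Albania", "LandArea": 27.40, "Population": 2.897},
    {"Country": "Algeria", "LandArea": 2381.74, "Population": 39.208},
]


def density(case):
    """People per square kilometer.

    Assumes LandArea is in thousands of sq km and Population is in
    millions (an assumption; the table does not state its units).
    """
    return case["Population"] * 1000 / case["LandArea"]


for case in countries:
    print(case["Country"], round(density(case), 1))
```

Rounded to one decimal place, the computed values agree with the Density column for these three rows, which suggests the assumed units are at least plausible.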

Intro Statistics Survey Data

Applying the definition

Consider the above datasets:
– What are the cases?
– What are the variables?
– What interesting questions could they help you answer?

Quantitative vs. Qualitative Data

• Quantitative Data: measurements recorded on a naturally occurring numerical scale

• Categorical (or Qualitative) Data: measurements that divide cases into groups (cannot be measured on a numerical scale)

Is there numerical data that is not quantitative?

Data can be used to answer interesting questions!

Using Data to Answer a Question

1. Can eating a yogurt a day cause you to lose weight?

2. Do males find females more attractive if they wear red?

3. Does louder music cause people to drink more beer?

4. Are lions more likely to attack after a full moon?

(The answer to all of these questions is yes!)

Variables

For each of the following situations:
– What are the variables?
– Is each variable categorical or quantitative?

1. Can eating a yogurt a day cause you to lose weight?

2. Do males find females more attractive if they wear red?

3. Does louder music cause people to drink more beer?

4. Are lions more likely to attack after a full moon?

Explanatory and Response

If we are using one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable.

Examples:
• Does meditation help reduce stress?
• Does sugar consumption increase hyperactivity?

Variables

For each of the following situations:
– Which is the explanatory and which is the response variable?

1. Can eating a yogurt a day cause you to lose weight?

2. Do males find females more attractive if they wear red?

3. Does louder music cause people to drink more beer?

4. Are lions more likely to attack after a full moon?

Summary

• Data are everywhere, and pertain to a wide variety of topics

• A dataset is usually comprised of variables measured on cases

• Variables are either categorical or quantitative

• Data can be used to provide information about essentially anything we are interested in and want to collect data on!

Section 1.2

Sampling from a Population

Sample versus Population

A population includes all individuals or objects of interest.

A sample is all the cases that we have collected data on (a subset of the population).

Statistical inference is the process of using data from a sample to gain information about the population.

The Big Picture

[Diagram: sampling goes from the population to the sample; statistical inference goes from the sample back to the population]

Definitions

• Population: all individuals or objects of interest.
• Variable: a characteristic of an individual unit.
• Sample: all the cases that we have collected data on (a subset of the population).

Example: I want to estimate what proportion of UP students are left-handed.

• How could I do that?
• Determine each of the above (population, variable, sample).

Sampling Bias

• Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way.

• If sampling bias exists, we cannot trust generalizations from the sample to the population

• People are TERRIBLE at selecting a good sample, even when explicitly trying to avoid sampling bias!

Dewey Defeats Truman?

Famous example of sampling bias leading to an incorrect prediction

Sampling

[Diagram: samples drawn from the population]

GOAL: Select a sample that is similar to the population, only smaller

Random Sampling

• How can we make sure to avoid sampling bias?

• Imagine putting the names of all the units of the population into a hat, and drawing out names at random to be in the sample

• More often, we use technology

Take a RANDOM sample!

Simple Random Sample

In a simple random sample, each unit of the population has the same chance of being selected, regardless of the other units chosen for the sample

* more complicated random sampling schemes exist
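One way to "use technology" for the names-in-a-hat idea is sketched below with Python's standard `random` module. The roster of 500 student IDs and the sample size of 25 are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical population: 500 student IDs (made up for illustration).
population = [f"student_{i:03d}" for i in range(500)]

rng = random.Random(2022)  # fixed seed so the draw is repeatable

# random.sample draws without replacement: every unit has the same chance
# of being selected, and every possible group of 25 is equally likely.
sample = rng.sample(population, k=25)

print(sample[:5])
```

Fixing the seed is only a convenience for reproducing the same draw; a real study would typically record the seed or the resulting list rather than rerun the selection.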

Realities of Sampling

• While a random sample is ideal, often it isn’t feasible. A list of the entire population may not be available, or it may be impossible or too difficult to contact all members of the population.

• Sometimes, your population of interest has to be altered to something more feasible to sample from. Generalization of results is limited to the population that was actually sampled from.

• In practice, think hard about potential sources of sampling bias, and try your best to avoid them

Non-Random Samples

Suppose you want to estimate the average number of hours that students spend studying each week. Which of the following is the best method of sampling?

a) Go to the library and ask all the students there how much they study
b) Email all students asking how much they study, and use all the data you get
c) Stand on the quad and ask everyone walking by how much they study

Bad Methods of Sampling

a) Go to the library and ask all the students there how much they study

• Sampling units based on something obviously related to the variable(s) you are studying

b) Email all students asking how much they study, and use all the data you get

• Letting your sample be comprised of whoever chooses to participate (volunteer bias)
• People who choose to participate or respond are probably not representative of the entire population
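The library example can be illustrated with a small simulation. Everything in this sketch is made up: the study hours are simulated, and the chance of finding a student in the library is assumed to be proportional to the hours that student studies. Note that `random.choices` draws with replacement, which is a simplification.

```python
import random

rng = random.Random(1)


def mean(values):
    return sum(values) / len(values)


# Hypothetical population: weekly study hours for 10,000 students
# (simulated values, not real data).
hours = [max(0.0, rng.gauss(15, 6)) for _ in range(10_000)]

# Simple random sample of 200 students.
srs = rng.sample(hours, k=200)

# "Library" sample: selection chance is assumed proportional to hours
# studied, so heavy studiers are overrepresented.
library = rng.choices(hours, weights=hours, k=200)

print("population mean:", round(mean(hours), 1))
print("random sample mean:", round(mean(srs), 1))
print("library sample mean:", round(mean(library), 1))
```

Under these assumptions the library sample tends to overestimate the average, while the random sample lands close to the population value.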

Alcohol, Marijuana, and Driving

• The Federal Office of Road Safety in Australia conducted a study on the effects of alcohol and marijuana on performance
• Volunteers who responded to advertisements for the study on rock radio stations were given a random combination of the two drugs, then their performance was observed
– What is the sample? What is the population?
– Is there sampling bias?
– Will the results be informative and/or do you think the study is worth conducting?

Source: Chesher, G., Dauncey, H., Crawford, J. and Horn, K., “The Interaction between Alcohol and Marijuana: A Dose Dependent Study on the Effects of Human Moods and Performance Skills,” Report No. C40, Federal Office of Road Safety, Federal Department of Transport, Australia, 1986.

Data Collection and Bias

[Diagram: population, sample, and data, annotated with two questions: Sampling bias? Other forms of bias?]

Other Forms of Bias

• Even with a random sample, data can still be biased, especially when collected on humans
• Other forms of bias to watch out for in data collection:
– Question wording
– Context
– Inaccurate responses

Many other possibilities – examine the specifics of each study!

Question Wording

• A random sample was asked: “Should there be a tax cut, or should money be used to fund new government programs?”

Tax Cut: 60% Programs: 40%

• A different random sample was asked: “Should there be a tax cut, or should money be spent on programs for education, the environment, health care, crime-fighting, and military defense?”

Tax Cut: 22% Programs: 78%

Inaccurate Responses

• In a study on US students, 93% of the sample said they were in the top half of the sample regarding driving skill

Svenson, O. (February 1981). "Are we all less risky and more skillful than our fellow drivers?" Acta Psychologica 47 (2): 143–148.

• From a random sample of all US college students, 22.7% reported using illicit drugs. Do you think this number is accurate?

Substance Abuse and Mental Health Services Administration (2010). “Results from the 2009 National Survey on Drug Use and Health: Volume 1.” Summary of National Findings (Office of Applied Studies, NSDUH Series H-38A, HHS Publication No. SMA 10-4856Findings). Rockville, MD, https://nsduhweb.rti.org/

Summary

Always think critically about how the data were collected, and recognize that not all forms of data collection lead to valid inferences

• This is the easiest way to instantly become a more statistically literate individual!

Section 1.3

Experiments and Observational Studies

Association and Causation

Two variables are associated if values of one variable tend to be related to values of the other variable

Two variables are causally associated if changing the value of the explanatory variable influences the value of the response variable

Explanatory, Response, Causation

For each of the following headlines:
– Identify the explanatory and response variables (if appropriate).
– Does the headline imply a causal association?

1. “Daily Exercise Improves Mental Performance”

2. “Want to lose weight? Eat more fiber!”

3. “Cat owners tend to be more educated than dog owners”

TVs and Life Expectancy

[Scatterplot: life expectancy versus TVs per 1000 people, one labeled point per country; r = 0.74]

Should you buy more TVs to live longer?

Association does not imply causation!

Confounding Variable

A third variable that is associated with both the explanatory variable and the response variable is called a confounding variable

• A confounding variable can offer a plausible explanation for an association between the explanatory and response variables

• Whenever confounding variables are present (or may be present), a causal association cannot be determined

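Confounding Variable

[Diagram: the confounding variable is linked to both the explanatory variable and the response variable]

TVs and Life Expectancy

[Diagram: wealth (confounding variable) is linked to both number of TVs per capita and life expectancy]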
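A small simulation can show how a confounder produces an association without any causal link. The numbers below are invented and are not the data from the plot: a hypothetical wealth score drives both TV ownership and life expectancy, and TVs have no effect on life expectancy at all.

```python
import random

rng = random.Random(0)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def tv_count(wealth):
    # TVs per 1000 people depend on wealth plus noise (made-up model).
    return max(0.0, 150 + 650 * wealth + rng.gauss(0, 50))


def life_expectancy(wealth):
    # Life expectancy depends on wealth plus noise, never on TVs.
    return 45 + 35 * wealth + rng.gauss(0, 4)


# 200 simulated countries with wealth scores between 0 and 1.
wealth = [rng.uniform(0, 1) for _ in range(200)]
r_all = pearson_r([tv_count(w) for w in wealth],
                  [life_expectancy(w) for w in wealth])

# Hold wealth fixed at a single value: only independent noise remains.
r_fixed = pearson_r([tv_count(0.5) for _ in range(200)],
                    [life_expectancy(0.5) for _ in range(200)])

print(round(r_all, 2), round(r_fixed, 2))
```

In this made-up model the overall correlation is strong, but it is close to zero once wealth is held fixed, which is the pattern a confounding variable would produce.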

Confounding Variable

For each of the following relationships, identify a possible confounding variable:

1. More ice cream sales have been linked to more deaths by drowning.

2. The total amount of beef consumed and the total amount of pork consumed worldwide are closely related over the past 100 years.

3. People who own a yacht are more likely to buy a sports car.

4. Air pollution is higher in places with a higher proportion of paved ground relative to grassy ground.

5. People with shorter hair tend to be taller.

Experiment vs Observational Study

An observational study is a study in which the researcher does not actively control the value of any variable, but simply observes the values as they naturally exist

An experiment is a study in which the researcher actively controls one or more of the explanatory variables

Observational Studies

• There are almost always confounding variables in observational studies

• Observational studies can almost never be used to establish causation

Kindergarten and Crime

• Does Kindergarten Lead to Crime?
• Yes, according to research conducted by New Hampshire state legislator Bob Kingsbury
• “Kingsbury (R-Laconia), 86, recently claimed that analyses he’s been carrying out since 1996 show that communities in his state that have kindergarten programs have up to 400% more crime than localities whose classrooms are free of finger-painting 5-year-olds. Pointing to his hometown of Laconia, the largest of 10 communities in Belknap County, the legislator noted that it has the only kindergarten program in the county and the most crime, including most or all of the county’s rapes, robberies, assaults and murders.”

Szalavitz, M. “Does Kindergarten Lead to Crime? Fact-Checking N.H. Legislator’s ‘Research’,” healthland.time.com, 7/6/12.

Texas GOP Platform

• A few days later, the Texas GOP 2012 Platform announced that it opposed early childhood education

• Causation or just association?

Source: Strauss, V. “Texas GOP rejects ‘critical thinking’ skills. Really.” www.washingtonpost.com, 7/9/12.

http://www.businessweek.com/magazine/correlation-or-causation-12012011-gfx.html
Data from Facebook and Bloomberg

http://www.businessweek.com/magazine/correlation-or-causation-12012011-gfx.html
Data from US Social Security Administration and National Housing Finance Agency

It’s a Common Mistake!

“The invalid assumption that correlation implies cause is probably among the two or three most serious and common errors of human reasoning.”

- Stephen Jay Gould

http://xkcd.com/552/

Randomization

• How can we make sure to avoid confounding variables?

RANDOMLY assign values of the explanatory variable

Randomized Experiment

In a randomized experiment the explanatory variable for each unit is determined randomly, before the response variable is measured

Randomized Experiment

• The different levels of the explanatory variable are known as treatments

• Randomly divide the units into groups, and randomly assign a different treatment to each group

• If the treatments are randomly assigned, the treatment groups should all look similar

Randomized Experiments

• Because the explanatory variable is randomly assigned, it is not associated with any other variables. Confounding variables are eliminated!!!

[Diagram: explanatory variable, response variable, and confounding variable, labeled RANDOMIZED EXPERIMENT]
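Random assignment of treatments can be sketched in a few lines. The 20 numbered units and the two treatment labels below are hypothetical; shuffling and splitting is one common way to do this, not the only one.

```python
import random

rng = random.Random(7)

# Hypothetical units: 20 participants identified by number.
units = list(range(1, 21))

# Shuffle a copy, then split it in half, so that chance alone decides
# which treatment each unit receives.
shuffled = units[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
groups = {
    "treatment A": sorted(shuffled[:half]),
    "treatment B": sorted(shuffled[half:]),
}

for treatment, members in groups.items():
    print(treatment, members)
```

Because the split ignores every characteristic of the units, any other variable should be balanced across the two groups on average.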

Randomized Experiments

• If a randomized experiment yields a significant association between the two variables, we can establish causation from the explanatory to the response variable

Randomized experiments are very powerful! They allow you to infer causality.

Exercise and the Brain

• A study found that elderly people who walked at least a mile a day had significantly higher brain volume (gray matter related to reasoning) and significantly lower rates of Alzheimer’s and dementia compared to those who walked less

• The article states: “Walking about a mile a day can increase the size of your gray matter, and greatly decrease the chances of developing Alzheimer's disease or dementia in older adults, a new study suggests.”

• Is this conclusion valid?

Allen, N. “One way to ward off Alzheimer’s: Take a Hike,” msnbc.com, 10/13/10.

No. Observational study – cannot yield causal conclusions.

Exercise and the Brain

• How would you design an experiment to determine whether exercise actually causes changes in the brain?

Exercise and the Brain

• A sample of mice was divided randomly into two groups. One group was given access to an exercise wheel, the other group was kept sedentary

• “The brains of mice and rats that were allowed to run on wheels pulsed with vigorous, newly born neurons, and those animals then breezed through mazes and other tests of rodent IQ” compared to the sedentary mice

• Is this evidence that exercise causes an increase in brain activity and IQ, at least in mice?

Reynolds, “Phys Ed: Your Brain on Exercise,” NY Times, July 7, 2010.

Yes. Randomized experiment – can yield causal conclusions.

Let’s Try It!

• Is just 5 seconds of exercise enough to increase your pulse rate?

• Treatment groups: exercise versus sedentary
• Randomly divide the class into the two groups
• Give the treatment
• Measure the response (pulse rate)
• We’ll learn how to analyze this later…

Knee Surgery for Arthritis

Researchers conducted a study on the effectiveness of a knee surgery to cure arthritis. It was randomly determined whether people got the knee surgery. Everyone who underwent the surgery reported feeling less pain. Is this evidence that the surgery causes a decrease in pain?

No. Need a control or comparison group. What would happen without surgery?

Control Group

• When determining whether a treatment is effective, it is important to have a comparison group, known as the control group
• It isn’t enough to know that everyone in one group improved, we need to know whether they improved more than they would have improved without the surgery
• All randomized experiments need either a control group, or two different treatments to compare

Knee Surgery for Arthritis

• In the knee surgery study, those in the control group received a fake knee surgery. They were put under and cut open, but the doctor did not actually perform the surgery. All of these patients also reported less pain!
• In fact, the improvement was indistinguishable between those receiving the real surgery and those receiving the fake surgery!

Source: “The Placebo Prescription,” NY Times Magazine, 1/9/00.

Placebo Effect

• Often, people will experience the effect they think they should be experiencing, even if they aren’t actually receiving the treatment

• Example: Eurotrip

• This is known as the placebo effect
• One study estimated that 75% of the effectiveness of anti-depressant medication is due to the placebo effect

• For more information on the placebo effect (it’s pretty amazing!) read The Placebo Prescription

Placebo and Blinding

• Control groups should be given a placebo, a fake treatment that resembles the active treatment as much as possible
• Using a placebo is only helpful if participants do not know whether they are getting the placebo or the active treatment
• If possible, randomized experiments should be double-blinded: neither the participants nor the researchers involved should know which treatment the patients are actually getting

Types of Randomized Experiments

• Randomizing cases into different treatment groups is called a randomized comparative experiment

• We can also give each treatment to each case, and just randomize the order in which treatments are received: matched pairs experiment

• Both are valid randomized experiments!

Matched Pairs

Example: To see if people read faster on paper or a Kindle, a study was done in which 16 people read two sets of instructions of similar length, one on a Kindle and one on paper. The order in which they read the instructions was randomized. (Reading was faster on paper.)
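Randomizing the order in a matched pairs design might look like the sketch below. The reader IDs are hypothetical, and the scheme shown (an independent random order for each reader) is only one possibility; the source does not say how the study randomized the order.

```python
import random

rng = random.Random(16)

# 16 readers, as in the example; the IDs are hypothetical.
readers = [f"reader_{i:02d}" for i in range(1, 17)]

# Every reader receives both treatments; only the order is left to chance.
orders = {}
for reader in readers:
    order = ["Kindle", "paper"]
    rng.shuffle(order)
    orders[reader] = tuple(order)

for reader, order in orders.items():
    print(reader, "->", " then ".join(order))
```

A study that wants exactly eight readers in each order could instead shuffle a list of eight "Kindle first" and eight "paper first" labels and hand them out in sequence.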

Why not always randomize?

• Randomized experiments are ideal, but sometimes not ethical or possible

• Often, you have to do the best you can with data from observational studies

• Example: research for the Supreme Court case as to whether preferences for minorities in university admissions help or hurt the minority students

Randomization in Data Collection

Was the sample randomly selected?
– Yes: Possible to generalize to the population
– No: Should not generalize to the population

Was the explanatory variable randomly assigned?
– Yes: Possible to make conclusions about causality
– No: Cannot make conclusions about causality

Two Fundamental Questions in Data Collection

[Diagram: population, sample, and data, annotated with two questions: Random sample??? Randomized experiment???]

Randomization

• Doing a randomized experiment on a random sample is ideal, but rarely achievable

• If the focus of the study is using a sample to estimate a statistic for the entire population, you need a random sample, but do not need a randomized experiment (example: election polling)

• If the focus of the study is establishing causality from one variable to another, you need a randomized experiment and can settle for a non-random sample (example: drug testing)
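The two questions about randomization can be written as a tiny lookup. The function name and the wording of its return values are illustrative choices, not something the slides define; the two calls mirror the polling and drug-testing examples above.

```python
def allowed_conclusions(random_sample: bool, random_assignment: bool) -> dict:
    """Map the two data-collection questions to the conclusions they support."""
    return {
        "generalize to population": "possible" if random_sample else "should not",
        "conclude causality": "possible" if random_assignment else "cannot",
    }


# Election poll: random sample, nothing is assigned to the respondents.
print(allowed_conclusions(random_sample=True, random_assignment=False))

# Drug trial on volunteers: non-random sample, random assignment.
print(allowed_conclusions(random_sample=False, random_assignment=True))
```

The two answers are independent of each other, which is why a study can support one kind of conclusion without supporting the other.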

Summary (1.3)

• Association does not imply causation!

• In observational studies, confounding variables almost always exist, so causation cannot be established

• Randomized experiments involve randomly determining the level of the explanatory variable

• Randomized experiments prevent confounding variables, so causality can be inferred

• A control or comparison group is necessary

• The placebo effect exists, so a placebo and blinding should be used