
UTeach CS Principles    Unit 5: Big Data

UNIT TOPIC: Data Analysis Data Mining

You will investigate the use of data mining in the discovery of patterns in large data sets.

You will apply association rule mining to discover knowledge in data sets.

UTeach Computer Science, http://uteachcs.org    ©2016 The University of Texas at Austin


Data Mining

Traditional ore mining begins with an exploration (prospecting) of a resource pool (stone), and proceeds to determine whether usable resources exist (ore) and to what degree. Prospectors basically have an idea of what they are looking for, and they run small tests to see if they are correct. Sometimes they strike gold; other times they strike out. Alongside the physical mines that bring us everything from coal to diamonds, we now have a new type of mining: data mining.

Data mining is the discovery of patterns in large data sets. Like ore mining, data mining begins with an exploration (analysis) of a resource pool (data), and proceeds to determine whether usable resources exist (correlations) and to what degree (how strong they are). Not all data miners "strike it rich." Like ore mining, data mining can result in the observation of no useful patterns. However, like ore mining, sometimes data mining leads to a bonanza of useful information.

In data mining, the emphasis is on the discovery of new knowledge. Data miners want to find new patterns that were previously unobserved. They use statistical analysis of big data to discover what the human eye can't see, just like an ore miner might use a pick, dynamite, or lab test to uncover ore that was not visible to the naked eye before. This is a form of exploratory data analysis rather than statistical hypothesis testing.

Data Mining Strategies

Data mining involves six common classes of tasks, listed below, along with examples of how these strategies can be used in recommender systems, such as those used by Netflix, Pandora, Amazon, http://www.whatshouldireadnext.com/, and many other content providers. In each of the descriptions below, a Netflix-related example of its usage is given:

Anomaly detection (Outlier/change/deviation detection): The identification of unusual data records that might be interesting, or might simply be data errors that require further investigation.

Movie X is unlike any of the other movies in User Y's data set. Remove it from our calculations. (Example: The Texas Chainsaw Massacre is on a list that mostly contains titles such as Teletubbies, Barney and Friends, and Clifford.)

Association rule learning (Dependency modeling): Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

Recommender systems: Users who like Movie X tend to also like Movie Y.


Clustering: The task of discovering groups and structures in the data that are in some way or another "similar," without using known structures in the data.

Dynamically grouped movie categories: "Romantic Comedies in Paris starring former professional football players."

Classification: The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam."

Movie X is a romantic comedy.

Regression: Attempts to find a function that models the data with the least error.

Type X users typically increase their movie consumption rate by four movies per year.

Summarization: Providing a more compact representation of the data set, including visualization and report generation.

What type of movie does User X typically like? (i.e., sum up User X's preferences in Y words)

These strategies all have different purposes, are sometimes more effective on certain data sets and less on others, and oftentimes work best in conjunction with one another. Therefore, there is no one "best" way to perform data mining. Data miners use multiple strategies to uncover patterns and discover new knowledge.
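To make the anomaly detection strategy a little more concrete, here is a minimal sketch in Python. It flags any title whose genre is rare within one user's viewing list. The genre labels, the find_outliers helper, and the 30% cutoff are all assumptions made up for this illustration; a real recommender system would rely on far richer signals than a single genre tag.

```python
from collections import Counter

# Hypothetical viewing history for one user: (title, genre) pairs.
# The genre labels are assumed for illustration only.
history = [
    ("Teletubbies", "children"),
    ("Barney and Friends", "children"),
    ("Clifford", "children"),
    ("The Texas Chainsaw Massacre", "horror"),
]


def find_outliers(history, min_share=0.3):
    """Return titles whose genre makes up less than min_share of the list.

    min_share is an arbitrary cutoff chosen for this sketch.
    """
    genre_counts = Counter(genre for _, genre in history)
    total = len(history)
    return [
        title
        for title, genre in history
        if genre_counts[genre] / total < min_share
    ]


outliers = find_outliers(history)
print(outliers)
```

In this made-up list, the one horror title accounts for only a quarter of the entries, so it is the only title flagged.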

Common misconception: Data mining is often confused with Artificial Intelligence (AI).

Data mining is actually an application of techniques commonly associated with AI. "Machine learning" and "decision support" are standard AI techniques, but when we apply them to "knowledge discovery in databases," we refer to them collectively simply as "tools for data mining."

How much power lies in data mining? Read the following article to see "How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did."


Association Rule Mining

Companies Know What You Buy

French toast is one of America's favorite breakfast foods. It's delicious and can be easily prepared at home using a variety of techniques and toppings. Even though it can be prepared a number of ways, almost all French toast recipes call for at least three things:

1. bread
2. milk
3. eggs

If you're going to make French toast, you're going to need bread, you're going to need milk, and you're going to need eggs. What does French toast have to do with big data?

Association Rule Mining

An association rule is a link between one set of items and another. Specifically, association rules identify instances in which the appearance of one set of items (the antecedent) implies that another set of items (the consequent) will also appear.

For example:

{X, Y} ⇒ {Z}

This rule can be read as, “If the antecedents (X and Y) appear, then it is likely that the consequent (Z) will also appear.”
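One way to picture a rule like this in code is as a pair of item sets. The sketch below is only an illustration: the Rule class and its method names are invented for this example and do not come from any particular data mining library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A hypothetical container for one association rule."""
    antecedent: frozenset
    consequent: frozenset

    def applies_to(self, basket):
        # The rule is relevant when every antecedent item is present.
        return self.antecedent <= set(basket)

    def holds_in(self, basket):
        # The rule is confirmed when the consequent items appear as well.
        items = set(basket)
        return self.antecedent <= items and self.consequent <= items


rule = Rule(frozenset({"X", "Y"}), frozenset({"Z"}))

print(rule.applies_to(["X", "Y", "W"]))  # antecedent present
print(rule.holds_in(["X", "Y", "W"]))    # consequent missing
```

Keeping "applies to" and "holds in" as separate checks reflects the word "likely" in the rule: a basket can contain X and Y without containing Z.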

By using association rules, we can group items together logically and attempt to make predictions. By tracking each purchase transaction, tabulating them, and then discovering which pairs (or larger groups) of columns often correlate with one another, we can generate association rules that capture these correlations in the data. This applies to French toast preparation.

For example:

If most people who buy milk, bread, and eggs also buy maple syrup, then association rule mining might turn up the following rule:

{milk, bread, eggs} ⇒ {syrup}
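How strong is a rule like this? Two measures commonly used in association rule mining are support (the fraction of all transactions that contain both the antecedent and the consequent) and confidence (the fraction of transactions containing the antecedent that also contain the consequent). This handout does not name these measures, so treat the sketch below as one possible way to put numbers on "most people." The baskets and the function names are invented for illustration.

```python
# Hypothetical shopping baskets, invented for this example.
transactions = [
    {"milk", "bread", "eggs", "syrup"},
    {"milk", "bread", "eggs", "syrup"},
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"eggs", "syrup"},
]


def count_containing(itemset, transactions):
    """Count the transactions that contain every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    both = count_containing(antecedent | consequent, transactions)
    return both / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions with the antecedent, the fraction with the consequent."""
    with_antecedent = count_containing(antecedent, transactions)
    if with_antecedent == 0:
        return 0.0  # the rule never applies, so there is nothing to measure
    both = count_containing(antecedent | consequent, transactions)
    return both / with_antecedent


antecedent = {"milk", "bread", "eggs"}
consequent = {"syrup"}
rule_support = support(antecedent, consequent, transactions)
rule_confidence = confidence(antecedent, consequent, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

With these made-up baskets, the full rule shows up in 2 of the 5 baskets, and in 2 of the 3 baskets that contain milk, bread, and eggs.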

Walmart can now target store patrons who purchase milk, bread, and eggs to gently suggest that they might like to also buy syrup. The computerized storefront (or physical storefront with a layout determined by computational data mining) does not know that these patrons may be making French toast; it has merely developed association rules to guide product placement. This process of association rule mining is basically what is described in "How Target Figured Out a Teen Girl Was Pregnant..."

Instructions

Your group has been hired by DataMarket, a corporation seeking to open a new chain of stores in your region. Their goal is to provide customers with optimal arrangements of store products, in an attempt to minimize the time and effort required to shop.

You will design a mock store product placement scheme, driven by data collection from competitors’ stores in the area. Use the receipts provided by your teacher to (1) generate association rules that map potentially correlated products, and then (2) sketch an endcap for data-driven product placement targeting potential shoppers in the area.

As you extract data from the receipts, consider the following guiding questions:

1. What is the best way to use the provided table to organize your data collection?
2. What trends do you find in the data?
3. Are there any negative associations between products?
4. What is the ideal size for sets of antecedents/consequents?
5. What additional information might be helpful?
6. Can you imagine scenarios in which sets of products are grouped together?
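If you want to check your hand tallies, the counting step can be sketched in a few lines of Python. The receipts below are invented placeholders, not the ones your teacher provides, and the pair_counts and lift helpers are hypothetical names. Lift is a commonly used measure that is not introduced in this handout: values above 1 suggest two products appear together more often than chance would predict, and values below 1 hint at a negative association.

```python
from collections import Counter
from itertools import combinations

# Placeholder receipts, invented for this sketch.
receipts = [
    ["milk", "bread", "eggs", "syrup"],
    ["milk", "bread", "eggs"],
    ["bread", "eggs", "syrup"],
    ["milk", "cereal"],
    ["cereal", "milk", "bread"],
]


def pair_counts(receipts):
    """Count how many receipts contain each pair of products."""
    counts = Counter()
    for receipt in receipts:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(set(receipt)), 2):
            counts[pair] += 1
    return counts


def lift(a, b, receipts):
    """Ratio of observed co-occurrence to what independence would predict."""
    n = len(receipts)
    has_a = sum(1 for r in receipts if a in r)
    has_b = sum(1 for r in receipts if b in r)
    has_both = sum(1 for r in receipts if a in r and b in r)
    if has_a == 0 or has_b == 0:
        return 0.0  # a product that never appears has no measurable lift
    return (has_both * n) / (has_a * has_b)


counts = pair_counts(receipts)
for pair, count in counts.most_common(3):
    print(pair, count)

print("eggs and syrup:", lift("eggs", "syrup", receipts))
print("cereal and eggs:", lift("cereal", "eggs", receipts))
```

In these placeholder receipts, cereal and eggs never appear together, which is the kind of pattern guiding question 3 asks you to look for.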
