cs435 introduction to big data - colorado state universitycs435/slides/week2-a.pdf · cs435...

6
CS435 Introduction to Big Data Spring 2017 Colorado State University 1/22/2018 Week 2-A Sangmi Lee Pallickara 1 1/22/2018 CS435 Introduction to Big Data - Spring 2018 W2.A.0 1/17/2018 CS435 Introduction to Big Data - Spring 2018 W1.A.0 CS435 Introduction to Big Data PART 0. INTRODUCTION TO BIG DATA Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs435 1/22/2018 CS435 Introduction to Big Data - Spring 2018 W2.A.1 1/22/2018 CS435 Introduction to Big Data - Spring 2018 .A.1 FAQs PA0 has been posted Feb. 6, 5:00PM via Canvas Individual submission (No team submission) Accommodation request, honor student Contact me by Jan 26 2018 Readings Reading research papers Keshav's "How to read a paper "How to Read and Understand a Scientific Paper: A Step-by-Step Guide for Non-Scientists" 1/22/2018 CS435 Introduction to Big Data - Spring 2018 W2.A.2 1/22/2018 CS435 Introduction to Big Data - Spring 2018 .A.2 Topics Introduction to Big Data Analytics Data Collection, Sampling, and Preprocessing Introduction to MapReduce 1/22/2018 CS435 Introduction to Big Data - Spring 2018 W2.A.3 1/17/2018 CS435 Introduction to Big Data - Spring 2018 W1.A.3 Part 0. Introduction Big Data Analytics -Data Collection, Sampling, and Preprocessing 1/22/2018 CS435 Introduction to Big Data - Spring 2018 W2.A.4 1/22/2018 CS435 Introduction to Big Data - Spring 2018 .A.4 This Material is Built Based on, Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley 1/22/2018 CS435 Introduction to Big Data - Spring 2018 W2.A.5 1/22/2018 CS435 Introduction to Big Data - Spring 2018 .A.5 Analytics Process Model The most time-consuming step is the data selection and preprocessing step - This is usually around 80% of the total time needed to build an analytical model Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley

Upload: volien

Post on 05-May-2018

227 views

Category:

Documents


2 download

TRANSCRIPT

CS435 Introduction to Big DataSpring 2017 Colorado State University

1/22/2018 Week 2-ASangmi Lee Pallickara

1

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.01/17/2018 CS435IntroductiontoBigData- Spring2018 W1.A.0

CS435IntroductiontoBigData

PART0.INTRODUCTIONTOBIGDATA

SangmiLeePallickaraComputerScience,ColoradoStateUniversityhttp://www.cs.colostate.edu/~cs435

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.11/22/2018 CS435IntroductiontoBigData- Spring2018 .A.1

FAQs

• PA0hasbeenposted• Feb.6,5:00PMviaCanvas• Individualsubmission(Noteamsubmission)

• Accommodationrequest,honorstudent• ContactmebyJan262018

• Readings• Readingresearchpapers• Keshav's "Howtoreadapaper• "HowtoReadandUnderstandaScientificPaper:AStep-by-StepGuideforNon-Scientists"

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.21/22/2018 CS435IntroductiontoBigData- Spring2018 .A.2

Topics

• IntroductiontoBigDataAnalytics• DataCollection,Sampling,andPreprocessing• IntroductiontoMapReduce

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.31/17/2018 CS435IntroductiontoBigData- Spring2018 W1.A.3

Part0.Introduction

BigDataAnalytics-DataCollection,Sampling,andPreprocessing

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.41/22/2018 CS435IntroductiontoBigData- Spring2018 .A.4

ThisMaterialisBuiltBasedon,

• AnalyticsinaBigDataWorld:TheEssentialGuidetoDataScienceandItsApplications,BartBaesens,2014,Wiley

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.51/22/2018 CS435IntroductiontoBigData- Spring2018 .A.5

AnalyticsProcessModel

Themosttime-consumingstepisthedataselectionandpreprocessingstep- Thisisusuallyaround80%ofthetotaltimeneededtobuildananalyticalmodel

AnalyticsinaBigDataWorld:TheEssentialGuidetoDataScienceandItsApplications,BartBaesens,2014,Wiley

CS435 Introduction to Big DataSpring 2017 Colorado State University

1/22/2018 Week 2-ASangmi Lee Pallickara

2

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.61/22/2018 CS435IntroductiontoBigData- Spring2018 .A.6

TypesofAnalytics

• Analyticsisatermthatisoftenusedinterchangeablywith• Datascience• Datamining• Knowledgediscovery

• Predictiveanalytics• Atargetvariableistypicallyavailable• E.g.linear/logisticregression,decisiontrees,neuralnetworks,supportvectormachines

• Descriptiveanalytics• Notargetvariable• e.g.Clustering,associationrules

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.71/22/2018 CS435IntroductiontoBigData- Spring2018 .A.7

TypesofDataSources• Transactions

• Structured,low-level,detailedinformation• Customertransactions

• Purchase,claim,cashtransfer,creditcardpayment• Storedinmassiveonlinetransactionprocessing(OLTP)relationaldatabase• Canbesummarizedoverlongertimehorizons(e.g.averages,relativetrends,Max/Minvalues)

• Unstructureddataembeddedintextdocuments• emails,webpages,claimforms,• Requiresextensivepreprocessing

• Qualitative,expert-baseddata• Requiressubjectmatterexperts’(SME)analysis• Scientificdata

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.81/22/2018 CS435IntroductiontoBigData- Spring2018 .A.8

Sampling

• Takingasubset ofdata foranalytics• Generatinghypothesis• Modelselection• Featureselection• Speculativeprocess• Buildinganalyticsmodel

• Stratifiedsampling• Takingsamplesaccordingtopredefinedstrata• e.g.Frauddetectionwithveryskewed(99percentnon-fraudcustomers,1percentfraudcustomers)• Sampleshouldcontainthesamepercentageoffraudcustomersasintheoriginaldata

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.91/22/2018 CS435IntroductiontoBigData- Spring2018 .A.9

TypesofDataElements

• Continuous• Dataelementsthataredefinedonanintervalthatcanbelimitedorunlimited

• e.g.income,sales,temperature

• CategoricalNominal• Dataelementsthatcanonlytakeonalimitedsetofvalueswithnomeaningfulorderingbetweenthem• e.g.maritalstatus,profession,purposeofloan

• Ordinal• Dataelementsthatcanonlytakeonalimitedsetofvalueswithameaningfulorderingbetweenthem• e.g.creditrating,agecodedasyoung,middleageandold

• Binary• Dataelementsthatcanonlytakeontwovalues

• e.g.Havingchild,allowedtodrive

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.101/22/2018 CS435IntroductiontoBigData- Spring2018 .A.10

MissingValues• Missingvaluescanoccurbecauseofvariousreasons• Theinformationcanbenon-applicable• Theinformationcanbeundisclosed• Theinformationcanbeunavailable

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.111/22/2018 CS435IntroductiontoBigData- Spring2018 .A.11

MissingValues--continued• Replace(impute)

• Replacesthemissingvaluewithacomputed/selectedvalue• Imputationalgorithmexamples

• Hot-deck:replaceswitharandomlyselectedsimilarrecords• Cold-deck:selectsreplacementfromanotherdataset• Meansubstitution:replaceswiththemeanofthatvariableforallothercases• Regression:predictsmissingvaluesofavariablebasedonothervariables.

• Delete• Deletesobservationswithlotsofmissingvalues• Thisassumesthatinformationismissingatrandomandhasnomeaningfulinterpretationand/orrelationshiptothetarget

• Keep• Missingvaluescanbemeaningful

• e.g.acustomerdidnotdisclosetheincomeforcurrentcondition

CS435 Introduction to Big DataSpring 2017 Colorado State University

1/22/2018 Week 2-ASangmi Lee Pallickara

3

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.121/22/2018 CS435IntroductiontoBigData- Spring2018 .A.12

OutliersofDataset

• Outliers areextremeobservationsthatareverydissimilartotherestofthepopulation• Validobservation

• Salaryofboss• Invalidobservation

• Ageis300

• Multivariateoutliers• Observationsthatareoutlyinginmultipledimensions

• e.g:TemperatureinFortCollinsis100degreesbutonamidnightinDecember

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.131/22/2018 CS435IntroductiontoBigData- Spring2018 .A.13

IdentifyingOutliersusingBoxPlots

• Aboxplotrepresentsthreekeyquartilesofthedata• Q1:25%oftheobservationshavealowervalue• Q2:50%oftheobservationshavealowervalue• Q3:75%oftheobservationshavealowervalue• Theminimum andmaximumvaluesareadded

• Toofarawayisnowquantifiedasmorethan1.5xInterquartileRange(IQR = (Q3 – Q1) )

Q3MQ1

Outliers

1.5xIQR

Min

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.141/22/2018 CS435IntroductiontoBigData- Spring2018 .A.14

IdentifyingOutliersusingZ-Score

• Measuringhowmanystandarddeviationsanobservationisawayfromthemean• 𝑧𝑖 =

$%&'(

whereμ representstheaverageofthevariableandσ itsstandarddeviation

• Apracticalruleofthumbthendefinesoutlierswhentheabsolutevalueofthez-score|z|isbiggerthan3

ID Age Z-Score

1 30 (30-40)/10=-1

2 50 (50-40)/10=+1

3 10 (10-40)/10=-3

4 40 (40-40)/10=0

5 60 (60-40)/10=+2

6 80 (80-40)/10=+4

-- .. …

μ =40σ =10

μ =0σ =1

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.151/22/2018 CS435IntroductiontoBigData- Spring2018 .A.15

DealingwithOutliers• Treatoutliersasmissingvalues• Popularschemes• Truncation

• Takingonlyvaluesthatarewithinthelimits• Winsorizing

• Limitingextremevaluestoreducetheeffectofpossiblespuriousoutliers

• {92,19, 101,58, 1053,91,26,78,10,13, -40, 101,86,85,15,89,89,28, -5,41} (N=20,mean =101.5)à {92,19, 101,58, 101,91,26,78,10,13, -5, 101,86,85,15,89,89,28, -5,41} (N=20,mean =55.65)

UsingtheZ-Scoresfortruncation

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.161/22/2018 CS435IntroductiontoBigData- Spring2018 .A.16

StandardizingData

• Scalingvariablestoasimilarrange• e.g.twovariables:educationandincome• Elementaryschool(1),middleschool(2),highschool(3),college(4),graduateschool(5)• Income:0~$5M• Whenbuildinglogisticregressionmodels,thecoefficientforeducationmightbecomeverysmall.

• Min/Maxstandardization• 𝑋𝑛𝑒𝑤 =

-./0&123 -./0145 -./0 &123 -./0

𝑛𝑒𝑤𝑚𝑎𝑥 − 𝑛𝑒𝑤𝑚𝑖𝑛 + 𝑛𝑒𝑤• Wherenewmax andnewmin arethenewlyimposedmaximum andminimum (e.g.1and0)

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.171/22/2018 CS435IntroductiontoBigData- Spring2018 .A.17

StandardizingData.-- continued

• Z-Scorebased• Calculatethez-scores

• Decimalscaling• 𝑋𝑛𝑒𝑤 =

-./0<=>

• Dividingbyapowerof10

• Standardizationisusefulforregression-basedapproaches• Itisnotneededfordecisiontrees

CS435 Introduction to Big DataSpring 2017 Colorado State University

1/22/2018 Week 2-ASangmi Lee Pallickara

4

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.181/17/2018 CS435IntroductiontoBigData- Spring2018 W1.A.18

Part0.Introduction

BigDataAnalytics-BigDataTechnologyStack

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.191/22/2018 CS435IntroductiontoBigData- Spring2018 .A.19

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.201/22/2018 CS435IntroductiontoBigData- Spring2018 .A.20

Inanutshell

DataLayerApacheHDFS,AmazonAWS’sS3,IBMGPFS,MicrosoftAzure

DataProcessingLayerApacheHadoopMapReduce,Pig,ApacheSpark,Cassandra,Storm,Mahout,MLLib,

DataIntegrationLayerApacheFlume,ApacheKafka,ApacheSqoop

OperationsandSchedulingLayerApacheAmbariApacheOozie

ApacheZookeeper

DataPresentationLayerApacheKibana

SecurityandGovernance

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.211/17/2018 CS435IntroductiontoBigData- Spring2018 W1.A.21

Part1.LargeScaleDataAnalytics

IntroductiontoMapReduce

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.221/22/2018 CS435IntroductiontoBigData- Spring2018 .A.22

Thismaterialisdevelopedbasedon,• Anand Rajaraman,JureLeskovec,andJeffreyUllman,“MiningofMassiveDatasets”,CambridgeUniversityPress,2012--Chapter2• DownloadthischapterfromtheCS435schedulepage

• Hadoop:ThedefinitiveGuide,TomWhite,O’Reilly,3rd Edition,2014

• MapReduce DesignPatterns,DonaldMinerandAdamShook,O’Reilly,2013

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.231/17/2018 CS435IntroductiontoBigData- Spring2018 W1.A.23

WhatisMapReduce?

CS435 Introduction to Big DataSpring 2017 Colorado State University

1/22/2018 Week 2-ASangmi Lee Pallickara

5

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.241/22/2018 CS435IntroductiontoBigData- Spring2018 .A.24

MapReduce[1/2]

• MapReduce isinspired bytheconceptsofmap andreduce inLisp.

• “Modern”MapReduce• DevelopedwithinGoogle asamechanismforprocessinglargeamountsofrawdata.

• Crawleddocumentsorwebrequestlogs• Distributesthesedataacrossthousandsofmachines• SamecomputationsareperformedoneachCPUwithdifferentdataset

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.251/22/2018 CS435IntroductiontoBigData- Spring2018 .A.25

MapReduce [2/2]

• MapReduce providesanabstractionthatallowsengineerstoperformsimplecomputationswhilehidingthedetailsofparallelization,datadistribution,loadbalancingandfaulttolerance

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.261/22/2018 CS435IntroductiontoBigData- Spring2018 .A.26

Mapper

• Mappermapsinputkey/valuepairstoasetofintermediatekey/valuepairs• Mapsaretheindividualtasksthattransforminputrecordsintointermediaterecords

• Thetransformedintermediaterecordsdonotneedtobeofthesametypeastheinputrecords

• Agiveninputpairmaymaptozeroormanyoutputpairs

• TheHadoop MapReduce frameworkspawnsonemaptaskforeachInputSplitgeneratedbytheInputFormat forthejob

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.271/22/2018 CS435IntroductiontoBigData- Spring2018 .A.27

Reducer

• Reducerreducesasetofintermediatevalueswhichshareakeytoasmallersetofvalues

• Reducerhas3primaryphases• Shuffle,sortandreduce

• Shuffle• Inputtothereduceristhesortedoutputofthemappers• TheframeworkfetchestherelevantpartitionoftheoutputofallthemappersviaHTTP

• Sort• Theframeworkgroupsinputtothereducerbykeys

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.281/17/2018 CS435IntroductiontoBigData- Spring2018 W1.A.28

MapReduce Example1

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.291/22/2018 CS435IntroductiontoBigData- Spring2018 .A.29

Example1:WordCount [1/5]

• Fortextfilesstoredunderusr/joe/wordcount/input,countthenumberofoccurrencesofeachword• Howdofilesanddirectorylook?

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/ /usr/joe/wordcount/input/file01 /usr/joe/wordcount/input/file02

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01 Hello World, Bye World!

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02 Hello Hadoop, Goodbye to hadoop.

CS435 Introduction to Big DataSpring 2017 Colorado State University

1/22/2018 Week 2-ASangmi Lee Pallickara

6

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.301/22/2018 CS435IntroductiontoBigData- Spring2018 .A.30

Example1:WordCount [2/5]

• RuntheMapReduce application

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount/usr/joe/wordcount/input /usr/joe/wordcount/output

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000 Bye 1 Goodbye 1 Hadoop, 1 Hello 2 World! 1 World, 1 hadoop. 1 to 1

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.311/22/2018 CS435IntroductiontoBigData- Spring2018 .A.31

Example1:WordCount [3/5]

Mappers1. Readaline2. Tokenizethestring3. Passthe

<key,value> outputtothereducer

Reducers1. Collect<key,value> pairs

sharingsamekey2. Aggregatetotalnumberof

occurrences

WhatdoyouhavetopassfromtheMappers?

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.321/22/2018 CS435IntroductiontoBigData- Spring2018 .A.32

Example1:WordCount [4/5]

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

String line = value.toString();StringTokenizer tokenizer = new StringTokenizer(line);

while (tokenizer.hasMoreTokens()) {word.set(tokenizer.nextToken());context.write(word, one);

}}

}

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.331/22/2018 CS435IntroductiontoBigData- Spring2018 .A.33

Example1:WordCount [5/5]

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values, Context context)

throws IOException, InterruptedException {int sum = 0;for (IntWritable val : values) {

sum += val.get();}context.write(key, new IntWritable(sum));

}}

1/22/2018 CS435IntroductiontoBigData- Spring2018 W2.A.341/17/2018 CS435IntroductiontoBigData- Spring2018 W1.A.34

Questions?