CS435 Introduction to Big Data Spring 2017 Colorado State University
1/22/2018 Week 2-A Sangmi Lee Pallickara
1/22/2018 CS435 Introduction to Big Data - Spring 2018 W2.A.0
CS435 Introduction to Big Data
PART 0. INTRODUCTION TO BIG DATA
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435
FAQs
• PA0 has been posted
  • Feb. 6, 5:00 PM via Canvas
  • Individual submission (no team submission)
• Accommodation requests, honor students
  • Contact me by Jan. 26, 2018
• Readings
  • Reading research papers
  • Keshav's "How to Read a Paper"
  • "How to Read and Understand a Scientific Paper: A Step-by-Step Guide for Non-Scientists"
Topics
• Introduction to Big Data Analytics
• Data Collection, Sampling, and Preprocessing
• Introduction to MapReduce
Part 0. Introduction
Big Data Analytics - Data Collection, Sampling, and Preprocessing
This material is built based on:
• Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley
Analytics Process Model
The most time-consuming step is the data selection and preprocessing step
- This is usually around 80% of the total time needed to build an analytical model
Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, Bart Baesens, 2014, Wiley
Types of Analytics
• Analytics is a term that is often used interchangeably with
  • Data science
  • Data mining
  • Knowledge discovery
• Predictive analytics
  • A target variable is typically available
  • e.g., linear/logistic regression, decision trees, neural networks, support vector machines
• Descriptive analytics
  • No target variable
  • e.g., clustering, association rules
Types of Data Sources
• Transactions
  • Structured, low-level, detailed information
  • Customer transactions
    • Purchase, claim, cash transfer, credit card payment
  • Stored in massive online transaction processing (OLTP) relational databases
  • Can be summarized over longer time horizons (e.g., averages, relative trends, max/min values)
• Unstructured data embedded in text documents
  • Emails, web pages, claim forms
  • Requires extensive preprocessing
• Qualitative, expert-based data
  • Requires subject matter experts' (SME) analysis
• Scientific data
Sampling
• Taking a subset of data for analytics
  • Generating hypotheses
  • Model selection
  • Feature selection
  • Speculative process
  • Building an analytics model
• Stratified sampling
  • Taking samples according to predefined strata
  • e.g., fraud detection with very skewed data (99 percent non-fraud customers, 1 percent fraud customers)
  • The sample should contain the same percentage of fraud customers as the original data
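The fraud example can be sketched in a few lines of plain Java. This is only an illustration, not code from the course: the class name `StratifiedSampling`, the customer IDs, and the 10% sampling fraction are all hypothetical. The idea is to sample each stratum separately with the same fraction, so the 99/1 split survives in the sample.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class StratifiedSampling {
    // Draws the same fraction from every stratum, so the sample keeps
    // (up to rounding) the class proportions of the original data.
    static Map<String, List<Integer>> sample(Map<String, List<Integer>> strata,
                                             double fraction, long seed) {
        Random rng = new Random(seed);
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : strata.entrySet()) {
            List<Integer> shuffled = new ArrayList<>(e.getValue());
            Collections.shuffle(shuffled, rng);
            int n = (int) Math.round(shuffled.size() * fraction);
            result.put(e.getKey(), new ArrayList<>(shuffled.subList(0, n)));
        }
        return result;
    }

    // Hypothetical data: 9,900 non-fraud and 100 fraud customer IDs (99% / 1%).
    static Map<String, List<Integer>> exampleData() {
        Map<String, List<Integer>> strata = new LinkedHashMap<>();
        List<Integer> nonFraud = new ArrayList<>();
        List<Integer> fraud = new ArrayList<>();
        for (int id = 0; id < 10_000; id++) {
            if (id < 9_900) nonFraud.add(id); else fraud.add(id);
        }
        strata.put("non-fraud", nonFraud);
        strata.put("fraud", fraud);
        return strata;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> s = sample(exampleData(), 0.10, 42L);
        System.out.println("non-fraud sampled: " + s.get("non-fraud").size());
        System.out.println("fraud sampled: " + s.get("fraud").size());
    }
}
```

With these numbers the sample holds 990 non-fraud and 10 fraud customers, i.e. the fraud share is still 1 percent.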
Types of Data Elements
• Continuous
  • Data elements that are defined on an interval that can be limited or unlimited
  • e.g., income, sales, temperature
• Categorical
  • Nominal
    • Data elements that can only take on a limited set of values with no meaningful ordering between them
    • e.g., marital status, profession, purpose of loan
  • Ordinal
    • Data elements that can only take on a limited set of values with a meaningful ordering between them
    • e.g., credit rating, age coded as young, middle age, and old
  • Binary
    • Data elements that can only take on two values
    • e.g., having a child, allowed to drive
Missing Values
• Missing values can occur for various reasons
  • The information can be non-applicable
  • The information can be undisclosed
  • The information can be unavailable
Missing Values -- continued
• Replace (impute)
  • Replaces the missing value with a computed/selected value
  • Imputation algorithm examples
    • Hot-deck: replaces with the value from a randomly selected similar record
    • Cold-deck: selects the replacement from another dataset
    • Mean substitution: replaces with the mean of that variable over all other cases
    • Regression: predicts missing values of a variable based on other variables
• Delete
  • Deletes observations with lots of missing values
  • This assumes that information is missing at random and has no meaningful interpretation and/or relationship to the target
• Keep
  • Missing values can be meaningful
  • e.g., a customer did not disclose their income because of their current condition
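Of the imputation schemes listed, mean substitution is the simplest to sketch. The plain-Java example below is an illustration under stated assumptions: missing values are encoded as `Double.NaN`, and the class name `MeanImputation` and the income figures are hypothetical.

```java
import java.util.Arrays;

class MeanImputation {
    // Returns a copy in which every missing value (encoded as NaN, an
    // assumption of this sketch) is replaced by the mean of the observed values.
    static double[] imputeMean(double[] values) {
        double sum = 0;
        int observed = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        double[] out = values.clone();
        if (observed == 0) {
            return out; // nothing observed, so there is no mean to substitute
        }
        double mean = sum / observed;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical incomes (in thousands); two customers did not disclose theirs.
        double[] income = {30, Double.NaN, 50, 40, Double.NaN};
        System.out.println(Arrays.toString(imputeMean(income)));
    }
}
```

The observed mean here is 40, so both gaps are filled with 40. Note that this flattens the variance of the variable, which is one reason the "keep" option can be preferable when missingness itself carries information.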
Outliers of Dataset
• Outliers are extreme observations that are very dissimilar to the rest of the population
  • Valid observation
    • Salary of the boss
  • Invalid observation
    • Age is 300
• Multivariate outliers
  • Observations that are outlying in multiple dimensions
  • e.g., the temperature in Fort Collins is 100 degrees, but at midnight in December
Identifying Outliers using Box Plots
• A box plot represents three key quartiles of the data
  • Q1: 25% of the observations have a lower value
  • Q2 (the median, M): 50% of the observations have a lower value
  • Q3: 75% of the observations have a lower value
  • The minimum and maximum values are added
• "Too far away" is now quantified as more than 1.5 x the interquartile range (IQR = Q3 - Q1)
[Figure: box plot marking Min, Q1, M, Q3, the 1.5 x IQR distance, and the outliers]
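The 1.5 x IQR rule can be sketched in plain Java as follows. Two things here are assumptions rather than part of the slide: the distance is measured from the box edges (below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, the usual box-plot convention), and quartiles are computed by linear interpolation between order statistics, which is only one of several conventions in use. The class name and data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IqrOutliers {
    // Quantile by linear interpolation between order statistics.
    // Expects a non-empty array that is already sorted in ascending order.
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns the observations lying more than 1.5 x IQR outside the box.
    static List<Double> outliers(double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        double lower = q1 - 1.5 * iqr;
        double upper = q3 + 1.5 * iqr;
        List<Double> result = new ArrayList<>();
        for (double v : data) {
            if (v < lower || v > upper) {
                result.add(v);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical observations: Q1 = 30, Q3 = 70, IQR = 40, fences at -30 and 130.
        double[] data = {10, 20, 30, 40, 50, 60, 70, 80, 200};
        System.out.println(outliers(data)); // prints [200.0]
    }
}
```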
Identifying Outliers using Z-Score
• Measuring how many standard deviations an observation is away from the mean
  • $z_i = \frac{x_i - \mu}{\sigma}$
    where $\mu$ represents the average of the variable and $\sigma$ its standard deviation
• A practical rule of thumb then defines outliers as observations whose absolute z-score |z| is bigger than 3

ID   Age   Z-Score
1    30    (30 - 40)/10 = -1
2    50    (50 - 40)/10 = +1
3    10    (10 - 40)/10 = -3
4    40    (40 - 40)/10 = 0
5    60    (60 - 40)/10 = +2
6    80    (80 - 40)/10 = +4
--   ..    ...

Age: μ = 40, σ = 10
Z-Score: μ = 0, σ = 1
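The table can be reproduced with a short plain-Java sketch. The mean (40) and standard deviation (10) are taken from the slide as given; they describe the whole variable, not just the six rows shown. The class and method names are hypothetical.

```java
class ZScoreOutliers {
    // z = (x - mean) / stdDev: how many standard deviations x lies from the mean.
    static double zScore(double x, double mean, double stdDev) {
        return (x - mean) / stdDev;
    }

    // Rule of thumb from the slide: an outlier has |z| strictly bigger than 3.
    static boolean isOutlier(double x, double mean, double stdDev) {
        return Math.abs(zScore(x, mean, stdDev)) > 3.0;
    }

    public static void main(String[] args) {
        double mean = 40; // given on the slide
        double stdDev = 10; // given on the slide
        double[] ages = {30, 50, 10, 40, 60, 80};
        for (double age : ages) {
            System.out.println(age + " -> z = " + zScore(age, mean, stdDev)
                    + ", outlier = " + isOutlier(age, mean, stdDev));
        }
    }
}
```

Only the age of 80 (z = +4) is flagged. The age of 10 has z = -3, which is not strictly beyond the threshold.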
Dealing with Outliers
• Treat outliers as missing values
• Popular schemes
  • Truncation
    • Taking only values that are within the limits
  • Winsorizing
    • Limiting extreme values to reduce the effect of possibly spurious outliers
    • {92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 101.5)
      → {92, 19, 101, 58, 101, 91, 26, 78, 10, 13, -5, 101, 86, 85, 15, 89, 89, 28, -5, 41} (N = 20, mean = 55.65)
Using the Z-Scores for truncation
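The winsorizing example can be checked with a small plain-Java sketch. One assumption is made that the slide does not state: the limits are chosen by clamping the lowest 5% and highest 5% of the observations (one value per tail for N = 20) to the nearest remaining value. That choice reproduces the numbers on the slide; the class name `Winsorize` is hypothetical.

```java
import java.util.Arrays;

class Winsorize {
    // The 20 observations from the slide.
    static final double[] SLIDE_DATA = {
        92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
        -40, 101, 86, 85, 15, 89, 89, 28, -5, 41
    };

    // Clamps the lowest and highest percentPerTail percent of the values
    // to the nearest value that is not in the clamped tail.
    static double[] winsorize(double[] data, int percentPerTail) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int k = data.length * percentPerTail / 100; // values clamped per tail
        double lo = sorted[k];
        double hi = sorted[sorted.length - 1 - k];
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = Math.min(hi, Math.max(lo, data[i]));
        }
        return out;
    }

    static double mean(double[] data) {
        double sum = 0;
        for (double v : data) {
            sum += v;
        }
        return sum / data.length;
    }

    public static void main(String[] args) {
        double[] w = winsorize(SLIDE_DATA, 5);
        System.out.println("mean before: " + mean(SLIDE_DATA));
        System.out.println("mean after:  " + mean(w));
        System.out.println(Arrays.toString(w));
    }
}
```

Here 1053 is pulled down to 101 and -40 is pulled up to -5, which moves the mean from 101.5 to 55.65. Unlike truncation, no observation is dropped, so N stays at 20.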
Standardizing Data
• Scaling variables to a similar range
  • e.g., two variables: education and income
    • Education: elementary school (1), middle school (2), high school (3), college (4), graduate school (5)
    • Income: 0 ~ $5M
  • When building logistic regression models, the coefficient for income might become very small, because income values are orders of magnitude larger than education values
• Min/Max standardization
  • $X_{new} = \frac{X_{old} - \min(X_{old})}{\max(X_{old}) - \min(X_{old})} \, (newmax - newmin) + newmin$
  • Where newmax and newmin are the newly imposed maximum and minimum (e.g., 1 and 0)
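A minimal plain-Java sketch of min/max standardization, applied to the education and income variables mentioned above. The class name `MinMaxScaling` and the sample values are hypothetical, and rejecting a constant variable (max = min) is a design choice of this sketch, since the formula would otherwise divide by zero.

```java
import java.util.Arrays;

class MinMaxScaling {
    // Maps x linearly onto [newMin, newMax]:
    // (x - min) / (max - min) * (newMax - newMin) + newMin
    static double[] rescale(double[] x, double newMin, double newMax) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        if (max == min) {
            throw new IllegalArgumentException("constant variable cannot be rescaled");
        }
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = (x[i] - min) / (max - min) * (newMax - newMin) + newMin;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] education = {1, 2, 3, 4, 5};
        double[] income = {0, 1_250_000, 5_000_000};
        System.out.println(Arrays.toString(rescale(education, 0, 1)));
        System.out.println(Arrays.toString(rescale(income, 0, 1)));
    }
}
```

After rescaling, both variables live on [0, 1], so their regression coefficients become comparable in magnitude.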
Standardizing Data -- continued
• Z-score based
  • Calculate the z-scores
• Decimal scaling
  • $X_{new} = \frac{X_{old}}{10^{n}}$
  • Dividing by a power of 10
• Standardization is useful for regression-based approaches
  • It is not needed for decision trees
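Decimal scaling can be sketched as below. The slide only says "dividing by a power of 10"; how the exponent n is picked is an assumption of this sketch, which uses the smallest power of 10 that brings every absolute value below 1 (equivalently, n is the number of integer digits of the largest absolute value). The class name is hypothetical.

```java
import java.util.Arrays;

class DecimalScaling {
    // Divides every value by the smallest power of 10 that is strictly
    // larger than the largest absolute value (assumed choice of n).
    static double[] scale(double[] x) {
        double maxAbs = 0;
        for (double v : x) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        double divisor = 1; // 10^n, built up by repeated multiplication
        while (maxAbs / divisor >= 1) {
            divisor *= 10;
        }
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = x[i] / divisor;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical incomes; the maximum has 7 digits, so the divisor is 10^7.
        double[] income = {0, 1_250_000, 5_000_000};
        System.out.println(Arrays.toString(scale(income))); // prints [0.0, 0.125, 0.5]
    }
}
```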
Part 0. Introduction
Big Data Analytics - Big Data Technology Stack
In a nutshell
• Data Layer: Apache HDFS, Amazon AWS's S3, IBM GPFS, Microsoft Azure
• Data Processing Layer: Apache Hadoop MapReduce, Pig, Apache Spark, Cassandra, Storm, Mahout, MLlib
• Data Integration Layer: Apache Flume, Apache Kafka, Apache Sqoop
• Operations and Scheduling Layer: Apache Ambari, Apache Oozie, Apache ZooKeeper
• Data Presentation Layer: Kibana
• Security and Governance
Part 1. Large Scale Data Analytics
Introduction to MapReduce
This material is developed based on:
• Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012 -- Chapter 2
  • Download this chapter from the CS435 schedule page
• Hadoop: The Definitive Guide, Tom White, O'Reilly, 3rd Edition, 2014
• MapReduce Design Patterns, Donald Miner and Adam Shook, O'Reilly, 2013
What is MapReduce?
MapReduce [1/2]
• MapReduce is inspired by the concepts of map and reduce in Lisp.
• "Modern" MapReduce
  • Developed within Google as a mechanism for processing large amounts of raw data
    • Crawled documents or web request logs
  • Distributes these data across thousands of machines
  • The same computations are performed on each CPU with a different dataset
MapReduce [2/2]
• MapReduce provides an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance
Mapper
• Mapper maps input key/value pairs to a set of intermediate key/value pairs
  • Maps are the individual tasks that transform input records into intermediate records
  • The transformed intermediate records do not need to be of the same type as the input records
  • A given input pair may map to zero or many output pairs
• The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job
Reducer
• Reducer reduces a set of intermediate values which share a key to a smaller set of values
• Reducer has 3 primary phases
  • Shuffle, sort, and reduce
• Shuffle
  • Input to the reducer is the sorted output of the mappers
  • The framework fetches the relevant partition of the output of all the mappers via HTTP
• Sort
  • The framework groups input to the reducer by keys
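The shuffle and sort phases can be illustrated without a cluster. The plain-Java sketch below is not Hadoop code: it mimics how intermediate pairs are assigned to reducers and then grouped by key in sorted order. The partition formula follows the one used by Hadoop's default HashPartitioner, but `String.hashCode()` stands in for the hash of Hadoop's `Text` keys, which is an assumption of this sketch, as are the class and method names.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ShuffleSortSketch {
    // Convenience factory for an intermediate <key, value> pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Hash partitioning: every occurrence of a key lands on the same reducer.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Shuffle: route each pair to its partition.
    // Sort: within a partition, group values by key in sorted key order.
    static List<SortedMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<SortedMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> p : mapOutput) {
            int target = partition(p.getKey(), numReducers);
            partitions.get(target)
                      .computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                      .add(p.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(pair("Hello", 1));
        mapOutput.add(pair("World", 1));
        mapOutput.add(pair("Hello", 1));
        List<SortedMap<String, List<Integer>>> parts = shuffle(mapOutput, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("reducer " + i + " receives " + parts.get(i));
        }
    }
}
```

Each reducer then sees one key at a time together with all of its values, which is exactly the shape of input the reduce phase expects.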
MapReduce Example 1
Example 1: WordCount [1/5]
• For text files stored under /usr/joe/wordcount/input, count the number of occurrences of each word
• How do the files and directory look?

  $ bin/hadoop dfs -ls /usr/joe/wordcount/input/
  /usr/joe/wordcount/input/file01
  /usr/joe/wordcount/input/file02

  $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
  Hello World, Bye World!

  $ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
  Hello Hadoop, Goodbye to hadoop.
Example 1: WordCount [2/5]
• Run the MapReduce application

  $ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

  $ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
  Bye 1
  Goodbye 1
  Hadoop, 1
  Hello 2
  World! 1
  World, 1
  hadoop. 1
  to 1
Example 1: WordCount [3/5]
Mappers
1. Read a line
2. Tokenize the string
3. Pass the <key, value> output to the reducer
Reducers
1. Collect <key, value> pairs sharing the same key
2. Aggregate the total number of occurrences
What do you have to pass from the Mappers?
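Before looking at Hadoop code, the mapper and reducer steps above can be simulated in plain Java on a single machine. This is a sketch, not the Hadoop API: `WordCountSketch` and its methods are hypothetical names, the two input lines are the contents of file01 and file02, and a `TreeMap` stands in for the framework's sort-and-group step. It suggests one answer to the question above: each mapper passes a `<word, 1>` pair per token.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordCountSketch {
    // Map: read a line, tokenize it on whitespace, emit <word, 1> per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Reduce: aggregate all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map on every line, groups the pairs by key (sorted), then reduces.
    static SortedMap<String, Integer> wordCount(List<String> lines) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> p : map(line)) {
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                       .add(p.getValue());
            }
        }
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello World, Bye World!",
                "Hello Hadoop, Goodbye to hadoop.");
        for (Map.Entry<String, Integer> e : wordCount(lines).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Because tokens are split on whitespace only, punctuation stays attached, so "World," and "World!" are counted as different words, and "Hadoop," and "hadoop." differ by case as well.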
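Example 1: WordCount [4/5]

  // Imports needed by the mapper
  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Static nested class, to be placed inside the job's driver class.
  // Input: <byte offset, line of text>; output: <word, 1>.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);   // emit <word, 1>
          }
      }
  }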
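Example 1: WordCount [5/5]

  // Imports needed by the reducer
  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Static nested class, to be placed inside the job's driver class.
  // Input: <word, list of counts>; output: <word, total count>.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
              sum += val.get();   // add up the 1s emitted for this word
          }
          context.write(key, new IntWritable(sum));
      }
  }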
Questions?