cloudera academic partnership 2 - infolab | welcome · author: seon kim created date: 10/12/2015...

102
2"2 © Copyright 2010/2014 Cloudera. All rights reserved. Not to be reproduced without prior wri>en consent. The Hadoop Ecosystem Chapter 2.1

Upload: danganh

Post on 05-Jun-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"2$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The"Hadoop"Ecosystem"Chapter"2.1"

Page 2: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"3$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! What$other$projects$exist$around$core$Hadoop?$

! When$to$use$HBase?$

! How$does$Spark$compare$to$MapReduce?$

! What$is$the$differences$between$Hive,$Pig,$and$Impala?$

! How$is$Flume$typically$deployed?$

$

The"Hadoop"Ecosystem"

Page 3: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"4$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Chapter"Topics"

The$Hadoop$Ecosystem$

!! IntroducLon$

!! Data"Storage:"HBase"

!! Data"IntegraMon:"Flume"and"Sqoop"

!! Data"Processing:"Spark"

!! Data"Analysis:"Hive,"Pig,"and"Impala"

!!Workflow"Engine:"Oozie"

!!Machine"Learning:"Mahout"

Page 4: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"5$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Hadoop"Distributed"File"System"

MapReduce"

Hive" Pig"Impala"Sqoop"

The"Hadoop"Ecosystem"(1)"

Oozie" …"Flume"HBase"

Hadoop"

Ecosystem"

Hadoop"Core"

Components"

CDH"

Page 5: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"6$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Hive" Pig"Impala"Sqoop"

$

! Next,$a$discussion$of$the$key$Hadoop$ecosystem$components$

The"Hadoop"Ecosystem"(2)"

Oozie" …"Flume"HBase"

Hadoop"

Ecosystem"

Page 6: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"7$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Ecosystem$projects$may$be$– Built"on"HDFS"and"MapReduce"

– Built"on"just"HDFS"– Designed"to"integrate"with"or"support"Hadoop"

! Most$are$Apache$projects$or$Apache$Incubator$projects$– Some"others"are"not"managed"by"the"Apache"So]ware"FoundaMon"

– These"are"o]en"hosted"on"GitHub"or"a"similar"repository"

! Following$is$an$introducLon$to$some$of$the$most$significant$projects$

The"Hadoop"Ecosystem"(3)"

Page 7: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"9$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! HBase$is$the$Hadoop$database$

! A$‘NoSQL’$datastore$

! Can$store$massive$amounts$of$data$– Petabytes+"

! High$write$throughput$– Scales"to"hundreds"of"thousands"of"inserts"per"second"

! Handles$sparse$data$well$– No"wasted"spaces"for"empty"columns"in"a"row"

! Limited$access$model$– OpMmized"for"lookup"of"a"row"by"key"rather"than"full"queries"

– No"transacMons:"single"row"operaMons"only"– Only"one"column"(the"‘row"key’)"is"indexed"

HBase"

Page 8: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"10$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

HBase"vs"TradiMonal"RDBMSs"

RDBMS$ HBase$Data$layout$ Row/oriented" Column/oriented"

TransacLons$ Yes" Single"row"only"

Query$language$ SQL" get/put/scan"(or"use"Hive"

or"Impala)"

Security$ AuthenMcaMon/AuthorizaMon" Kerberos"

Indexes$ Any"column" Row/key"only"

Max$data$size$ TBs" PB+"

Read/write$throughput$(queries$per$second)$

Thousands" Millions"

Page 9: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"11$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Use$plain$HDFS$if…$– You"only"append"to"your"dataset""(no"random"write)"

– You"usually"read"the"whole"dataset"(no"random"read)"

! Use$HBase$if…$– You"need"random"write"and/or"read"

– You"do"thousands"of"operaMons"per"second""on"TB+"of"data"

! Use$an$RDBMS$if…$– Your"data"fits"on"one"big"node"– You"need"full"transacMon"support"– You"need"real/Mme"query"capabiliMes"

When"To"Use"HBase"

Page 10: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"13$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! What$is$Flume?$– A"service"to"move"large"amounts"of"data"in"real"Mme"

– Example:"storing"log"files"in"HDFS"

! Flume$imports$data$into$HDFS$as$it$is$generated$– Instead"of"batch/processing"it"later"– For"example,"log"files"from"a"Web"server"

! Flume$is$– Distributed"– Reliable"and"available"– Horizontally"scalable""– Extensible"

Flume:"Real/Mme"Data"Import"

Page 11: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"14$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Flume:"High/Level"Overview"

Agent"" Agent" Agent"

Agent" Agent$

Agent(s)$

Agent"

compress$encrypt$

•  Pre/process"data"before"storing"•   "e.g.,"transform,"scrub,"enrich"

•  Store"in"any"format"

•  Text,"compressed,"binary,"or"

custom"sink"

•  Collect"data"as"it"is"produced"•   Files,"syslogs,"stdout"or"custom"source"

"Agent""

•  Process"in"place""•   e.g.,"encrypt,"compress"

•  Write"in"parallel"

•  Scalable"throughput"

HDFS"

Page 12: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"15$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Sqoop$transfers$data$between$RDBMSs$and$HDFS$– Does"this"very"efficiently"via"a"Map/only"MapReduce"job"

– Supports"JDBC,"ODBC,"and"several"specific"databases"– “Sqoop”"="“SQL"to"Hadoop”"

Sqoop:"Exchanging"Data"With"RDBMSs"

HDFS"

Sqoop"

"""

"

RDBMS"

Page 13: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"16$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Custom$connectors$for$– MySQL"

– Postgres"– Netezza"– Teradata"– Oracle"(partnered"with"Quest"So]ware)"

! Not$open$source,$but$free$to$use$

Sqoop"Custom"Connectors"

Page 14: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"18$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Apache$Spark$is$a$fast,$general$engine$for$large"scale$$data$processing$on$a$cluster$

! Originally$developed$UC$Berkeley’s$AMPLab$

! Open$source$Apache$project$

! Provides$several$benefits$over$MapReduce$– Faster"– Be>er"suited"for"iteraMve"algorithms"

– Can"hold"intermediate"data"in"RAM,"resulMng"in"much"be>er"

performance"

– Easier"API"– Supports"Python,"Scala,"Java"

– Supports"real/Mme"streaming"data"processing"

Apache"Spark"

Page 15: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"19$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! MapReduce$– Widely"used,"huge"investment"already"made"

– Supports"and"supported"by"many"complementary"tools"

– Mature,"well/tested"

! Spark$– Flexible"– Elegant""– Fast"– Supports"real/Mme"streaming"data"processing"

! Over$Lme,$Spark$is$expected$to$supplant$MapReduce$as$the$general$processing$framework$used$by$most$organizaLons$

Spark"vs"Hadoop"MapReduce"

Page 16: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"21$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! The$moLvaLon:$MapReduce$is$powerful$$but$hard$to$master$

! The$soluLon:$Hive$and$Pig$$– Languages"for"querying"and"manipulaMng"data"

– Leverage"exisMng"skillsets"– Data"analysts"who"use"SQL"– Programmers"who"use"scripMng"languages""

– Open"source"Apache"projects"– Hive"iniMally"developed"at"Facebook"– Pig"IniMally"developed"at"Yahoo!"

! Interpreter$runs$on$a$client$machine$– Turns"queries"into"MapReduce"jobs"

– Submits"jobs"to"the"cluster"

Hive"and"Pig:"High"Level"Data"Languages"

Page 17: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"22$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Hive"

! What$is$Hive?$– HiveQL:"An"SQL/like"interface"to"Hadoop"

$

SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid

Page 18: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"23$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Pig"

! What$is$Pig?$– Pig$LaLn:"A"dataflow"language"for"transforming"large"data"sets"

purchases = LOAD "/user/dave/purchases" AS (itemID, price, storeID, purchaserID);

bigticket = FILTER purchases BY price > 10000; ...

Page 19: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"24$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Hive"vs."Pig"

Hive$ Pig$

Language$ HiveQL"(SQL/like)" Pig"LaMn"(dataflow"

language)"

Schema$ Table"definiMons"stored"in"

a"metastore"

Schema"opMonally"defined"

at"runMme"

ProgrammaLc$access$ JDBC,"ODBC" PigServer"(Java"API)"

JDBC:"Java"Database"ConnecMvity"

ODBC:"Open"Database"ConnecMvity"

Page 20: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"25$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! High"performance$SQL$engine$for$vast$amounts$of$data$– Similar"query"language"to"HiveQL""

– 10"to"50+"Mmes"faster"than"Hive,"Pig,"or"MapReduce"

! Impala$runs$on$Hadoop$clusters$– Data"stored"in"HDFS"– Does"not"use"MapReduce"

! Developed$by$Cloudera$– 100%"open"source,"released"under"the"Apache"so]ware"license"

Impala:"High"Performance"Queries"

Page 21: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"26$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Use$Impala$when…$– You"need"near"real/Mme"responses"to"ad"hoc"queries"

– You"have"structured"data"with"a"defined"schema"

! Use$Hive$or$Pig$when…$– You"need"support"for"custom"file"types,"complex"data"types,"or"external"

funcMons"

! Use$Pig$when…$– You"have"developers"experienced"with"wriMng"scripts"– Your"data"is"unstructured/mulM/structured"

! Use$Hive$When…$– You"have"analysts"familiar"with"SQL"

– You"are"integraMng"with"BI"or"reporMng"tools"via"ODBC/JDBC"

Which"to"Choose?"

Page 22: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"30$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Mahout$is$a$Machine$Learning$library$wrigen$in$Java$$

! Used$for$– CollaboraMve"filtering"(recommendaMons)"

– Clustering"(finding"naturally"occurring"“groupings”"in"data)"– ClassificaMon"(determining"whether"new"data"fits"a"category)"

! Why$use$Hadoop$for$Machine$Learning?$– “It’s"not"who"has"the"best"algorithms"that"wins."It’s"who"has"the"most"

data.”"

Mahout"

Page 23: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"31$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Hadoop$Ecosystem$– Many"projects"built"on,"and"supporMng,"Hadoop"

– Several"will"be"covered"in"detail"later"in"the"course"

Key"Points"

Page 24: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"32$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The$following$offer$more$informaLon$on$topics$discussed$in$this$chapter

! HBase$was$inspired$by$Google’s$“BigTable”$paper$presented$at$OSDI$in$2006$– http://research.google.com/archive/bigtable-osdi06.pdf

! An$interesLng$link$comparing$NoSQL$databases$– http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html

! AMPLab$– https://amplab.cs.berkeley.edu

! Databricks$– http://databricks.com

Bibliography"

Page 25: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"33$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The$following$offer$more$informaLon$on$topics$discussed$in$this$chapter

! Spark$– http://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html

! Dremel$is$a$distributed$system$for$interacLve$ad"hoc$queries$that$was$created$by$Google$ – http://research.google.com/pubs/archive/36632.pdf

! Impala,$developed$by$Cloudera,$supports$the$same$inner,$outer,$and$semi"joins$that$Hive$does – http://tiny.cloudera.com/dac15b

Bibliography""(cont’d)"

Page 26: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"34$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Managing"Your"Hadoop"SoluMon"Chapter"2.2"

Page 27: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"35$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! What$is$the$typical$architecture$of$a$data$center$with$Hadoop?$

! What$are$typical$hardware$requirements$for$a$Hadoop$cluster?$

Managing"Your"Hadoop"SoluMon"

Page 28: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"36$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Chapter"Topics"

Managing$Your$Hadoop$SoluLon$

!! Hadoop$in$the$Data$Center$

!! Cluster"Hardware"

Page 29: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"37$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

A"Typical"Data"Center"With"Hadoop"

Page 30: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"38$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Technology"Strengths"and"Weaknesses"

Technology$ Strengths$ Drawbacks$

Hadoop" •! Scales"to"petabytes"•! Import/export"tools"

•! Flexible"structure"•! Commodity"hardware"

•! Query"speed"•! No"transacMon"support"

RelaMonal"

Databases"

•! Complex"transacMons"

•! 1000s"of"queries/second"•! SQL"

•! Data"must"fit"into"rows"and"columns"

•! Schema"is"costly"to"change"

•! Full"table"scans"are"slow"

Data"warehouses" •! ReporMng"capabiliMes"•! Up"to"100s"of"terabytes"

•! Dimensions"require"pre/materializaMon"

File"servers"(SAN/

NAS)"

•! Serving"individual"files"•! Write"caches"

•! Cost"•! Reading"large"porMons"of"the"data"saturates"the"network"

Backup"systems"

(tape)"

•! Cheap" •! Expensive"to"retrieve"the"data"

SAN:"Storage"Area"Network"

NAS:"Network/A>ached"Storage"

Page 31: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"39$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Example"Data"Flow"

Web"server"logs"

Orders"

Site"Content/

RecommendaMons"

Hadoop$

ETL"

RecommendaMons"

HDFS" Enterprise"

Data""

Warehouse"

Sqoop"

(nightly)"

Flume""

(real"Mme)"

Sqoop"

(nightly)"

Sqoop"

HBase"

Page 32: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"40$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Chapter"Topics"

Managing$Your$Hadoop$SoluLon$

!! Hadoop"in"the"Data"Center"

!! Cluster$Hardware$

Page 33: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"41$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Cluster"Hardware"

Name"

Node"

JobTracker"or"

Resource"

Manager"

Slave"

Nodes"

Master""

Nodes"Client"

Master""

Nodes"

Page 34: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"42$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Processors$– Mid/grade"processors"(e.g.,"2"x"6/core"2.9"GHz)"

! Memory$– 48/96GB"RAM"

! Network$– 1Gb"Ethernet"(mid/range)"

– 10Gb"Ethernet"(high/end)"! Disk$Drives$

– 6"x"2TB"drives"per"machine"(mid/range)"

– 12"x"3TB"drives"per"machine"(high/end)"

– Non/RAID""

Slave"Nodes:"Recommended"ConfiguraMons"(1)"

RAID:"Redundant"Array"of"Inexpensive"Disks"

Page 35: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"43$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Switch$– Dedicated"switching"infrastructure"required"because"Hadoop"can"saturate"the"network""

– “All"nodes"talking"to"all"nodes”"! Cost$

– Per"slave"node"cost"should"be"around"$4,000"to"$10,000"(2014"esMmate)"

Slave"Nodes:"Recommended"ConfiguraMons"(2)"

Page 36: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"44$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Four$master$nodes$– NameNode"(acMve"and"standby)"

– JobTracker"(MRv1)"or"ResourceManager"(MRv2)"(acMve"and"standby)"

– Some"installaMons"just"use"a"single"JobTracker"or"Resource"

Manager"

! Each$typically$runs$on$a$separate$machine$

! Master$node$machines$are$high"quality$servers$– Carrier/class"hardware"– Dual"power"supplies,"Ethernet"cards"– RAIDed"hard"drives"– 24GB"RAM"for"clusters"of"20"nodes"or"less"

– 48GB"RAM"for"clusters"of"up"to"300"nodes"

– 96GB"RAM"for"larger"clusters"

Master"Nodes"Are"More"Important"

Page 37: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"45$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Basing$your$cluster$growth$on$storage$capacity$is$omen$a$good$method$to$use$

! Example:$– Data"grows"by"approximately"3TB"per"week/40TB"per"quarter"

– Hadoop"replicates"3"Mmes"="120TB"

– Extra"space"required"for"temporary"data"while"running"jobs"(~30%)"="

160TB"

– Assuming"machines"with"12"x"3TB"hard"drives"

– 4/5"new"machines"per"quarter"

– Two"years"of"data"="1.3PB"• Requires"approximately"36"machines"

Capacity"Planning"(1)"

Page 38: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"46$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! New$nodes$are$automaLcally$used$by$Hadoop$

! Many$clusters$start$small$(less$than$10$nodes)$and$scale$up$as$data$and$processing$demands$grow$

! Hadoop$clusters$can$grow$to$thousands$of$nodes$

Capacity"Planning"(2)"

Page 39: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"47$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The$following$offer$more$informaLon$on$topics$discussed$in$this$chapter

! Slave$and$master$node$configuraLon$recommendaLons:$Hadoop$OperaLons,$first$ediLon$(1e),$by$Eric$Sammer,$p.$46$and$50.$

Bibliography"

Page 40: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"48$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

IntroducMon"to"MapReduce"Chapter"2.3"

Page 41: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"49$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Concepts$behind$MapReduce$

! How$does$data$flow$through$MapReduce$stages$

! Typical$uses$of$Mappers$

! Typical$uses$of$Reducers$

IntroducMon"to"MapReduce"

Page 42: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"50$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Chapter"Topics"

IntroducLon$to$MapReduce$

!!MapReduce$Overview$

!! Example:"WordCount"

!!Mappers"

!! Reducers"

Page 43: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"51$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! AutomaLc$parallelizaLon$and$distribuLon$

! Fault"tolerance$

! A$clean$abstracLon$for$programmers$– MapReduce"programs"are"usually"wri>en"in"Java"

– Can"be"wri>en"in"any"language"using"Hadoop&Streaming"– All"of"Hadoop"is"wri>en"in"Java"

! MapReduce$abstracts$all$the$‘housekeeping’$away$from$the$developer$– Developer"can"simply"concentrate"on"wriMng"the"Map"and"Reduce"

funcMons"

Review"/"Features"of"MapReduce"

Page 44: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"52$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Review"–"Key"MapReduce"Stages"

! The$Mapper$– Each"Map"task"(typically)"operates"on"a"single"HDFS"

block"

– Map"tasks"(usually)"run"on"the"node"where"the"

block"is"stored"

! Shuffle$and$Sort$– Sorts"and"consolidates"intermediate"data"from"all"

mappers"

– Happens"a]er"all"Map"tasks"are"complete"and"

before"Reduce"tasks"start"

! The$Reducer$– Operates"on"shuffled/sorted"intermediate"data"

(Map"task"output)"

– Produces"final"output"

Map"

Reduce"

Shuffle""

and"Sort"

Page 45: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"53$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The"MapReduce"Flow"

Input"Split"1"

Input"

File(s)"Input"Split"2" Input"Split"3"

Input"Format"

Page 46: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"54$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The"MapReduce"Flow"

Record"Reader"

Input"Split"1"

Mapper"

Input"

File(s)"

Record"Reader"

Input"Split"2"

Mapper"

Record"Reader"

Input"Split"3"

Mapper"

Input"Format"

Page 47: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"55$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The"MapReduce"Flow"

Record"Reader"

ParMMoner"

Input"Split"1"

Shuffle"and"Sort"

Mapper"

Input"

File(s)"

Record"Reader"

ParMMoner"

Input"Split"2"

Mapper"

Record"Reader"

ParMMoner"

Input"Split"3"

Mapper"

Input"Format"

Page 48: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"56$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The"MapReduce"Flow"

Record"Reader"

ParMMoner"

Input"Split"1"

Shuffle"and"Sort"

Mapper"

Reducer"

Input"

File(s)"

Reducer"

Record"Reader"

ParMMoner"

Input"Split"2"

Mapper"

Record"Reader"

ParMMoner"

Input"Split"3"

Mapper"

Input"Format"

Page 49: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"57$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The"MapReduce"Flow"

Record"Reader"

ParMMoner"

Input"Split"1"

Shuffle"and"Sort"

Mapper"

Reducer"

Output"Format"

Output"File"

Input"

File(s)"

Reducer"

Output"Format"

Output"File"

Record"Reader"

ParMMoner"

Input"Split"2"

Mapper"

Record"Reader"

ParMMoner"

Input"Split"3"

Mapper"

Input"Format"

Page 50: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"58$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The"MapReduce"Flow"

Record"Reader"

ParMMoner"

Input"Split"1"

Shuffle"and"Sort"

Mapper"

Reducer"

Output"Format"

Output"File"

Input"

File(s)"

Reducer"

Output"Format"

Output"File"

Record"Reader"

ParMMoner"

Input"Split"2"

Mapper"

Record"Reader"

ParMMoner"

Input"Split"3"

Mapper"

Input"Format"

Page 51: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"59$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Chapter"Topics"

IntroducLon$to$MapReduce$

!!MapReduce"Overview"

!! Review:$WordCount$Example$

!!Mappers"

!! Reducers"

Page 52: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"60$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Example:"Word"Count"

the cat sat on the mat the aardvark sat on the sofa "

Input"Data"

Result"

aardvark 1

cat 1

mat 1

on 2

sat 2

sofa 1

the 4

Map" Reduce"

Page 53: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"61$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Example:"The"WordCount"Mapper"(1)"

the cat sat on the mat the aardvark sat on the sofa …

Input"Data"(HDFS"file)"

0 the cat sat on the mat

23 the aardvark sat on the sofa

52 …

… …

Record"Reader"

Mapper"

Page 54: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"62$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Mapper"the 1

cat 1

sat 1

on 1

the 1

mat 1

Example:"The"WordCount"Mapper"(2)"

the cat sat on the mat the aardvark sat on the sofa …

map()"

map()"

the 1

aardvark 1

sat 1

on 1

the 1

sofa 1

Input"Data"(HDFS"file)"

0 the cat sat on the mat

23 the aardvark sat on the sofa

52 …

… …

Record"Reader"

Page 55: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"63$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Example:"WordCount"Shuffle"and"Sort"

drove 1

elephant 1

mahout 1

the 1,1

the 1

cat 1

sat 1

on 1

the 1

mat 1

the 1

aardvark 1

sat 1

on 1

the 1

sofa 1

the 1

mahout 1

drove 1

the 1

elephant 1

Mapper"

Mapper"

aardvark 1

cat 1

mat 1

on 1,1

sat 1,1

sofa 1

the 1,1,1,1

Node"1"

Node"2"

aardvark 1

cat 1

mat 1

elephant 1

mahout 1

sat 1,1

drove 1

on 1,1

sofa 1

the 1,1,1,1,1,1

Page 56: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"64$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Example:"SumReducer"(1)"

"

Reducer"0"

"

Reducer"1"

"

Reducer"2"

"

part-r-00002

"

part-r-00001

"

part-r-00000

aardvark 1

cat 1

mat 1

drove 1

on 2

sofa 1

the 6

elephant 1

mahout 1

sat 2

Final"Output"

aardvark 1

cat 1

mat 1

elephant 1

mahout 1

sat 1,1

drove 1

on 1,1

sofa 1

the 1,1,1,1,1,1

Page 57: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"65$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

drove 1

on 1,1

sofa 1

the 1,1,1,1,1,1

"

Example:"SumReducer"(2)"

Reducer"2"

reduce()"

reduce()"

reduce()"

reduce()"

drove 1

on 2

sofa 1

the 6

Final"Output"

part-r-00002 HDFS"File"

Page 58: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"66$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Chapter"Topics"

IntroducLon$to$MapReduce$

!!MapReduce"Overview"

!! Example:"WordCount"

!!Mappers$

!! Reducers"

Page 59: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"67$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! The$Mapper$$– Input:"key/value"pair"– Output:"A"list"of"zero"or"more"key"value"pairs"

MapReduce:"The"Mapper"(1)"

map(in_key, in_value) " (inter_key, inter_value) list

intermediate key 1 value 1

intermediate key 2 value 2

intermediate key 3 value 3

… …

input key

input value map"

Page 60: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"68$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! The$Mapper$may$use$or$completely$ignore$the$input$key$– For"example,"a"standard"pa>ern"is"to"read"one"line"of"a"file"at"a"Mme"

– The"key"is"the"byte"offset"into"the"file"at"which"the"line"starts"– The"value"is"the"contents"of"the"line"itself"– Typically"the"key"is"considered"irrelevant"

! If$the$Mapper$writes$anything$out,$the$output$must$be$in$the$form$of$$key/value$pairs$

MapReduce:"The"Mapper"(2)"

the 1

aardvark 1

sat 1

on 1

the 1

sofa 1

23 the aardvark sat on the sofa map"

Page 61: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"69$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Turn$input$into$upper$case$(pseudo"code):$

Example"Mapper:"Upper"Case"Mapper"

let map(k, v) = emit(k.toUpper(), v.toUpper())

mahout an elephant driver

map()"bugaboo an object of fear

or alarm

bumbershoot umbrella

MAHOUT AN ELEPHANT DRIVER

BUGABOO AN OBJECT OF FEAR OR ALARM

BUMBERSHOOT UMBRELLA

map()"

map()"

Page 62: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"70$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Output$each$input$character$separately$(pseudo"code):$

Example"Mapper:"‘Explode’"Mapper"

let map(k, v) = foreach char c in v: emit (k, c)

map()"

pi 3.14

pi 3

pi .

pi 1

pi 4

145 kale

145 k

145 a

145 l

145 e

map()"

Page 63: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"71$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Only$output$key/value$pairs$where$the$input$value$is$a$prime$number$(pseudo"code):$

Example"Mapper:"‘Filter’"Mapper"

let map(k, v) = if (isPrime(v)) then emit(k, v)

map()"

pi 3.14

foo 13

map()"

48 7 48 7 map()"

5 12

foo 13

map()"

Page 64: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"72$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! The$key$output$by$the$Mapper$does$not$need$to$be$idenLcal$to$the$input$key$

! Example:$output$the$word$length$as$the$key$(pseudo"code):$

Example"Mapper:"Changing"Keyspaces"

let map(k, v) = emit(v.length(), v)

002 aim map()"

001 hadoop map()"

003 ridiculous map()"

3 aim

6 hadoop

10 ridiculous

Page 65: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"73$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Emit$the$key,value$pair$(pseudo"code):$

Example"Mapper:"IdenMty"Mapper"

let map(k, v) = emit(k,v)

mahout an elephant driver

map()"bugaboo an object of fear

or alarm

bumbershoot umbrella

map()"

map()"

mahout an elephant driver

bugaboo an object of fear or alarm

bumbershoot umbrella

Page 66: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"74$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Chapter"Topics"

IntroducLon$to$MapReduce$

!!MapReduce"Overview"

!! Example:"WordCount"

!!Mappers"

!! Reducers$

Page 67: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"75$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Amer$the$Map$phase$is$over,$all$intermediate$values$for$a$given$intermediate$key$are$grouped$together$

! Each$key$and$value$list$is$passed$to$a$Reducer$– All"values"for"a"parMcular"intermediate"key"go"to"the"same"Reducer"

– The"intermediate"keys/value"lists"are"passed"in"sorted"key"order"

Shuffle"and"Sort"

html 891

gif 1231

jpg 3992

html 788

gif 3997

html 344

gif 1231

3997

jpg 3992

html

344

891

788

gif 1231

3997

jpg 3992

html

344

891

788

Reducer"

Reducer"

Page 68: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"76$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! The$Reducer$outputs$zero$or$more$final$key/value$pairs$– In"pracMce,"usually"emits"a"single"key/value"pair"for"each"input"key"

– These"are"wri>en"to"HDFS"

The"Reducer"

gif 1231

3997

html

344

891

788

reduce()"

html 1498

reduce(inter_key, [v1, v2, …]) " (result_key, result_value)

gif 2614

reduce()"

Page 69: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"77$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Add$up$all$the$values$associated$with$each$intermediate$key$(pseudo"code):$

Example"Reducer:"Sum"Reducer"

let reduce(k, vals) = sum = 0 foreach int i in vals: sum += i emit(k, sum)

SKU0021

34

8

19

the

1

1

1

1

SKU0021 61

the 4 reduce()"

reduce()"

Page 70: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"78$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Find$the$mean$of$all$the$values$associated$with$each$intermediate$key$(pseudo"code):$

Example"Reducer:"Average"Reducer"

let reduce(k, vals) = sum = 0; counter = 0; foreach int i in vals: sum += i; counter += 1; emit(k, sum/counter)

SKU0021

34

8

19

the

1

1

1

1

SKU0021 20.33

the 1 reduce()"

reduce()"

Page 71: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"79$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! The$IdenLty$Reducer$is$very$common$(pseudo"code):$

Example"Reducer:"IdenMty"Reducer"

let reduce(k, vals) = foreach v in vals: emit(k, v)

28

2

2

7

bow

a knot with two loops and two loose ends

a weapon for shooting arrows

a bending of the head or body in respect

bow a knot with two loops and two loose ends

bow a weapon for shooting arrows

bow a bending of the head or body in respect

28 2

28 2

28 7

reduce()"

reduce()"

Page 72: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"80$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! A$MapReduce$program$has$two$major$developer"created$components:$a$Mapper$and$a$Reducer$

! Mappers$map$input$data$to$intermediate$key/value$pairs$– O]en"parse,"filter,"or"transform"the"data"

! Reducers$process$Mapper$output$into$final$key/value$pairs$– O]en"aggregate"data"using"staMsMcal"funcMons"

Key"Points"

Page 73: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"81$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Hadoop"Clusters"Chapter"2.4"

Page 74: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"82$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Components$of$a$Hadoop$cluster$

! How$do$Hadoop$jobs$and$tasks$run$on$a$cluster$

! How$does$a$job’s$data$flow$in$a$Hadoop$cluster$

Hadoop"Clusters"

Page 75: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"83$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Chapter"Topics"

Hadoop$Clusters$

!! Hadoop$Cluster$Overview$

!! Hadoop"Jobs"and"Tasks"

Page 76: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"84$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Amer$tesLng$a$Hadoop$soluLon$on$a$small$sample$dataset,$the$soluLon$can$be$run$on$a$Hadoop$cluster$to$process$all$data$

! Clusters$are$typically$installed$and$maintained$by$a$system$administrator$

! Developers$benefit$by$understanding$how$the$components$of$a$Hadoop$cluster$work$together$

Installing"A"Hadoop"Cluster"(1)"

Page 77: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"85$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Difficult$– Download,"install,"and"integrate"individual"Hadoop"components"

directly"from"Apache"

! Easier:$CDH$– Cloudera’s"DistribuMon"for"Apache"Hadoop"– Vanilla"Hadoop"plus"many"patches,"backports,"bug"fixes"

– Includes"many"other"components"from"the"Hadoop"ecosystem"

"

! Easiest:$Cloudera$Manager$– Wizard/based"UI"to"install,"configure"and"manage"a"Hadoop"

cluster"

– Included"with"Cloudera"Standard"(free)"or"Cloudera"Enterprise"

Installing"A"Hadoop"Cluster"(2)"

Page 78: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"86$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

MapReduce"

Master"

Nodes"

Hadoop"Cluster"Terminology"

Slave"Node"

HDFS""

Master"

Nodes"

! A$Hadoop$cluster$is$a$group$of$computers$working$together$– Usually"runs"HDFS"and"MapReduce""

! A$node$is$an$individual$computer$in$the$cluster$– Master"nodes"manage"distribuMon"of"work"and"data"to"slave"nodes"

! A$daemon$is$a$program$running$on$a$node$– Each"performs"different"funcMons"in"the"cluster"

Slave"Node"

Slave"Node"

Slave"Node"

…""

Page 79: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"87$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! HDFS$daemons$– NameNode"–"holds"the"metadata"for"HDFS""

– Typically"two"on"a"producMon"cluster:"one"acMve,"one"standby"– DataNode"–"holds"the"actual"HDFS"data"

– One"per"slave"node"

Hadoop"Daemons:"HDFS"

DataNode"

Name"Node""(AcMve"+"

Standby)"

DataNode"

DataNode"

DataNode"

…""

Page 80: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"88$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! MapReduce$v1$(“MRv1”$or$“Classic$MapReduce”)$– Uses"a"JobTracker/TaskTracker"architecture"– One"JobTracker"per"cluster"–"limits"cluster"size"to"about"4000"nodes"

– Slots"on"slave"nodes"designated"for"Map"or"Reduce"tasks"

! MapReduce$v2$(“MRv2”)$– Built"on"top"of"YARN"(Yet"Another"Resource"NegoMator)"– Uses"ResourceManager/NodeManager"architecture"

– Increases"scalability"of"cluster"– Node"resources"can"be"used"for"any"type"of"task"

– Improves"cluster"uMlizaMon"

– Support"for"non/MR"jobs"

MapReduce"v1"and"v2"(1)"

Page 81: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"89$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! MRv1$daemons$– JobTracker"–"one"per"cluster"

– Manages"MapReduce"jobs,"distributes"individual"tasks"to"

TaskTrackers"

– TaskTracker"–"one"per"slave"node"– Starts"and"monitors"individual"Map"and"Reduce"tasks"

Hadoop"Daemons:"MapReduce"v1"

JobTracker"(AcMve"+"

Standby)"

TaskTracker"

TaskTracker"

TaskTracker"

…""

Page 82: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"90$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

•   Manage"data"storage"

•   Hold"metadata"

Basic"Cluster"ConfiguraMon:"HDFS"

DataNode" DataNode"DataNode"DataNode"Slave"

Nodes"

HDFS""

Master"

Nodes"

Name"

Node"(AcMve"

+"Standby)"

Page 83: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"91$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

DataNode" DataNode"DataNode"DataNode"Slave"

Nodes"

•   Manages"MR"jobs"

•   Distributes"tasks"to"slave"nodes"

Basic"Cluster"ConfiguraMon:"HDFS"+"MapReduce"v1"

HDFS""

Master"

Nodes"

MapReduce"

Master""

Nodes"

Client"

•   Manage"data"storage"

•   Hold"metadata"

TaskTracker" TaskTracker" TaskTracker" TaskTracker"

Name"

Node"(AcMve"

+"Standby)"

Job""

Tracker"(AcMve"+"

Standby)"

Task" Task" Task"

MR"Job"

Page 84: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"92$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! MRv2$daemons$– ResourceManager"–"one"per"cluster"

– Starts"ApplicaMonMasters,"allocates"resources"on"slave"nodes"

– ApplicaMonMaster"–"one"per"job"

– Requests"resources,"manages"individual"Map"and"Reduce"tasks"

– NodeManager"–"one"per"slave"node"

– Manages"resources"on"individual"slave"nodes"

– JobHistory"–"one"per"cluster"– Archives"jobs’"metrics"and"metadata"

Hadoop"Daemons:"MapReduce"v2"

Resource"

Manager"(AcMve"+"Standby)"

NodeManager"

NodeManager"ApplicaMonMaster"

NodeManager"

…""

Job"History"

Server"

Page 85: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"93$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

•   Manage"data"storage"

•   Hold"metadata"

Basic"Cluster"ConfiguraMon:"HDFS"

DataNode" DataNode"DataNode"DataNode"Slave"

Nodes"

HDFS""

Master"

Nodes"

Name"

Node"(AcMve"

+"Standby)"

Note:"This"slide"is"the"same"as"the"earlier"HDFS"slide;"there"is"no"change"in"the"HDFS"

design"for"YARN/MapReduce"v2"because"HDFS"is"the"storage"side"of"Hadoop."YARN/

MapReduce"v2"implement"the"compute"side"of"Hadoop."

Page 86: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"94$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

DataNode" DataNode"DataNode"DataNode"Slave"

Nodes"

•   Allocates"resources"•   Launches"ApplicaMon"Masters"

Basic"Cluster"ConfiguraMon:"HDFS"+"MapReduce"v2"

Resource"

Manager"(AcMve"+"

Standby)"

HDFS""

Master"

Nodes"

MapReduce"

Master""

Nodes"

Client"

•   Manage"data"storage"

•   Hold"metadata"

Name"

Node"(AcMve"

+"Standby)"

NodeManager" NodeManager" NodeManager" NodeManager"

MR"Job"

Task" Task" Task"ApplicaLon$Master$

MR"Job"

Page 87: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"95$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Chapter"Topics"

Hadoop$Clusters$

!! Hadoop"Cluster"Overview"

!! Hadoop$Jobs$and$Tasks$

Page 88: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"96$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! A$job$is$a$‘full$program’$– A"complete"execuMon"of"Mappers"and"Reducers"over"a"dataset"

! A$task$is$the$execuLon$of$a$single$Mapper$or$Reducer$over$a$slice$of$data$

! A$task0a1empt$is$a$parLcular$instance$of$an$agempt$to$execute$a$task$– There"will"be"at"least"as"many"task"a>empts"as"there"are"tasks"

– If"a"task"a>empt"fails,"another"will"be"started"by"the"JobTracker"or"

ApplicaMonMaster"

– Specula3ve&execu3on"(covered"later)"can"also"result"in"more"task"

a>empts"than"completed"tasks&

Review"/"MapReduce"Terminology"

Page 89: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"97$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Submi|ng"A"Job"

MR""

Master"

Node"

Client"

XML".jar"

Job"

Page 90: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"98$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Job""

Tracker(s)"

TaskTracker"

A"MapReduce"v1"Cluster"

TaskTracker"

TaskTracker"

TaskTracker"

DataNode"

DataNode"

DataNode"

DataNode"

Name"

Node(s)"

Slave"Nodes"

Page 91: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"99$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Job""

Tracker(s)"

TaskTracker"

Running"a"Job"on"a"MapReduce"v1"Cluster"(1)"

TaskTracker"

TaskTracker"

TaskTracker"

DataNode"

DataNode"

DataNode"

DataNode"

Block1"

Block2" Name"

Node(s)"

HDFS: mydata

Slave"Nodes" $ hadoop fs –put mydata

Page 92: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"100$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Job""

Tracker(s)"

TaskTracker"

Running"a"Job"on"a"MapReduce"v1"Cluster"(2)"

Client"

TaskTracker"

TaskTracker"

TaskTracker"

DataNode"

DataNode"

DataNode"

DataNode"

Block1"

Block2"

Map"Task"1"

Map"Task"2"

Reduce"Task"1"

Reduce"Task"2"

Name"

Node(s)"

HDFS: mydata

Slave"Nodes"

Page 93: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"101$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Resource"

Manager(s)"

NodeManager"

A"MapReduce"v2"Cluster"

NodeManager"

NodeManager"

NodeManager"

DataNode"

DataNode"

DataNode"

DataNode"

Name"

Node(s)"

Job"History"

Server"

Slave"Nodes"

Page 94: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"102$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Resource"

Manager(s)"

NodeManager"

Running"a"Job"on"a"MapReduce"v2"Cluster"(1)"

Client"

NodeManager"

NodeManager"

NodeManager"

DataNode"

DataNode"

DataNode"

DataNode"

Block1"

Block2" Name"

Node(s)"

HDFS: mydata

MapReduce$ApplicaLon$Master$

Job"History"

Server"

Slave"Nodes"

Page 95: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"103$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Resource"

Manager(s)"

NodeManager"

Running"a"Job"on"a"MapReduce"v2"Cluster"(2)"

Client"

NodeManager"

NodeManager"

NodeManager"

DataNode"

DataNode"

DataNode"

DataNode"

Block1"

Block2"

Map"Task"1"

Map"Task"2"

Reduce"Task"1"

Reduce"Task"2"

Name"

Node(s)"

HDFS: mydata

MapReduce$ApplicaLon$Master$

Job"History"

Server"

Slave"Nodes"

Page 96: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"104$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Job"Data:"Mapper"Data"Locality"

When"possible,"Map"

tasks"run"on"a"node"

where"the"block"of"

data"to"be"processed"

is"stored"locally"

Otherwise,"the"Map"

task"will"transfer"

the"data"across"the"

network"and"then"

process"that"data"

Block1"

Block2"

Map"Task"1"

Map"Task"2"

HDFS"

Page 97: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"105$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Job"Data:"Intermediate"Data"

Block1"

Block2"

Map"Task"1"

Map"Task"2"

Intermediate"

Data"

Intermediate"

Data"

HDFS"

Map"task"

intermediate"data"is"

stored"on"the"local"

disk"(not"HDFS)"

Page 98: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"106$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

Job"Data:"Shuffle"and"Sort"

Block1"

Block2"

Map"Task"1"

Map"Task"2"

Reduce"Task"1"

Reduce"Task"2"

Intermediate"

Data"

Intermediate"

Data"

HDFS"

Intermediate"data"is"

transferred"across"the"

network"to"the"

Reducers"

There"is"no"concept"of"

data"locality"for"

Reducers"

Reducers"write"their"

output"to"HDFS"

Page 99: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"107$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! It$appears$that$the$shuffle$and$sort$phase$is$a$bogleneck$– The"reduce"method"in"the"Reducers"cannot"start"unMl"all"Mappers"

have"finished"

! In$pracLce,$Hadoop$will$start$to$transfer$data$from$Mappers$to$Reducers$as$soon$as$the$Mappers$finish$work$– This"avoids"a"huge"amount"of"data"transfer"starMng"as"soon"as"the"last"

Mapper"finishes"

– The"reduce"method"sMll"does"not"start"unMl"all"intermediate"data"has"

been"transferred"and"sorted"

Is"Shuffle"and"Sort"a"Bo>leneck?"

Page 100: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"108$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! It$is$possible$for$one$Map$task$to$run$more$slowly$than$the$others$– Perhaps"due"to"faulty"hardware,"or"just"a"very"slow"machine"

! It$would$appear$that$this$would$create$a$bogleneck$– The"reduce"method"in"the"Reducer"cannot"start"unMl"every"Mapper"

has"finished"

! Hadoop$uses$specula3ve0execu3on$to$miLgate$against$this$– If"a"Mapper"appears"to"be"running"significantly"more"slowly"than"the"

others,"a"new"instance"of"the"Mapper"will"be"started"on"another"

machine,"operaMng"on"the"same"data"

– A"new"task&a7empt"for"the"same"task"

– The"results"of"the"first"Mapper"to"finish"will"be"used"

– Hadoop"will"kill"off"the"Mapper"which"is"sMll"running"

Is"a"Slow"Mapper"a"Bo>leneck?"

Page 101: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"109$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

! Write$the$Mapper$and$Reducer$classes$

! Write$a$Driver$class$that$configures$the$job$and$submits$it$to$the$cluster$– Driver"classes"are"covered"in"the"“WriMng"MapReduce”"chapter"

! Compile$the$Mapper,$Reducer,$and$Driver$classes$

$

! Create$a$jar$file$with$the$Mapper,$Reducer,$and$Driver$classes$

! Run$the$hadoop jar$command$to$submit$the$job$to$the$Hadoop$cluster$

CreaMng"and"Running"a"MapReduce"Job"

$ javac -classpath `hadoop classpath` MyMapper.java MyReducer.java MyDriver.java

$ jar cf MyMR.jar MyMapper.class MyReducer.class MyDriver.class

$ hadoop jar MyMR.jar MyDriver in_file out_dir

Page 102: Cloudera Academic Partnership 2 - InfoLab | Welcome · Author: Seon Kim Created Date: 10/12/2015 8:30:35 PM

2"110$©"Copyright"2010/2014"Cloudera."All"rights"reserved."Not"to"be"reproduced"without"prior"wri>en"consent."

The$following$offer$more$informaLon$on$topics$discussed$in$this$chapter

! Full$Cloduera$documentaLon$available$at$– http://cloudera.com/

! Reference$Hadoop:$The$DefiniLve$Guide,$third$ediLon$(TDG$3e)$for$details$on$job$submission$"$Chapter$6,$“How$MapReduce$Works”,$starLng$on$page$189.$

Bibliography"