MapReduce Runtime Environments: Design, Performance, Optimizations



DESCRIPTION

II Escola Regional de Alto Desempenho - Região Nordeste, Salvador da Bahia, Brazil, October 22, 2013. Also presented at the seminar "Datenverarbeitung mit Map-Reduce" (Data Processing with MapReduce), University of Heidelberg (15/05/2012).

TRANSCRIPT

Page 1: Mapreduce Runtime Environments: Design, Performance, Optimizations

MapReduce Runtime Environments: Design, Performance, Optimizations

Gilles Fedak (many figures imported from the authors' works)

[email protected], INRIA/University of Lyon, France

Escola Regional de Alto Desempenho - Região Nordeste

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 1 / 84

Page 2: Mapreduce Runtime Environments: Design, Performance, Optimizations

Outline of Topics

1 What is MapReduce?
  • The Programming Model
  • Applications and Execution Runtimes

2 Design and Principles of MapReduce Execution Runtime
  • File Systems
  • MapReduce Execution
  • Key Features of MapReduce Runtime Environments

3 Performance Improvements and Optimizations
  • Handling Heterogeneity and Stragglers
  • Extending MapReduce Application Range

4 MapReduce beyond the Data Center
  • Background on BitDew
  • MapReduce for Desktop Grid Computing

5 Conclusion

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 2 / 84

Page 3: Mapreduce Runtime Environments: Design, Performance, Optimizations

Section 1

What is MapReduce?

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 3 / 84

Page 4: Mapreduce Runtime Environments: Design, Performance, Optimizations

What is MapReduce?

• Programming model for data-intensive applications
• Proposed by Google in 2004
• Simple, inspired by functional programming
  • the programmer simply defines Map and Reduce tasks
• Building block for other parallel programming tools
• Strong open-source implementation: Hadoop
• Highly scalable
  • accommodates large-scale clusters with faulty and unreliable resources

MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, in OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 4 / 84

Page 5: Mapreduce Runtime Environments: Design, Performance, Optimizations

Subsection 1

The Programming Model

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 5 / 84

Page 6: Mapreduce Runtime Environments: Design, Performance, Optimizations

MapReduce in Lisp

• (map f (list l1 ... ln))
• (map square '(1 2 3 4))
• → (1 4 9 16)

• (+ 1 4 9 16)
• (+ 1 (+ 4 (+ 9 16)))
• → 30

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 6 / 84

Page 7: Mapreduce Runtime Environments: Design, Performance, Optimizations

MapReduce a la Google

Input → Mapper → Group by key → Reducer → Output

Map(k, v) → (k', v')
Group (k', v') pairs by k'
Reduce(k', {v'_1, ..., v'_n}) → v''

• Map(key, value) is run on each item of the input data set
  • emits new (key, value) pairs
• Reduce(key, values) is run for each unique key emitted by the Map
  • emits the final output

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 7 / 84

Page 8: Mapreduce Runtime Environments: Design, Performance, Optimizations

Example: Word Count

• Input: a large number of text documents
• Task: compute the word count across all the documents
• Solution:
  • Mapper: for every word in a document, emit (word, 1)
  • Reducer: sum all occurrences of a word and emit (word, total_count)

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 8 / 84

Page 9: Mapreduce Runtime Environments: Design, Performance, Optimizations

WordCount Example

// Pseudo-code for the Map function
map(String key, String value):
  // key: document name
  // value: document contents
  foreach word in split(value):
    EmitIntermediate(word, "1");

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int word_count = 0;
  foreach v in values:
    word_count += ParseInt(v);
  Emit(key, AsString(word_count));
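The same computation in runnable form, as a minimal Python sketch (hypothetical function names; the grouping step that the runtime performs between Map and Reduce is simulated here with a dictionary):

from collections import defaultdict

def map_fn(document_name, contents):
    # Map: emit (word, "1") for every word in the document.
    return [(word, "1") for word in contents.split()]

def reduce_fn(word, counts):
    # Reduce: sum all counts emitted for one word.
    return word, str(sum(int(c) for c in counts))

# Simulate the grouping normally done by the runtime.
documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}
groups = defaultdict(list)
for name, text in documents.items():
    for word, one in map_fn(name, text):
        groups[word].append(one)

for word in sorted(groups):
    print(reduce_fn(word, groups[word]))   # ('brown', '1') ... ('the', '2')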

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 9 / 84

Page 10: Mapreduce Runtime Environments: Design, Performance, Optimizations

Subsection 2

Applications and Execution Runtimes

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 10 / 84

Page 11: Mapreduce Runtime Environments: Design, Performance, Optimizations

MapReduce Applications Skeleton

• Read a large data set
• Map: extract relevant information from each data item
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Store the result
• Iterate if necessary

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 11 / 84

Page 12: Mapreduce Runtime Environments: Design, Performance, Optimizations

MapReduce Usage at Google

Jerry Zhao (MapReduce Group Tech lead @ Google)

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 12 / 84

Page 13: Mapreduce Runtime Environments: Design, Performance, Optimizations

MapReduce + Distributed Storage

• Powerful when associated with a distributed storage system
• GFS (Google File System), HDFS (Hadoop File System)


G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 13 / 84

Page 14: Mapreduce Runtime Environments: Design, Performance, Optimizations

The Execution Runtime

• The MapReduce runtime transparently handles:
  • parallelization
  • communications
  • load balancing
  • I/O: network and disk
  • machine failures
  • stragglers

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 14 / 84

Page 15: Mapreduce Runtime Environments: Design, Performance, Optimizations

MapReduce Compared

Jerry Zhao

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 15 / 84

Page 16: Mapreduce Runtime Environments: Design, Performance, Optimizations

Existing MapReduce Systems

• Google MapReduce
  • proprietary and closed source
  • closely linked to GFS
  • implemented in C++ with Python and Java bindings
• Hadoop
  • open source (Apache project)
  • Cascading, Java, Ruby, Perl, Python, PHP, R, C++ bindings (and certainly more)
  • many side projects: HDFS, Pig, Hive, HBase, ZooKeeper
  • used by Yahoo!, Facebook, Amazon and the Google-IBM research cloud


G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 16 / 84

Page 17: Mapreduce Runtime Environments: Design, Performance, Optimizations

Other Specialized Runtime Environments

• Hardware platforms
  • multicore (Phoenix)
  • FPGA (FPMR)
  • GPU (Mars)
• Security
  • data privacy (Airavat)
• MapReduce and virtualization
  • Amazon EC2
  • CAM
• Frameworks based on MapReduce
  • iterative MapReduce (Twister)
  • incremental MapReduce (Nephele, Online MR)
• Languages (Pig Latin)
• Log analysis: In situ MR

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 17 / 84

Page 18: Mapreduce Runtime Environments: Design, Performance, Optimizations

Hadoop Ecosystem

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 18 / 84

Page 19: Mapreduce Runtime Environments: Design, Performance, Optimizations

Research Areas to Improve MapReduce Performance

• File system
  • higher throughput, reliability, replication efficiency
• Scheduling
  • data/task affinity, scheduling multiple MR applications, hybrid distributed computing infrastructures
• Failure detection
  • failure characterization, failure detectors and prediction
• Security
  • data privacy, distributed result/error checking
• Greener MapReduce
  • energy efficiency, renewable energy

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 19 / 84

Page 20: Mapreduce Runtime Environments: Design, Performance, Optimizations

Research Areas to Improve MapReduce Applicability

• Scientific computing
  • iterative MapReduce, numerical processing, algorithmics
• Parallel stream processing
  • language extensions, runtime optimization for parallel stream processing, incremental processing
• MapReduce on widely distributed data sets
  • MapReduce on embedded devices (sensors), data integration

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 20 / 84

Page 21: Mapreduce Runtime Environments: Design, Performance, Optimizations

CloudBlast: MapReduce and Virtualization for BLAST Computing

NCBI BLAST
• The Basic Local Alignment Search Tool (BLAST)
  • finds regions of local similarity between sequences
  • compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches

Hadoop + BLAST
• the stream of sequences is split by Hadoop
• each Mapper compares its sequences to the gene database
• HDFS is used to merge the results
• the database is not chunked
• provides fault tolerance and load balancing

[Figure from the CloudBLAST paper: the input sequence file (>seq1 AACCGGTT, >seq2 CCAAGGTT, ...) is split into chunks on newline boundaries; StreamInputFormat/StreamPatternRecordReader turns each chunk into individual sequences used as mapper keys (with empty values); each Mapper runs BLAST against the database and writes its output to a local file (part-00000, part-00001); a DFS merger combines the local outputs into a single result.]

CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications. A. Matsunaga, M. Tsugawa, and J. Fortes. eScience'08, 2008.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 21 / 84

Page 22: Mapreduce Runtime Environments: Design, Performance, Optimizations

CloudBlast Deployment and Performance

• ViNe is used to deploy Hadoop across several clusters (24 Xen-based VMs at Univ. of Florida + 12 Xen-based VMs at Univ. of California)
• Each VM contains Hadoop + BLAST + the database (3.5GB)
• Compared with mpiBLAST

[Embedded figures from the CloudBLAST paper: the experimental setup (Xen VMs at the two sites connected through ViNe to run Hadoop- and MPI-based BLAST) and the comparison of BLAST speed-ups on physical and virtual machines, which shows negligible virtualization overhead and a small advantage of CloudBLAST over mpiBLAST.]

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 22 / 84

Page 23: Mapreduce Runtime Environments: Design, Performance, Optimizations

Section 2

Design and Principles of MapReduce Execution Runtime

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 23 / 84

Page 24: Mapreduce Runtime Environments: Design, Performance, Optimizations

Subsection 1

File Systems

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 24 / 84

Page 25: Mapreduce Runtime Environments: Design, Performance, Optimizations

File System for Data-intensive Computing

MapReduce is associated with parallel file systems

• GFS (Google File System) and HDFS (Hadoop File System)
• Parallel, scalable, fault-tolerant file systems
• Based on inexpensive commodity hardware (failures are expected)
• Allow mixing storage and computation

GFS/HDFS specificities
• Many files, large files (> x GB)
• Two types of reads:
  • large streaming reads (MapReduce)
  • small random reads (usually batched together)
• Write once (rarely modified), read many
• High sustained throughput (favored over latency)
• Consistent concurrent execution

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 25 / 84

Page 26: Mapreduce Runtime Environments: Design, Performance, Optimizations

Example: the Hadoop Architecture

[Figure: the Hadoop architecture. Data analytics applications run on top of a distributed computing layer (MapReduce) and a distributed storage layer (HDFS); the masters are the JobTracker and the NameNode, and each slave node runs a TaskTracker and a DataNode.]

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 26 / 84

Page 27: Mapreduce Runtime Environments: Design, Performance, Optimizations

HDFS Main Principles

HDFS Roles
• Application
  • splits a file into chunks (the more chunks, the more parallelism)
  • asks HDFS to store the chunks
• Name Node
  • keeps track of file metadata (the list of chunks per file)
  • ensures the distribution and replication (3 copies) of chunks on Data Nodes (see the sketch below)
• Data Node
  • stores file chunks
  • replicates file chunks

[Figure: the Application asks the NameNode to write chunks A, B, C of File.txt; the NameNode answers "Ok, use Data Nodes 2, 4, 6", and the chunks are stored on those Data Nodes.]
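A minimal Python sketch of the roles above (hypothetical ToyNameNode class; fixed 64MB chunks, replication factor 3, and naive round-robin placement rather than HDFS's real placement policy):

CHUNK_SIZE = 64 * 1024 * 1024   # 64MB, as in GFS/HDFS
REPLICATION = 3

class ToyNameNode:
    # Keeps only metadata: which chunks make up a file, and where each chunk lives.
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes          # e.g. ["DN1", ..., "DN7"]
        self.metadata = {}                    # filename -> [(chunk id, [data nodes])]
        self._next = 0

    def add_file(self, filename, size_bytes):
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))   # ceiling division
        placement = []
        for i in range(n_chunks):
            # Naive round-robin choice of 3 distinct data nodes (assumption,
            # not the real HDFS policy).
            nodes = [self.data_nodes[(self._next + k) % len(self.data_nodes)]
                     for k in range(REPLICATION)]
            self._next += 1
            placement.append((f"{filename}#chunk{i}", nodes))
        self.metadata[filename] = placement
        return placement

nn = ToyNameNode([f"DN{i}" for i in range(1, 8)])
print(nn.add_file("File.txt", 150 * 1024 * 1024))   # 3 chunks, 3 replicas each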

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 27 / 84

Page 28: Mapreduce Runtime Environments: Design, Performance, Optimizations

Data Locality in HDFS: Rack Awareness

• The Name Node groups Data Nodes into racks
    rack 1: DN1, DN2, DN3, DN4
    rack 2: DN5, DN6, DN7, DN8
• Assumption: network performance differs between in-rack and inter-rack communication
• Replication: keep 2 replicas in-rack and 1 in another rack (sketched below)
    A: DN1, DN3, DN5
    B: DN2, DN4, DN7
• Minimizes inter-rack communication
• Tolerates a full rack failure

[Figure: the replicas of blocks A and B spread over the Data Nodes of the two racks.]
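A possible sketch of the rack-aware replica choice described above, assuming a hypothetical choose_replica_nodes helper and random selection inside each rack (the real HDFS placement policy also considers the writer's location and node load):

import random

def choose_replica_nodes(racks, n_replicas=3):
    # Pick n_replicas data nodes for one chunk: two in one rack, the rest in another.
    # racks maps rack id -> list of data node ids.
    primary, secondary = random.sample(list(racks), 2)
    in_rack = random.sample(racks[primary], min(2, n_replicas))
    off_rack = random.sample(racks[secondary], n_replicas - len(in_rack))
    return in_rack + off_rack

racks = {"rack1": ["DN1", "DN2", "DN3", "DN4"],
         "rack2": ["DN5", "DN6", "DN7", "DN8"]}
print(choose_replica_nodes(racks))   # e.g. ['DN1', 'DN3', 'DN5']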

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 28 / 84

Page 29: Mapreduce Runtime Environments: Design, Performance, Optimizations

Writing to HDFS

Client → Name Node: write File.txt in directory d
Name Node → Client: Writeable
Client → Name Node: AddBlock A
Name Node → Client: use Data Nodes 1, 3, 5 (IP, port, rack number)
Client → Data Node 1: initiate TCP connection (next: Data Nodes 3, 5)
Data Node 1 → Data Node 3: initiate TCP connection (next: Data Node 5)
Data Node 3 → Data Node 5: initiate TCP connection
Data Node 5 → 3 → 1 → Client: Pipeline Ready
(repeated for each data chunk)

• The Name Node checks permissions and selects a list of Data Nodes
• A pipeline is opened from the Client through the selected Data Nodes

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 29 / 84

Page 30: Mapreduce Runtime Environments: Design, Performance, Optimizations

Pipeline Write in HDFS

Client → Data Node 1: Send Block A; Data Node 1 → Name Node: Block Received A
Data Node 1 → Data Node 3: Send Block A; Data Node 3 → Name Node: Block Received A
Data Node 3 → Data Node 5: Send Block A; Data Node 5 → Name Node: Block Received A
Data Node 5 → Data Node 3: Success, close TCP connection
Data Node 3 → Data Node 1: Success, close TCP connection
Data Node 1 → Client: Success, close TCP connection
Client → Name Node: Success
(repeated for each data chunk)

• Data Nodes send and receive chunks concurrently
• Blocks are transferred one by one
• The Name Node updates its metadata table:
    A: DN1, DN3, DN5
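A toy simulation of the pipelined write, with plain function calls standing in for the TCP connections (hypothetical pipeline_write helper; a real Data Node forwards packets while it is still receiving them):

def pipeline_write(block, pipeline):
    # Forward one block down the chain of data nodes; acks come back upstream.
    # pipeline: ordered list of data node ids, e.g. ["DN1", "DN3", "DN5"];
    # storage stands in for each node's local disk.
    storage = {}
    def send(nodes):
        head, rest = nodes[0], nodes[1:]
        storage[head] = block                 # node stores its copy
        if rest:                              # then forwards downstream
            send(rest)
        print(f"{head}: Block Received")      # ack travels back up the pipeline
    send(pipeline)
    return storage

pipeline_write(b"chunk A bytes", ["DN1", "DN3", "DN5"])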

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 30 / 84

Page 31: Mapreduce Runtime Environments: Design, Performance, Optimizations

Fault Resilience and Replication

NameNode
• Ensures fault detection
  • a heartbeat is sent by each DataNode to the NameNode every 3 seconds
  • after 10 successful heartbeats: a Block Report
  • the metadata is updated
• When a fault is detected
  • the NameNode lists, from the metadata table, the chunks that went missing
  • the NameNode instructs the remaining Data Nodes to re-replicate the chunks that were on the faulty node

Backup NameNode
• updated every hour
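A sketch of how missing replicas could be detected from heartbeat timestamps and the metadata table (hypothetical names; the dead-node timeout is an assumption, only the 3-second heartbeat interval comes from the slide):

import time

HEARTBEAT_INTERVAL = 3       # seconds, as on the slide
FAILURE_TIMEOUT = 10 * 60    # assumption: a DataNode is considered dead after ~10 minutes of silence

def find_under_replicated(metadata, last_heartbeat, now, replication=3):
    # Return chunks that lost a replica because their DataNode stopped heartbeating.
    dead = {dn for dn, t in last_heartbeat.items() if now - t > FAILURE_TIMEOUT}
    to_replicate = []
    for chunk, nodes in metadata.items():
        alive = [dn for dn in nodes if dn not in dead]
        if len(alive) < replication:
            to_replicate.append((chunk, alive, replication - len(alive)))
    return to_replicate

now = time.time()
metadata = {"File.txt#chunk0": ["DN1", "DN3", "DN5"]}
last_heartbeat = {"DN1": now, "DN3": now - 2, "DN5": now - 3600}   # DN5 silent for an hour
print(find_under_replicated(metadata, last_heartbeat, now))
# [('File.txt#chunk0', ['DN1', 'DN3'], 1)]  -> ask DN1 or DN3 to copy the chunk elsewhere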

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 31 / 84

Page 32: Mapreduce Runtime Environments: Design, Performance, Optimizations

Subsection 2

MapReduce Execution

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 32 / 84

Page 33: Mapreduce Runtime Environments: Design, Performance, Optimizations

Anatomy of a MapReduce Execution

[Figure: MapReduce execution pipeline. Input files are split, processed by Mappers (optionally followed by a Combine step), producing intermediate results that are partitioned and shuffled to Reducers, which write the output files (Output1, Output2, Output3) forming the final output.]

• Distribution:
  • split the input file into chunks
  • distribute the chunks to the DFS

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84

Page 34: Mapreduce Runtime Environments: Design, Performance, Optimizations

Anatomy of a MapReduce Execution

[Same execution pipeline figure, Map stage highlighted.]

• Map:
  • Mappers execute the Map function on each chunk
  • they compute the intermediate results (i.e., list(k, v))
  • optionally, a Combine function is executed to reduce the size of the intermediate results

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84

Page 35: Mapreduce Runtime Environments: Design, Performance, Optimizations

Anatomy of a MapReduce Execution

[Same execution pipeline figure, Partition stage highlighted.]

• Partition:
  • intermediate results are sorted
  • intermediate results are partitioned; each partition corresponds to a reducer
  • possibly with a user-provided partition function

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84

Page 36: Mapreduce Runtime Environments: Design, Performance, Optimizations

Anatomy of a MapReduce Execution

[Same execution pipeline figure, Shuffle stage highlighted.]

• Shuffle:
  • references to the intermediate result partitions are passed to their corresponding reducer
  • Reducers access the intermediate result partitions using remote-read RPCs

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84

Page 37: Mapreduce Runtime Environments: Design, Performance, Optimizations

Anatomy of a MapReduce Execution

[Same execution pipeline figure, Reduce stage highlighted.]

• Reduce:
  • intermediate results are sorted and grouped by their intermediate keys
  • Reducers execute the Reduce function on the set of intermediate results they have received
  • Reducers write the output results

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84

Page 38: Mapreduce Runtime Environments: Design, Performance, Optimizations

Anatomy of a MapReduce Execution

[Same execution pipeline figure, final Combine stage highlighted.]

• Combine:
  • the output of the MapReduce computation is available as R output files
  • optionally, the outputs are combined into a single result or passed as a parameter to another MapReduce computation
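The whole pipeline described on the last slides, condensed into a single-process Python sketch (hypothetical run_mapreduce driver; a hash partitioner stands in for the user-provided partition function, and the shuffle is a simple in-memory copy):

from collections import defaultdict

def run_mapreduce(chunks, map_fn, reduce_fn, n_reducers=2, combine_fn=None):
    # Map (+ optional Combine): one list of (key, value) pairs per chunk
    map_outputs = []
    for chunk in chunks:
        pairs = map_fn(chunk)
        if combine_fn:
            local = defaultdict(list)
            for k, v in pairs:
                local[k].append(v)
            pairs = [(k, combine_fn(k, vs)) for k, vs in local.items()]
        map_outputs.append(pairs)

    # Partition + Shuffle: route each key to one reducer by hash
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for pairs in map_outputs:
        for k, v in pairs:
            partitions[hash(k) % n_reducers][k].append(v)

    # Sort + Reduce: each reducer processes its keys in order
    return [{k: reduce_fn(k, vs) for k, vs in sorted(part.items())}
            for part in partitions]

# Word count on two input chunks
chunks = ["the quick brown fox", "the lazy dog and the cat"]
word_map = lambda text: [(w, 1) for w in text.split()]
word_reduce = lambda k, vs: sum(vs)
print(run_mapreduce(chunks, word_map, word_reduce, combine_fn=word_reduce))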

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 33 / 84

Page 39: Mapreduce Runtime Environments: Design, Performance, Optimizations

Return to WordCount: Application-Level Optimization

The Combiner Optimization

• How can we decrease the amount of information exchanged between mappers and reducers?
→ Combiner: apply the reduce function to the map output before it is sent to the reducer

// Pseudo-code for the Combine function
combine(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int partial_word_count = 0;
  foreach v in values:
    partial_word_count += ParseInt(v);
  Emit(key, AsString(partial_word_count));
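A small illustration of why the combiner matters, assuming a hypothetical combine helper: applying the reduce logic locally collapses one mapper's repeated keys before they cross the network.

from collections import defaultdict

def combine(pairs):
    # Apply the reduce logic locally to one mapper's output before it is shuffled.
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += int(count)
    return [(w, str(c)) for w, c in partial.items()]

# A map task that saw "data" 1000 times emits 1000 ("data", "1") pairs;
# after combining, a single ("data", "1000") pair crosses the network.
map_output = [("data", "1")] * 1000 + [("grid", "1")] * 10
print(len(map_output), "->", len(combine(map_output)))   # 1010 -> 2
print(combine(map_output))                               # [('data', '1000'), ('grid', '10')]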

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 34 / 84

Page 40: Mapreduce Runtime Environments: Design, Performance, Optimizations

Let’s go further: Word Frequency

• Input: a large number of text documents
• Task: compute the word frequency across all the documents
  → the frequency is computed using the total word count
• A naive solution chains two MapReduce programs:
  • MR1 counts the total number of words (using combiners)
  • MR2 counts the occurrences of each word and divides them by the total number of words provided by MR1
• How to turn this into a single MR computation?
  1 Property: reduce keys are processed in sorted order
  2 Function EmitToAllReducers(k, v)
  • Map tasks send their local total word count to ALL the reducers with the empty key ""
  • the key "" is the first key processed by each reducer
  • the sum of its values gives the total number of words

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 35 / 84

Page 41: Mapreduce Runtime Environments: Design, Performance, Optimizations

Word Frequency Mapper and Reducer

// Pseudo-code for the Map function
map(String key, String value):
  // key: document name, value: document contents
  int word_count = 0;
  foreach word in split(value):
    EmitIntermediate(word, "1");
    word_count++;
  EmitIntermediateToAllReducers("", AsString(word_count));

// Pseudo-code for the Reduce function
reduce(String key, Iterator values):
  // key: a word, values: a list of counts
  if (key == ""):
    int total_word_count = 0;
    foreach v in values:
      total_word_count += ParseInt(v);
  else:
    int word_count = 0;
    foreach v in values:
      word_count += ParseInt(v);
    Emit(key, AsString(word_count / total_word_count));

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 36 / 84

Page 42: Mapreduce Runtime Environments: Design, Performance, Optimizations

Subsection 3

Key Features of MapReduce Runtime Environments

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 37 / 84

Page 43: Mapreduce Runtime Environments: Design, Performance, Optimizations

Handling Failures

On very large clusters made of commodity servers, failures are not a rare event.

Fault-tolerant Execution
• Workers' failures are detected by heartbeat
• In case of a machine crash:
  1 in-progress Map and Reduce tasks are marked as idle and re-scheduled to another alive machine
  2 completed Map tasks are also rescheduled, since their output resides on the failed machine's local disk
  3 all reducers are warned of the Mapper's failure
• Result replica "disambiguation": only the first result is considered
• A simple fault-tolerance scheme, yet it handles the failure of a large number of nodes

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 38 / 84

Page 44: Mapreduce Runtime Environments: Design, Performance, Optimizations

Scheduling

Data-Driven Scheduling
• Data-chunk replication
  • GFS divides files into 64MB chunks and stores 3 replicas of each chunk on different machines
  • a Map task is scheduled first to a machine that hosts a replica of its input chunk
  • second, to the nearest machine, i.e. one on the same network switch (see the sketch after this list)
• Over-partitioning of the input file
  • #input chunks >> #worker machines
  • #mappers > #reducers

Speculative Execution
• Backup tasks alleviate the slowdown caused by stragglers:
  • straggler: a machine that takes an unusually long time to complete one of the few last Map or Reduce tasks of a computation
  • backup task: at the end of the computation, the scheduler replicates the remaining in-progress tasks
  • the result is provided by the first task that completes
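A sketch of the data-locality preference (replica host, then same rack, then any idle node), with hypothetical names and data structures:

def pick_node_for_map_task(chunk_replicas, idle_nodes, rack_of):
    # Prefer a node holding a replica of the input chunk, then a node in the
    # same rack as a replica, then any idle node.
    replica_holders = [n for n in idle_nodes if n in chunk_replicas]
    if replica_holders:
        return replica_holders[0], "data-local"
    replica_racks = {rack_of[r] for r in chunk_replicas}
    rack_local = [n for n in idle_nodes if rack_of[n] in replica_racks]
    if rack_local:
        return rack_local[0], "rack-local"
    return idle_nodes[0], "remote"

rack_of = {"DN1": "r1", "DN2": "r1", "DN5": "r2", "DN6": "r2"}
print(pick_node_for_map_task({"DN1", "DN5"}, ["DN2", "DN6"], rack_of))  # ('DN2', 'rack-local')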

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 39 / 84

Page 45: Mapreduce Runtime Environments: Design, Performance, Optimizations

Speculative Execution in Hadoop

• When a node is idle, the scheduler selects a task:
  1 failed tasks have the highest priority
  2 then non-running tasks with data local to the node
  3 then speculative tasks (i.e., backup tasks)
• Speculative task selection relies on a progress score between 0 and 1
  • Map task: the progress score is the fraction of the input read
  • Reduce task:
    • considers 3 phases: the copy phase, the sort phase, the reduce phase
    • in each phase, the score is the fraction of the data processed; example: a task halfway through the reduce phase has score 1/3 + 1/3 + (1/3 * 1/2) = 5/6
• straggler: a task whose progress score is more than 0.2 below the average progress score
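The progress-score and straggler rules above, written out as a small Python sketch (hypothetical task dictionaries; Hadoop's actual bookkeeping is per-job and more involved):

def progress_score(task):
    # Fraction of input read for a map task; for a reduce task, 1/3 per
    # completed phase plus the fraction of the current phase.
    if task["type"] == "map":
        return task["fraction_input_read"]
    return (task["phases_done"] + task["fraction_current_phase"]) / 3.0

def hadoop_stragglers(tasks, threshold=0.2):
    # A task is speculated when its score is more than `threshold` below the average.
    scores = [progress_score(t) for t in tasks]
    avg = sum(scores) / len(scores)
    return [t["id"] for t, s in zip(tasks, scores) if s < avg - threshold]

tasks = [
    {"id": "r1", "type": "reduce", "phases_done": 1, "fraction_current_phase": 0.5},  # score 0.5
    {"id": "r2", "type": "reduce", "phases_done": 2, "fraction_current_phase": 0.5},  # score 5/6
    {"id": "m1", "type": "map", "fraction_input_read": 0.9},                          # score 0.9
]
print(hadoop_stragglers(tasks))   # ['r1'] : more than 0.2 below the ~0.74 average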

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 40 / 84

Page 46: Mapreduce Runtime Environments: Design, Performance, Optimizations

Section 3

Performance Improvements and Optimizations

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 41 / 84

Page 47: Mapreduce Runtime Environments: Design, Performance, Optimizations

Subsection 1

Handling Heterogeneity and Stragglers

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 42 / 84

Page 48: Mapreduce Runtime Environments: Design, Performance, Optimizations

Issues with the Hadoop Scheduler in Heterogeneous Environments

Hadoop makes several assumptions:
• nodes are homogeneous
• tasks are homogeneous

which lead to bad decisions:
• too many backups, thrashing shared resources like network bandwidth
• the wrong tasks are backed up
• backups may be placed on slow nodes
• the heuristic breaks when tasks start at different times

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 43 / 84

Page 49: Mapreduce Runtime Environments: Design, Performance, Optimizations

The LATE Scheduler (Longest Approximate Time to End)

• Estimate task completion times and back up the task with the Longest Approximate Time to End
• Throttle backup execution:
  • limit the number of simultaneous backup tasks (SpeculativeCap)
  • do not schedule backup tasks on the slowest nodes (SlowNodeThreshold)
  • only back up tasks that are sufficiently slow (SlowTaskThreshold)
• Estimating task completion time:
  • progress rate = progress score / execution time
  • estimated time left = (1 - progress score) / progress rate

In practice, SpeculativeCap = 10% of available task slots; SlowNodeThreshold and SlowTaskThreshold = 25th percentile of node progress and task progress rate.
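A sketch of the LATE estimate and of picking the backup candidate with the longest approximate time to end (hypothetical task records; the real scheduler also enforces SpeculativeCap and SlowNodeThreshold):

def estimated_time_left(progress_score, elapsed):
    # progress_rate = score / elapsed; time_left = (1 - score) / rate
    progress_rate = progress_score / elapsed
    return (1.0 - progress_score) / progress_rate

def pick_backup_candidate(tasks, slow_task_threshold_rate):
    # Among sufficiently slow tasks, back up the one with the longest time to end.
    slow = [t for t in tasks if t["score"] / t["elapsed"] < slow_task_threshold_rate]
    return max(slow, key=lambda t: estimated_time_left(t["score"], t["elapsed"]), default=None)

# The two candidate tasks from the example on the following slides (times in minutes):
tasks = [{"id": "node2", "score": 0.66, "elapsed": 2.0},
         {"id": "node3", "score": 0.053, "elapsed": 0.1}]
print(pick_backup_candidate(tasks, slow_task_threshold_rate=0.6)["id"])   # node3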

Improving MapReduce Performance in Heterogeneous Environments. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R. and Stoica, I., in OSDI'08: Eighth Symposium on Operating Systems Design and Implementation, San Diego, CA, December 2008.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 44 / 84

Page 50: Mapreduce Runtime Environments: Design, Performance, Optimizations

Progress Rate Example

[Figure: three nodes running tasks over 2 minutes. Node 1 processes 1 task/min, Node 2 is 3x slower, Node 3 is 1.9x slower.]

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 45 / 84

Page 51: Mapreduce Runtime Environments: Design, Performance, Optimizations

LATE Example

[Figure: the same three nodes, observed at the 2-minute mark.]

• Node 2: progress = 66%, estimated time left = (1 - 0.66) / (1/3) = 1 min
• Node 3: progress = 5.3%, estimated time left = (1 - 0.05) / (1/1.9) = 1.8 min

LATE correctly picks Node 3.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 46 / 84

Page 52: Mapreduce Runtime Environments: Design, Performance, Optimizations

LATE Performance (EC2 Sort benchmark)

Heterogeneity experiment (243 VMs on 99 hosts, 1-7 VMs per host)
• Each host sorted 128MB, for a total of 30GB of data
• [Figure: worst, best and average EC2 Sort running times, normalized to no backups, for no backups, Hadoop's native scheduler, and the LATE scheduler.]
• Results: average 27% speedup over Hadoop's native scheduler, 31% over no backups

Stragglers experiment (4 stragglers emulated by running dd)
• Each node sorted 256MB, for a total of 25GB of data
• Stragglers created with 4 CPU-intensive (800KB array sort) and 4 disk-intensive (dd) processes
• [Figure: worst, best and average EC2 Sort running times, normalized to no backups.]
• Results: average 58% speedup over Hadoop's native scheduler, 220% over no backups

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 47 / 84

Page 53: Mapreduce Runtime Environments: Design, Performance, Optimizations

Understanding Outliers

• Stragglers: tasks that take ≥ 1.5x the median task duration in their phase
• Re-computes: when a task's output is lost, dependent tasks must wait until the output is regenerated

[Embedded excerpt from the Mantri paper, "Quantifying the Outlier Problem": it characterizes the prevalence of outliers and recomputes, how much longer outliers run than the median task, and their causes; when a task's runtime is explained by the amount of data it processes or moves across the network, the outlier is attributed to workload imbalance or poor placement, otherwise to resource contention or problematic machines; data skew and cross-rack traffic are analyzed in detail.]

Causes of outliers
• Data skew: a few keys correspond to a lot of data
• Network contention: 70% of the cross-rack traffic is induced by reduce tasks
• Bad and busy machines: half of the recomputes happen on 5% of the machines

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 48 / 84

Page 54: Mapreduce Runtime Environments: Design, Performance, Optimizations

The Mantri Scheduler

• cause- and resource-aware scheduling
• real-time monitoring of the tasks: for each task, Mantri estimates the remaining time to finish, t_rem, and the predicted completion time of a new copy of the task, t_new

[Embedded excerpt from the Mantri paper: the outlier causes are matched with targeted solutions (restart or duplicate tasks that straggle due to contention, network-aware placement to avoid low-bandwidth cross-rack links, replicating or pre-computing output to avoid recomputation, and starting the largest tasks first). Tasks report progress periodically, and Mantri continuously monitors running copies and kills those whose cost exceeds their benefit.]

Reining in the Outliers in Map-Reduce Clusters using Mantri. G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, E. Harris, in OSDI'10: Ninth Symposium on Operating Systems Design and Implementation, Vancouver, Canada, December 2010.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 49 / 84

Page 55: Mapreduce Runtime Environments: Design, Performance, Optimizations

Resource-Aware Restart

• Restart outlier tasks on better machines
• Consider two options: kill & restart, or duplicate

!"

#"

$"

%""

!"#$%&'(()*&+!"#,*-"./0%*+

123#4#5"

6#78.*

9

:9

9 :9 ;9 <9 =9 >99

!"#,*-"./0%*+!*8%?*&

@A*')#,*ABC#D4E#8B#1"./)*%8"B#78.*

123#4#5"

6#

DF;CG4E

Figure !: Percentage speed-up of job completion time in theideal case when (some combination of) outliers do not occur.

!"#$%#$&"#'(")')%*"+),%*

-.$/*'/.0%'1&((2',.3.,&$4

• 1+35&,.$%• 6&557')%*$.)$

#%$8")6'.8.)%'35.,%9%#$

• )%35&,.$%'"+$3+$• 3)%:,"93+$%

*$.)$'$.*6*'$/.$'1"'9")%'(&)*$

;#%<+.5'8")6'1&0&*&"#

%=%,+$%>.*6'?3%).$&"#* )%.1'&#3+$

?+$5&%)!.+*%*

@"5+$&"#*

A#3+$''B%,"9%*'+#.0.&5.B5%

Figure ": #e Outlier Problem: Causes and Solutions

By inducing high variability in repeat runs of the samejob, outliers make it hard to meet SLAs. At median, theratio of stdev

mean in job completion time is $.", i.e., jobs have anon-trivial probability of taking twice as long or %nishinghalf as quickly.To summarize, we take the following lessons from our

experience.• High running times of tasks do not necessarily indicateslow execution - there are multiple reasons for legiti-mate variation in durations of tasks.

• Every job is guaranteed some slots, as determined bycluster policy, but can use idle slots of other jobs.Hence, judicious usage of resources while mitigatingoutliers has collateral bene%t.

• Recomputations a&ect jobs disproportionately. #eymanifest in select faulty machines and during times ofheavy resource usage. Nonetheless, there are no indi-cations of faulty racks.

! Mantri Design

Mantri identi%es points at which tasks are unable to makeprogress at the normal rate and implements targeted solu-tions. #e guiding principles that distinguishMantri fromprior outlier mitigation schemes are cause awareness andresource cognizance.Distinct actions are required for di&erent causes. Fig-

ure " speci%es the actions Mantri takes for each cause. If atask straggles due to contention for resources on the ma-chine, restarting or duplicating it elsewhere can speed itup (§'.(). However, not moving data over the low band-width cross rack links, and if unavoidable, doing so whileavoiding hotspots requires systematic placement (§'.)).To speed up tasks that wait for lost input to be recom-puted, we %nd ways to protect task output (§'.*). Finally,for tasks with a work imbalance, we schedule the large

!"#$%&%'%

!% !%!%!%'!% !%

'!%

!"#$%&%'%

!% !%!%

!%'!%!% '!%

!"#$%

()*!(%&%'%

!%+!%

!%!%

'!% !%'!%,-($)".$/%

0"))/%1$(!-1!%

234)"5-!$%

67*%$-1)8%

Figure +: A stylized example to illustrate our main ideas. Tasksthat are eventually killed are %lled with stripes, repeat instancesof a task are %lled with a lighter mesh.

There is a subtle point in outlier mitigation: reducing the completion time of a task may in fact increase the job completion time. For example, replicating the output of every task drastically reduces recomputations, since both copies are unlikely to be lost at the same time, but it can slow down the job because the extra time and bandwidth spent on that task deny resources to other tasks that are waiting to run. Similarly, addressing outliers early in a phase vacates slots for outstanding tasks and can speed up completion, but potentially uses more resources per task. Unlike Mantri, none of the existing approaches act early or replicate output; further, naively extending current schemes to act early without being cognizant of the cost of resources leads to worse performance, as the evaluation shows.

Closed-loop action allows Mantri to act optimistically by bounding the cost when probabilistic predictions go awry. For example, even when Mantri cannot ascertain the cause of an outlier, it experimentally starts copies. If the cause does not repeatedly impact the task, the copy can finish faster. To handle the contrary case, Mantri continuously monitors running copies and kills those whose cost exceeds the benefit.

Based on task progress reports, Mantri estimates for each task the remaining time to finish, trem, and the predicted completion time of a new copy of the task, tnew. Tasks report progress at a fixed period or ten times in their lifetime, whichever is smaller; we use ∆ to refer to this period. What matters is that trem be an accurate estimate and that the predicted distribution of tnew account for the underlying work the task still has to do, the appropriateness of the network location, and any persistent slowness of the new machine.

Resource-aware Restart

We begin with a simple example to help the exposition: the stylized figure above shows a phase that has seven tasks and two slots. Mantri applies the following rules:

• restart or duplicate if P(tnew < trem) is high
• restart if the remaining time is large: trem > E(tnew) + ∆
• duplicate if it decreases the total amount of resources used: P(trem > tnew · (c+1)/c) > δ, where c < 3 is the number of copies and δ = 0.25
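To make the rule concrete, a minimal sketch of this decision logic follows (class and method names, the sample-based estimate of tnew and the report period are illustrative assumptions, not Mantri's actual code):

    import java.util.List;

    // Sketch of Mantri-style resource-aware restart/duplication (illustrative assumptions).
    class OutlierMitigationSketch {
        static final double DELTA = 0.25;   // the delta threshold from the slide
        static final int MAX_COPIES = 3;    // c < 3: never run more than 3 copies

        // P(t_new * scale < t_rem), estimated from samples of the predicted completion time t_new
        static double probNewFinishesEarlier(List<Double> tNewSamples, double tRem, double scale) {
            long earlier = tNewSamples.stream().filter(t -> t * scale < tRem).count();
            return (double) earlier / tNewSamples.size();
        }

        // duplicate only if it likely decreases total resource usage: P(t_rem > t_new * (c+1)/c) > delta
        static boolean shouldDuplicate(List<Double> tNewSamples, double tRem, int runningCopies) {
            if (runningCopies < 1 || runningCopies >= MAX_COPIES) return false;
            double scale = (runningCopies + 1.0) / runningCopies;
            return probNewFinishesEarlier(tNewSamples, tRem, scale) > DELTA;
        }

        // kill and restart if the remaining time is large: t_rem > E(t_new) + reportPeriod
        static boolean shouldRestart(List<Double> tNewSamples, double tRem, double reportPeriod) {
            double expectedNew = tNewSamples.stream().mapToDouble(Double::doubleValue)
                                             .average().orElse(Double.POSITIVE_INFINITY);
            return tRem > expectedNew + reportPeriod;
        }
    }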

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 50 / 84

Page 56: Mapreduce Runtime Environments: Design, Performance, Optimizations

Network Aware Placement

• Schedule tasks so as to minimize network congestion
• Local approximation of the optimal placement:
• does not track bandwidth changes
• does not require global co-ordination

• Placement of a reduce phase with n tasks, running on a cluster with r racks; In,r is the input data matrix.
• Let d_u^i and d_v^i be the data to be uploaded and downloaded by rack i, and b_u^i and b_d^i the corresponding available bandwidths.
• The chosen placement is arg min( max_i( d_u^i / b_u^i , d_v^i / b_d^i ) ).
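A small illustrative sketch of the cost being minimized follows (the array-based representation and the enumeration of candidate placements are assumptions for illustration, not Mantri's code):

    // For one candidate placement, the bottleneck is max over racks i of max(d_u^i/b_u^i, d_v^i/b_d^i).
    class NetworkAwarePlacementSketch {
        static double placementCost(double[] dUp, double[] dDown, double[] bUp, double[] bDown) {
            double worst = 0.0;
            for (int i = 0; i < dUp.length; i++) {
                worst = Math.max(worst, Math.max(dUp[i] / bUp[i], dDown[i] / bDown[i]));
            }
            return worst;
        }

        // pick the candidate placement (each described by its per-rack upload/download volumes)
        // whose bottleneck transfer time is smallest
        static int argMinPlacement(double[][] dUp, double[][] dDown, double[] bUp, double[] bDown) {
            int best = 0;
            double bestCost = Double.MAX_VALUE;
            for (int p = 0; p < dUp.length; p++) {
                double cost = placementCost(dUp[p], dDown[p], bUp, bDown);
                if (cost < bestCost) { bestCost = cost; best = p; }
            }
            return best;
        }
    }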

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 51 / 84

Page 57: Mapreduce Runtime Environments: Design, Performance, Optimizations

Mantri Performances

• Speeds up the median job by 55%
• Network-aware data placement reduces the completion time of a typical reduce phase by 31%
• Network-aware placement of tasks speeds up half of the reduce phases by > 60%

Figure: Comparing Mantri's straggler mitigation with the baseline implementation on a production cluster of thousands of servers for the four representative jobs: (a) completion time, (b) resource usage.

Table: Percentage reduction in completion time (avg, min, max) at the phase level and at the job level, comparing Mantri's network-aware spread of tasks with the baseline implementation on a production cluster of thousands of servers.

to the cluster. Jobs that occupy the cluster for half the time sped up markedly. The resource-usage plot shows that most jobs see a reduction in resource consumption while the others use up a few extra resources. These gains are due to Mantri's ability to detect outliers early and accurately. The success rate of Mantri's copies, i.e., the fraction of time they finish before the original copy, improves over the earlier build, while Mantri expends fewer resources and starts fewer copies. Further, Mantri acts early: most of its copies are started well before the original task has completed its work, unlike with the earlier build.

Straggler Mitigation: To cross-check the above results on standard jobs, we ran four prototypical jobs with and without Mantri twenty times each. The figure above shows that job completion times improve and resource usage falls. The histograms plot the average reduction; the error bars are low and high percentiles of the samples. Further, we logged all the progress reports for these jobs and find that Mantri's predictor, based on reports from the recent past, estimates trem to within a small error of the actual completion time.

Placement of Tasks: To evaluate Mantri's network-aware spreading of reduce tasks, we ran GroupBy, a job with a long-running reduce phase, ten times on the production cluster. The table above shows that the reduce phase's completion time reduces substantially on average, causing the whole job to speed up as well. To understand why, we measure the spread of tasks, i.e., the ratio of the number of concurrent reduce tasks to the number of racks they ran in. A high spread implies that some racks have more tasks, which interfere with each other, while other racks are idle. Mantri's spread is markedly lower than that of the earlier build.

To compare against alternative schemes and to piece apart the gains from the various algorithms in Mantri, we

Figure: Comparing straggler mitigation strategies: (a) change in completion time, (b) change in resource usage. Mantri provides a greater speed-up in completion time while using fewer resources than existing schemes.

present results from the trace-driven simulator.

Can Mantri mitigate stragglers?

The figure above compares straggler mitigation strategies in their impact on completion time and resource usage. The y-axes weigh phases by their lifetime, since improving the longer phases improves cluster efficiency. The figures plot the cumulative reduction in these metrics for the phases of the production trace, each repeated thrice. For this section, the common baseline is a scheduler that takes no action on outliers; recall that the simulator replays the task durations and the anomalies observed in production.

The results show that Mantri improves completion time and reduces resource usage at both the median and the upper percentiles. At the median, Mantri speeds up phases by a sizeable margin over Hadoop, the next best scheme, which also needs more resources to achieve its smaller improvement. Map-Reduce and Dryad have no positive impact for a large fraction of the phases, and up to a point Dryad even increases the completion time of phases. LATE is similar in its time improvement to Hadoop but uses fewer resources.

The reason for the poor performance of these schemes is that they miss outliers that happen early in the phase and, by not knowing the true causes of outliers, the duplicates they schedule are mostly not useful.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 52 / 84

Page 58: Mapreduce Runtime Environments: Design, Performance, Optimizations

Subsection 2

Extending MapReduce Application Range

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 53 / 84

Page 59: Mapreduce Runtime Environments: Design, Performance, Optimizations

Twister: Iterative MapReduce

• Distinction on static and variable data
• Configurable long running (cacheable) map/reduce tasks
• Pub/sub messaging based communication/data transfers
• Efficient support for Iterative MapReduce computations (extremely faster than Hadoop or Dryad/DryadLINQ)
• Combine phase to collect all reduce outputs
• Data access via local disks
• Lightweight (~5600 lines of Java code)
• Support for typical MapReduce computations
• Tools to manage data

Ekanayake, Jaliya, et al. Twister: a Runtime for Iterative Mapreduce. International Symposium on High Performance Distributed Computing. HPDC 2010.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 54 / 84

Page 60: Mapreduce Runtime Environments: Design, Performance, Optimizations

Moon: MapReduce on Opportunistic Environments

• Considers volatile resources during their idle time
• Builds on the Hadoop runtime → but Hadoop does not tolerate massive failures
• The MOON approach:
  • considers hybrid resources: very volatile nodes plus more stable nodes
  • two queues are managed for speculative execution: frozen tasks and slow tasks
  • 2-phase task replication: the job is separated into a normal phase and a homestretch phase → the job enters the homestretch phase when the number of remaining tasks < H% of the available execution slots

Lin, Heshan, et al. MOON: MapReduce on Opportunistic Environments. International Symposium on High Performance Distributed Computing, HPDC 2010.
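As a rough sketch of the homestretch trigger described above (names and the extra replication factor are illustrative assumptions, not MOON's code):

    // Rough sketch of MOON's homestretch trigger (illustrative only).
    class HomestretchPolicySketch {
        private final double h; // the H% threshold, as a fraction (e.g. 0.2)

        HomestretchPolicySketch(double h) { this.h = h; }

        // normal phase while many tasks remain; homestretch once remaining tasks < H% of the slots
        boolean inHomestretch(int remainingTasks, int availableSlots) {
            return remainingTasks < h * availableSlots;
        }

        // in the homestretch phase each remaining task gets an extra replica (assumed factor of 2)
        int replicasFor(int remainingTasks, int availableSlots) {
            return inHomestretch(remainingTasks, availableSlots) ? 2 : 1;
        }
    }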

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 55 / 84

Page 61: Mapreduce Runtime Environments: Design, Performance, Optimizations

Towards Greener Data Centers

• Sustainable computing infrastructures which minimize the impact on the environment
• Reduction of energy consumption
• Reduction of greenhouse gas emissions
• Better use of renewable energy (smart grids, co-location, etc.)

Solar array at the Apple Maiden data center: about 100 acres (400 000 m2), 20 MW peak.
G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 56 / 84

Page 62: Mapreduce Runtime Environments: Design, Performance, Optimizations

Section 4

MapReduce beyond the Data Center

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 57 / 84

Page 63: Mapreduce Runtime Environments: Design, Performance, Optimizations

Introduction to Computational Desktop Grids

High Throughput Computing over Large Sets of Idle Desktop Computers

• Volunteer Computing Systems (SETI@Home, Distributed.net)
• Open Source Desktop Grids (BOINC, XtremWeb, Condor, OurGrid)
• Some numbers with BOINC (October 21, 2013):
  • Active: 0.5M users, 1M hosts
  • 24-hour average: 12 PetaFLOPS
  • EC2 equivalent: 1 b$/year

• Computing grid built from personal computers
• Harvest the unused computing power of PCs during their idle time
  • cycle stealing: screensaver-based
  • large-scale parallel applications
  • organization of computing-power "contests"
  • users must be able to measure their contribution (gift economy)

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 58 / 84

Page 64: Mapreduce Runtime Environments: Design, Performance, Optimizations

MapReduce over the Internet

Computing on Volunteers' PCs (a la SETI@Home)

• Potential of aggregate computing power and storage
  • Average PC: 1 TFlops, 4 GB RAM, 1 TB disk
  • 1 billion PCs, 100 million GPUs + tablets + smartphones + set-top boxes
• Hardware and electrical power are paid for by the consumers
• Self-maintained by the volunteers

But ...

• High number of resources
• Volatility
• Low performance
• Owned by volunteers
• Scalable, but mainly for embarrassingly parallel applications with few I/O requirements
• The challenge is to broaden the application domain

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 59 / 84


Page 66: Mapreduce Runtime Environments: Design, Performance, Optimizations

Subsection 1

Background on BitDew

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 60 / 84

Page 67: Mapreduce Runtime Environments: Design, Performance, Optimizations

Large Scale Data Management


BitDew: a Programmable Environment for Large Scale Data Management

Key Idea 1: provides an API and a runtime environment which integrates several P2P technologies in a consistent way.

Key Idea 2: relies on metadata (Data Attributes) to drive data management operations transparently: replication, fault-tolerance, distribution, placement, life-cycle.

BitDew: A Programmable Environment for Large-Scale Data Management and Distribution. Gilles Fedak, Haiwu He and Franck Cappello, International Conference for High Performance Computing (SC'08), Austin, 2008.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 61 / 84

Page 68: Mapreduce Runtime Environments: Design, Performance, Optimizations

BitDew : the Big Cloudy Picture

• Aggregates storage in a single Data Space:
  • Clients put and get data from the data space
  • Clients define data attributes
• Distinguishes service nodes (stable) from client and Worker nodes (volatile)
• Service: ensures fault tolerance, indexing and scheduling of data to Worker nodes
• Worker: stores data on Desktop PCs
• push/pull protocol between client → service ← Worker

Figure: clients put and get data into/from the Data Space; a data attribute such as REPLICA = 3 drives its distribution onto Worker nodes.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 62 / 84


Page 70: Mapreduce Runtime Environments: Design, Performance, Optimizations

Data Attributes

REPLICA: indicates how many occurrences of a datum should be available at the same time in the system
RESILIENCE: controls the resilience of data in presence of machine crashes
LIFETIME: a duration, absolute or relative to the existence of other data, which indicates when a datum is obsolete
AFFINITY: drives movement of data according to dependency rules
TRANSFER PROTOCOL: gives the runtime environment hints about the file transfer protocol appropriate to distribute the data
DISTRIBUTION: indicates the maximum number of pieces of data with the same attribute that should be sent to a particular node
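For illustration, a tiny Java stand-in shows how such an attribute could be declared for a map input chunk; the field names mirror the list above, but this is not BitDew's actual API:

    // Illustrative stand-in for a BitDew-style attribute (not BitDew's actual API).
    class DataAttributeSketch {
        int replica = 1;            // REPLICA: occurrences wanted at the same time in the system
        boolean resilience = false; // RESILIENCE: reschedule the data if the hosting machine crashes
        long lifetime = -1;         // LIFETIME: when the datum becomes obsolete (-1: never)
        String affinity = null;     // AFFINITY: name of the data this datum should follow
        String protocol = "http";   // TRANSFER PROTOCOL: hint (http, ftp, bittorrent, ...)
        int distribution = 1;       // DISTRIBUTION: max pieces with this attribute per node

        public static void main(String[] args) {
            // a map input chunk: 3 live copies, crash-resilient, spread with BitTorrent
            DataAttributeSketch mapInput = new DataAttributeSketch();
            mapInput.replica = 3;
            mapInput.resilience = true;
            mapInput.protocol = "bittorrent";
            System.out.println("replica=" + mapInput.replica + " protocol=" + mapInput.protocol);
        }
    }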

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 63 / 84

Page 71: Mapreduce Runtime Environments: Design, Performance, Optimizations

Architecture Overview

Figure: BitDew architecture. Applications use the programming APIs (Active Data, BitDew, Transfer Manager) and a command-line tool on top of the BitDew runtime environment; the runtime services (Data Catalog, Data Repository, Data Scheduler, Data Transfer) run in a service container on master/worker storage nodes and rely on pluggable back-ends (SQL server, DHT, file system, Http/Ftp/BitTorrent transfer protocols).

• Programming APIs to create Data and Attributes, manage file transfers and program applications
• Services (DC, DR, DT, DS) to store, index, distribute, schedule, transfer and provide resiliency to the data
• Several information storage back-ends and file transfer protocols

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 64 / 84

Page 72: Mapreduce Runtime Environments: Design, Performance, Optimizations

Anatomy of a BitDew Application

• Client:
  1 creates Data and Attributes
  2 puts files into the created data slots
  3 schedules the data with their corresponding attribute
• Reservoir:
  • nodes are notified when data are locally scheduled (onDataScheduled) or deleted (onDataDeleted)
  • programmers install callbacks on these events
• Clients and Reservoirs can both create data and install callbacks

Figure: the client creates, puts and schedules data through the service nodes (dc/ds, dr/dt); Reservoir nodes pull the scheduled data from the Data Space.
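A hedged sketch of this event-driven worker side follows; the two callbacks come from the slide, while the listener interface itself is an assumption rather than BitDew's exact API:

    // Illustrative listener for the two events named on the slide (interface shape is assumed).
    interface DataLifecycleListenerSketch {
        void onDataScheduled(String dataId, String localPath); // a datum has been placed on this node
        void onDataDeleted(String dataId);                     // a datum has been removed or became obsolete
    }

    class MapWorkerCallbacksSketch implements DataLifecycleListenerSketch {
        @Override
        public void onDataScheduled(String dataId, String localPath) {
            // e.g. a map input chunk just arrived: enqueue a Map task on the local file
            System.out.println("chunk " + dataId + " scheduled here, mapping " + localPath);
        }

        @Override
        public void onDataDeleted(String dataId) {
            // drop any cached results derived from this chunk
            System.out.println("chunk " + dataId + " deleted, cleaning up");
        }
    }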

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 65 / 84


Page 74: Mapreduce Runtime Environments: Design, Performance, Optimizations

Examples of BitDew Applications

• Data-Intensive Applications
  • DI-BOT: data-driven master-worker Arabic character recognition (M. Labidi, University of Sfax)
  • MapReduce (X. Shi, L. Lu, HUST, Wuhan, China)
  • Framework for distributed result checking (M. Moca, G. S. Silaghi, Babes-Bolyai University of Cluj-Napoca, Romania)
• Data Management Utilities
  • File sharing for social networks (N. Kourtellis, Univ. Florida)
  • Distributed checkpoint storage (F. Bouabache, Univ. Paris XI)
  • Grid data staging (IEP, Chinese Academy of Science)

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 66 / 84

Page 75: Mapreduce Runtime Environments: Design, Performance, Optimizations

Experiment Setup

Hardware Setup

• Experiments are conducted in a controlled environment (a cluster for the microbenchmarks and Grid5K for the scalability tests) as well as on an Internet platform
• GdX cluster: 310-node cluster of dual Opteron CPUs (AMD-64, 2 GHz, 2 GB SDRAM), 1 Gb/s Ethernet switches
• Grid5K: an experimental Grid of 4 clusters including GdX
• DSL-lab: low-power, low-noise cluster hosted by real Internet users on DSL broadband networks; 20 Mini-ITX nodes, Pentium-M 1 GHz, 512 MB SDRAM, with 2 GB Flash storage
• Each experiment is averaged over 30 measures

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 67 / 84

Page 76: Mapreduce Runtime Environments: Design, Performance, Optimizations

Fault-Tolerance Scenario

Figure: Gantt chart over 10 DSL-Lab nodes (DSL01 to DSL10), whose measured bandwidths range from 53 KB/s to 492 KB/s, over a 120-second run.

Evaluation of BitDew in presence of host failures. Scenario: a datum is created with the attributes REPLICA = 5, FAULT_TOLERANCE = true and PROTOCOL = "ftp". Every 20 seconds, we simulate a machine crash by killing the BitDew process on one machine owning the data. The experiment runs in the DSL-Lab environment.

Figure: the Gantt chart presents the main events: a blue box shows the download duration, a red box the waiting time, and a red star indicates a node crash.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 68 / 84

Page 77: Mapreduce Runtime Environments: Design, Performance, Optimizations

Collective File Transfer with Replication

Figure: ALL_TO_ALL collective over data d1, d2, d3, d4; every node ends up with d1 + d2 + d3 + d4.

Scenario: all-to-all file transfer with replication.

• {d1, d2, d3, d4} is created, where each datum has the REPLICA attribute set to 4
• The all-to-all collective is implemented with the AFFINITY attribute. Each data attribute is modified on the fly so that d1.affinity=d2, d2.affinity=d3, d3.affinity=d4, d4.affinity=d1.
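A sketch of how the affinity ring can be built (illustrative stand-in types, not the BitDew API):

    import java.util.List;

    // Build the affinity ring used for the all-to-all collective: d1.affinity=d2, ..., dn.affinity=d1.
    class AllToAllCollectiveSketch {
        static class Datum {
            final String name;
            String affinity; // name of the datum this one should be co-located with
            Datum(String name) { this.name = name; }
        }

        static void buildRing(List<Datum> data) {
            for (int i = 0; i < data.size(); i++) {
                data.get(i).affinity = data.get((i + 1) % data.size()).name;
            }
        }
    }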

Figure: cumulative distribution of the achieved bandwidth (KBytes/sec) for the all-to-all collective.

Results: experiments on DSL-Lab

• one run every hour during approximately 12 hours
• during the collective, file transfers are performed concurrently; the data size is 5 MB
• Cumulative Distribution Function (CDF) plot for the all-to-all collective

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 69 / 84

Page 78: Mapreduce Runtime Environments: Design, Performance, Optimizations

Multi-sources and Multi-protocols File Distribution

Scenario

• 4 clusters of 50 nodes each, each cluster hosting its own data repository (DC/DR)
• first phase: a client uploads a large datum to the 4 data repositories
• second phase: the cluster nodes get the datum from the data repository of their cluster

We vary:

1 the number of data repositories, from 1 to 4
2 the protocol used to distribute the data (BitTorrent vs FTP)

Figure: in the 1st phase the client uploads the datum to the four repositories (dr1/dc1 to dr4/dc4); in the 2nd phase the nodes of clusters C1 to C4 download it from their local repository.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 70 / 84

Page 79: Mapreduce Runtime Environments: Design, Performance, Optimizations

Multi-sources and Multi-protocols File Distribution

                 BitTorrent (MB/s)                              FTP (MB/s)
                 orsay   nancy   bordeaux  rennes  mean         orsay   nancy   bordeaux  rennes  mean
1st phase
  1DR/1DC        30.34                             30.34        98.50                             98.50
  2DR/2DC        29.79   12.35                     21.07        73.85   49.86                     61.86
  4DR/4DC        22.07   10.36   11.51     9.72    13.41        49.41   22.95   26.19     22.89   30.36
2nd phase
  1DR/1DC        3.93    3.93    3.78      3.96    3.90         0.72    0.70    0.56      0.60    0.65
  2DR/2DC        4.04    4.88    9.50      5.04    5.86         1.43    1.38    1.14      1.29    1.31
  4DR/4DC        8.46    7.58    7.65      5.14    7.21         2.90    2.78    2.33      2.54    2.64

• for both protocols, increasing the number of data sources also increases the available bandwidth
• FTP performs better in the first phase, and BitTorrent gives superior performance in the second phase
• the best combination (4DC/4DR, FTP for the first phase and BitTorrent for the second phase) takes 26.46 seconds and outperforms the best single-source, single-protocol (BitTorrent) configuration by 168.6%

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 71 / 84

Page 80: Mapreduce Runtime Environments: Design, Performance, Optimizations

Subsection 2

MapReduce for Desktop Grid Computing

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 72 / 84

Page 81: Mapreduce Runtime Environments: Design, Performance, Optimizations

Challenge of MapReduce over the Internet

Figure: MapReduce dataflow over the Internet (data split → Mappers → intermediate results → Reducers → outputs → combined final output).

• no shared file system nor direct communication between hosts
• faults and host churn
• needs data replica management
• result certification of intermediate data
• collective operations (scatter + gather/reduction)

Towards MapReduce for Desktop Grids. B. Tang, M. Moca, S. Chevalier, H. He, G. Fedak, in Proceedings of the Fifth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), Fukuoka, Japan, November 2010.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 73 / 84

Page 82: Mapreduce Runtime Environments: Design, Performance, Optimizations

Implementing MapReduce over BitDew (1/2)

Initialisation by the Master

1 Data slicing: the Master creates a DataCollection DC = d1, . . . , dn, a list of data chunks sharing the same MapInput attribute.
2 Creates Reducer tokens, each one with its own TokenReducX attribute (affinity set to the Reducer).
3 Creates and pins Collector tokens with the TokenCollector attribute (affinity set to the Collector Token).

Algorithm 2: EXECUTION OF MAP TASKS
Require: Map, the Map function to execute
Require: M, the number of Mappers, and m, a single mapper
Require: d_m = d_{1,m}, ..., d_{k,m}, the set of map input data received by worker m
  {on the Master node}
  create a single data MapToken with affinity set to DC
  {on the Worker node}
  if MapToken is scheduled then
    for all data d_{j,m} in d_m do
      execute Map(d_{j,m})
      create list_{j,m}(k, v)
    end for
  end if

Algorithm 3: SHUFFLING INTERMEDIATE RESULTS
Require: M, the number of Mappers, and m, a single worker
Require: R, the number of Reducers
Require: list_m(k, v), the set of (key, value) pairs of intermediate results on worker m
  {on the Master node}
  for all r in [1, ..., R] do
    create Attribute ReduceAttr_r with distrib = 1
    create data ReduceToken_r
    schedule data ReduceToken_r with attribute ReduceTokenAttr
  end for
  {on the Worker node}
  split list_m(k, v) into intermediate files if_{m,1}, ..., if_{m,R}
  for all files if_{m,r} do
    create reduce input data ir_{m,r} and upload if_{m,r}
    schedule data ir_{m,r} with affinity = ReduceToken_r
  end for

To balance the workload among workers, the ReduceTokenAttr carries the distrib = 1 flag, which ensures a fair distribution of the workload between reducers. The Shuffle phase is then simply implemented by scheduling the partitioned intermediate data with an attribute whose affinity tag is equal to the corresponding ReduceToken.

Algorithm 4 presents the Reduce phase.

Algorithm 4: EXECUTION OF REDUCE TASKS
Require: Reduce, the Reduce function to execute
Require: R, the number of Reducers, and r, a single reducer
Require: ir_r = ir_{1,r}, ..., ir_{m,r}, the set of intermediate results received by reducer r
  {on the Master node}
  create a single data MasterToken
  pinAndSchedule(MasterToken)
  {on the Worker node}
  if ir_{m,r} is scheduled then
    execute Reduce(ir_{m,r})
    if all ir_r have been processed then
      create o_r with affinity = MasterToken
    end if
  end if
  {on the Master node}
  if all o_r have been received then
    combine the o_r into a single result
  end if

When a reducer, that is, a worker which holds the ReduceToken, starts to receive intermediate results, it calls the Reduce function on the (k, list(v)) pairs. Once all the intermediate files have been received, all the values have been processed for every key. If the user wishes, all the results can be brought back to the master and eventually combined. To proceed to this last step, the worker creates a MasterToken data, which is pinnedAndScheduled: the MasterToken is known to the BitDew scheduler but will not be sent to any node in the system. Instead, MasterToken is pinned on the Master node, which allows the results of the Reduce tasks, scheduled with an affinity tag set to MasterToken, to be sent to the Master.

D. Desktop Grid Features

We now detail some of the high-level features that have been developed to address the characteristics of Desktop Grid systems.

Latency hiding: Because computing resources are spread over the Internet, and because we cannot assume that direct communication between hosts is always possible due to firewall settings, the communication latency can be orders of magnitude higher than the latency of the interconnection networks found in clusters. One of the mechanisms to hide high latency in parallel systems is overlapping communication with computation and prefetching data. We have designed a multi-threaded worker (see Figure 3) which can process several concurrent file transfers, even if the protocol used is synchronous, such as HTTP. As soon as a file transfer finishes, the corresponding Map or Reduce tasks are enqueued and can be processed concurrently. The maximum number of concurrent Map and Reduce tasks can be configured, as well as the minimum number of tasks in the queue before computations start. In MapReduce, prefetching data is natural, as data are staged on the distributed file system before computations are launched.

Figure 3. The worker handles the following threads: Mapper, Reducer and ReducePutter. Q1, Q2 and Q3 represent queues and are used for communication between the threads.

Collective file operation: MapReduce processing is composed of several collective file operations which, in some aspects, look similar to collective communications in parallel programming such as MPI. Namely, the collectives are the Distribution, the initial distribution of file chunks (similar to MPI_Scatter), the Shuffle, that is, the redistribution of intermediate results between the Mapper and the Reducer nodes (similar to MPI_AlltoAll), and the Combine, which is the assemblage of the final result on the master node (similar to MPI_Gather). We have augmented the BitDew middleware with the notion of DataCollection, to manipulate a set of data as a whole, and DataChunk, to manipulate a part of the data individually, in order to implement the collectives efficiently.

Fault-tolerance: In Desktop Grids, computing resources have high failure rates, therefore the execution runtime must be resilient to a massive number of crash failures. In the context of MapReduce,

Execution of Map Tasks

1 When a Worker receives the MapToken, it launches the corresponding Map tasks and computes the intermediate values list_{j,m}(k, v).
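A compact sketch of this worker-side loop, specialized to a word-count style Map (illustrative types, not the actual BitDew-MapReduce code):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Worker-side loop of Algorithm 2 for a word-count style Map (illustrative).
    class MapExecutorSketch {
        // Map(d_{j,m}): produce the per-chunk list_{j,m}(k, v) as word counts
        static Map<String, Integer> map(String chunkContent) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : chunkContent.split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
            return counts;
        }

        // called when the MapToken is scheduled on this worker: map every locally held chunk
        static List<Map<String, Integer>> onMapTokenScheduled(List<String> localChunks) {
            List<Map<String, Integer>> intermediate = new ArrayList<>();
            for (String chunk : localChunks) {
                intermediate.add(map(chunk));
            }
            return intermediate;
        }
    }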

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 74 / 84

Page 83: Mapreduce Runtime Environments: Design, Performance, Optimizations

Implementing MapReduce over BitDew (2/2)


Shuffling Intermediate Results

1 The Worker partitions its intermediate results according to the partition function and the number of reducers, and creates ReduceInput data with the attribute ReduceToken_r.

Execution of Reduce Tasks

1 When a reducer receives ReduceInput files ir_{m,r}, it launches the corresponding Reduce task Reduce(ir_{m,r}).
2 When all the ir_r have been processed, the reducer combines the results into a single final result O_r.
3 Eventually, the final result can be sent back to the master using the TokenCollector attribute.
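A sketch of the partition step behind the shuffle; hashing the key modulo the number of reducers is an assumed partition function, the slide only requires that some partition function splits list_m(k, v) into R parts:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Split list_m(k, v) into R partitions if_{m,1..R}, one per reducer (illustrative).
    class ShufflerSketch {
        static List<Map<String, Integer>> partition(Map<String, Integer> listM, int numReducers) {
            List<Map<String, Integer>> parts = new ArrayList<>();
            for (int r = 0; r < numReducers; r++) parts.add(new HashMap<>());
            for (Map.Entry<String, Integer> kv : listM.entrySet()) {
                int r = Math.floorMod(kv.getKey().hashCode(), numReducers);
                parts.get(r).put(kv.getKey(), kv.getValue());
            }
            // each parts.get(r) would then be uploaded as ir_{m,r} with affinity = ReduceToken_r
            return parts;
        }
    }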

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 75 / 84

Page 84: Mapreduce Runtime Environments: Design, Performance, Optimizations

Handling Communications

Latency Hiding

• Multi-threaded worker to overlap communication and computation.
• The maximum number of concurrent Map and Reduce tasks can be configured, as well as the minimum number of tasks in the queue before computations start.

Barrier-free computation

• Reducers detect duplication of intermediate results (which happens because of faults and/or lags).
• Early reduction: intermediate results are processed as they arrive → this allowed us to remove the barrier between Map and Reduce tasks.
• But ... intermediate results are not sorted.
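A sketch of barrier-free early reduction with duplicate filtering; it assumes the reduction is associative and commutative (which is also why the intermediate results end up unsorted) and uses the intermediate-result identifier as the deduplication key:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Reduce intermediate results as they arrive, ignoring duplicates (illustrative).
    class EarlyReducerSketch {
        private final Set<String> seen = new HashSet<>();             // intermediate results already folded in
        private final Map<String, Integer> partial = new HashMap<>(); // running totals

        // called for every incoming intermediate result; returns false when it was a duplicate
        synchronized boolean onIntermediateResult(String resultId, Map<String, Integer> counts) {
            if (!seen.add(resultId)) return false;
            counts.forEach((k, v) -> partial.merge(k, v, Integer::sum)); // reduce immediately, no barrier
            return true;
        }

        synchronized Map<String, Integer> snapshot() { return new HashMap<>(partial); }
    }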

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 76 / 84

Page 85: Mapreduce Runtime Environments: Design, Performance, Optimizations

Scheduling and Fault-tolerance

2-level scheduling

1 Data placement is ensured by the BitDew scheduler, which is mainly guided by the data attributes.
2 Workers periodically report the state of their ongoing computation to the MR-scheduler running on the master node.
3 The MR-scheduler determines whether there are more nodes available than tasks to execute, which helps avoid the lagger effect.

Fault-tolerance

• In Desktop Grids, computing resources have high failure rates:
  → during the computation (either execution of Map or Reduce tasks)
  → during the communication, that is, file upload and download.
• MapInput data and the ReduceToken tokens have the resilient attribute enabled.
• Because the intermediate results ir_{j,r}, j ∈ [1, m], have their affinity set to ReduceToken_r, they are automatically downloaded by the node on which the ReduceToken data is rescheduled.
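A sketch of the recovery path for a crashed reducer, written against an assumed data-index interface rather than BitDew's real services:

    import java.util.List;

    // The resilient ReduceToken is rescheduled and every intermediate result whose affinity
    // points to it follows automatically (illustrative, not BitDew's actual code).
    class ReducerRecoverySketch {
        interface DataIndex {
            List<String> withAffinityTo(String tokenId);   // all ir_{j,r} attached to a ReduceToken
            void scheduleOn(String dataId, String nodeId); // ask a node to pull a datum
        }

        static void onReducerCrash(String reduceTokenId, String replacementNode, DataIndex index) {
            index.scheduleOn(reduceTokenId, replacementNode);
            for (String ir : index.withAffinityTo(reduceTokenId)) {
                index.scheduleOn(ir, replacementNode);     // ir_{j,r} re-downloaded on the new reducer
            }
        }
    }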

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 77 / 84

Page 86: Mapreduce Runtime Environments: Design, Performance, Optimizations

Data Security and Privacy

Distributed Result Checking

• Traditional DG or VC projects implement result checking on the server.
• Intermediate results are too large to be sent back to the server → distributed result checking:
  • replicate the MapInput data and the Reducers
  • select correct results by majority voting

Figure 1. Dataflow: (a) no replication, (b) replication of the Map tasks.

The Map inputs are replicated: the set RF = {xi | xi = {fi,k | k = 1, rm}, i ∈ N} is distributed to rm distinct workers as input for Map tasks. Consequently, a Reducer receives rm versions of the intermediate files corresponding to a map input fi. After receiving rm/2 + 1 identical versions, the Reducer considers the respective result correct and accepts it as input for a Reduce task. Figure 1(b) illustrates the dataflow corresponding to a MapReduce execution with rm = 3.

In order to activate the checking mechanism for the results received from Reducers, the Master node schedules the ReduceTokens T with the attribute replica = rn (rn > 2, rn ∈ N). Consequently, the BitDew services distribute a set of copies of T, RT = {yj | yj = {tj,p | p = 1, rn}}, to rn distinct nodes. As an effect, rn workers perform Reduce tasks with identical input for the partition tj. Finally, the Master node receives rn versions of a result rfj and selects the correct one.

Figure 2 presents the dataflow corresponding to a MapReduce execution where the system checks the results produced by both Mappers and Reducers; in this scenario rm = 3 and rn = 3.

It is possible that some malicious workers send superfluous intermediate results, in the sense that the intermediate result is entirely fabricated and thus is not the output corresponding to a Map input fn. Reducers can easily inhibit this type of behavior by filtering the received intermediate results. Hence, we implement a naming scheme which allows the Reducer to check whether an intermediate result corresponds to a Map input fn.

The naming scheme works as follows. Before distributing the Map input F to workers, the Master creates a set of codes K = {kn | kn = δ(FileName(fn))}, where kn represents a code (such as given by an MD5 hash function) corresponding to fn. The list accompanies each tj scheduled to workers. Further, for each fn received, the Mapper computes the code kn and attaches this value to ifn,m, which is sent to the appropriate Reducer. Finally, the Reducer checks whether the accompanying code kn of a received intermediate result is present in its own code list K, ignoring failed results.

Evaluation of the result checking mechanism. We now describe the model for characterizing errors when applying the majority voting method (MVM) for MapReduce in Desktop Grids.

Notations:
• f: the fraction of error-prone nodes; such nodes randomly produce erroneous results, independently of the behavior of other nodes. The probability to select such a faulty worker is 0 < f < 0.5.
• s: the sabotage rate of faulty / malicious workers. Given that a worker is faulty, it produces an erroneous result with probability 0 < s < 1. Thus, in general, the probability of obtaining an erroneous result from a worker is fs.
• NT: the number of ReduceTokens. For a given input fn, a Mapper outputs a result value for each token tj.
• NF: the number of distinct input files (F).

Figure 2. Dataflow with both Map and Reduce replication (the result-checking mechanism applied to mappers and reducers).

• rM: the replication degree of the Map input files F. A Reducer receives rM versions of the intermediate result corresponding to fn in order to accept it as correct; thus rM distinct nodes are needed to Map the input fn.
• rR: the replication degree of the intermediate results. Thus, rR distinct nodes Reduce the rR versions of an intermediate result.

In the following we present the probabilistic scheme that we apply to our system. Suppose that the Reducer is honest (does not sabotage) and performs the result verification on the rM versions of the intermediate results corresponding to a Map input fn. The probability for the Reducer to report an erroneous result after applying MVM is [2]:

ε_M(f, s, r_M) = Σ_{j = r_M/2 + 1}^{r_M} C(r_M, j) (fs)^j (1 − fs)^{r_M − j}    (1)

If this error happens for at least one Map input fn, the Reducer produces an error. Thus, for the set of input files F, the probability for an honest reducer to produce an error is:

ε_T = Σ_{i = 1}^{N_F} C(N_F, i) ε_M^i (1 − ε_M)^{N_F − i} = 1 − (1 − ε_M)^{N_F}    (2)

If we now assume that the Reducer sabotages, then regardless of whether the incoming intermediate results are correct or not, it will produce an error. Further, the Master validates by majority voting the sets of rR versions of the final results corresponding to each ReduceToken tj. The probability that a replica of an intermediate result is scheduled to an erroneous Reducer is fs, and to an honest Reducer is (1 − fs); the probability that an honest Reducer reports an erroneous result because it obtained erroneous intermediate results from Mappers is ε_T, so the probability that an honest Reducer produces an error is ε_T (1 − fs). Consequently, the probability that a Reducer outputs an erroneous result is ε_R = fs + ε_T (1 − fs). The error rate for majority voting when checking the final results corresponding to a partition tj is then:

ε = Σ_{j = r_R/2 + 1}^{r_R} C(r_R, j) ε_R^j (1 − ε_R)^{r_R − j}    (3)

Now, if the general (aggregated) error rate E is obtained when at least one of the validations for a token tj is erroneous, then:

E = Σ_{i = 1}^{N_T} C(N_T, i) ε^i (1 − ε)^{N_T − i} = 1 − (1 − ε)^{N_T}    (4)

Experiments and results: the theoretical evaluation of this model with the MVM method follows the variation of the aggregated error rate E while its key parameters are modified.
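A small numeric sketch of equations (1)-(4); the parameter values in main are made up for illustration:

    // Evaluate the aggregated error rate E of the majority-voting model (illustrative).
    class MajorityVotingErrorSketch {
        // P[strictly more than n/2 of n independent trials are erroneous], each with probability p
        static double majorityErrorRate(double p, int n) {
            double sum = 0.0;
            for (int j = n / 2 + 1; j <= n; j++) {
                sum += binomial(n, j) * Math.pow(p, j) * Math.pow(1 - p, n - j);
            }
            return sum;
        }

        static double binomial(int n, int k) {
            double c = 1.0;
            for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
            return c;
        }

        static double aggregatedError(double f, double s, int rM, int rR, int nF, int nT) {
            double epsM = majorityErrorRate(f * s, rM);      // eq. (1): one map input wrongly accepted
            double epsT = 1.0 - Math.pow(1.0 - epsM, nF);    // eq. (2): at least one of the N_F inputs fails
            double epsR = f * s + epsT * (1.0 - f * s);      // reducer is faulty, or honest but fed bad input
            double eps  = majorityErrorRate(epsR, rR);       // eq. (3): one ReduceToken wrongly validated
            return 1.0 - Math.pow(1.0 - eps, nT);            // eq. (4): at least one of the N_T tokens fails
        }

        public static void main(String[] args) {
            // made-up example: 10% faulty nodes sabotaging half the time, 3-way replication,
            // 100 map input files and 8 reduce tokens
            System.out.printf("E = %.6f%n", aggregatedError(0.10, 0.5, 3, 3, 100, 8));
        }
    }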


Ensuring Data Privacy

• Use a hybrid infrastructure composed of private and public resources
• Use an IDA (Information Dispersal Algorithm) approach to distribute and store the data securely

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 78 / 84

Page 87: Mapreduce Runtime Environments: Design, Performance, Optimizations

BitDew Collective Communication

Figure 5. Execution time of the Broadcast (FTP and BitTorrent), Scatter and Gather collectives for a 2.7 GB file and a varying number of nodes (1 to 128).

The next experiment evaluates the performance of the collective file operations. Considering that we need enough chunks to measure the Scatter/Gather collectives, we selected 16 MB as the chunk size for the Scatter/Gather measurements and 48 MB for the broadcast. The benchmark consists of the following: we split a 2.7 GB file into chunks and measure the time to transfer the chunks to all of the nodes according to the collective scheme. The broadcast measurement is conducted using both FTP and BitTorrent, while the FTP protocol is used for the Scatter/Gather measurements. The broadcast, scatter and gather results are presented in Figure 5. The time to broadcast the data increases linearly with the number of nodes because nodes compete for network bandwidth, and BitTorrent obtains better efficiency than FTP for the same number of nodes. With respect to gather/scatter, each node only downloads/uploads part of all chunks, and the time remains constant because the amount of data transferred does not increase with the number of nodes.

B. MapReduce Evaluation

We now evaluate the performance of our implementation of MapReduce. The benchmark used is the WordCount application, a representative example of a MapReduce application, adapted from the Hadoop distribution. WordCount counts the number of occurrences of each word in a large collection of documents. The file transfer protocol used is HTTP.

The first experiment evaluates the scalability of our implementation when the number of nodes increases. Each node has a different 5 GB file to process, split into 50 local chunks. Thus, when the number of nodes doubles, the size of the whole document collection doubles too. For 512 nodes (GdX has 356 dual-core nodes, so to measure the performance on 512 nodes we run two workers per node on 256 nodes), the benchmark processes 2.5 TB of data and executes 50 000 Map and Reduce tasks. Figure 6 presents the throughput of the WordCount benchmark in MB/s versus the number of worker nodes. This result shows the scalability of our approach and illustrates the potential of using Desktop Grid resources to process a vast amount of data.

The next experiment aims at evaluating the impact of a varying number of mappers and reducers. Table II presents the time spent in the Map function, the time spent in the Reduce function and the total makespan of WordCount for a number of mappers varying from 4 to 32 and a number of reducers varying from 1 to 16. As expected, the Map and Reduce times decrease when the number of mappers and reducers increases. The difference between the makespan and

Figure 6. Scalability evaluation of the WordCount application: the y axis presents the throughput in MB/s and the x axis the number of nodes, varying from 1 to 512.

Table II. Evaluation of the performance according to the number of mappers and reducers.

#Mappers           4      8      16     32     32     32     32
#Reducers          1      1      1      1      4      8      16
Map (sec.)         892    450    221    121    123    121    125
Reduce (sec.)      4.5    4.5    4.3    4.4    1      0.5
Makespan (sec.)           473    246    142    146    144    150

the Map plus Reduce time is explained by the communications and by the time elapsed in waiting loops. Although the Reduce time seems very low compared to the Map time, this is typical of MapReduce applications: a survey [16] of scientific MapReduce applications at the Yahoo research cluster showed that more than 93% of the codes were Map-only or Map-mostly applications.

C. Desktop Grid Scenario

In this section, we emulate a Desktop Grid on the GdX cluster by confronting our prototype with scenarios involving host crashes, lagger hosts and slow network connections.

The first scenario tests whether our system is fault tolerant, that is, whether the remaining participants are able to terminate the MapReduce application when a fraction of the system fails. To demonstrate this capacity, we propose a scenario where workers crash at different times: the first fault (F1) is a worker node crash while downloading a map input file, the second fault (F2) occurs during the map phase, and the third crash (F3) happens after the worker has performed its map and reduce tasks. We execute the scenario and emulate worker crashes by killing the worker process. Figure 7 reports the events as they were measured during the execution of the scenario in a Gantt chart. We denote by w1 to w5 the workers and by m the master node. The execution begins with the master node uploading and scheduling two reduce token files, Ut1 and Ut2; worker w1 receives t1 and worker w2 receives t2. Then the master node uploads and schedules the map input chunks (UC1-5), and each worker downloads one such chunk (DC1-5). Node w4 fails (F1) while downloading map input chunk C4. As the BitDew scheduler periodically checks whether participants are still present, shortly after the failure node w4 is considered to have failed, which leads to rescheduling C4 on node w2. Node w3 fails (F2) while performing the map task M(C3); chunk C3 is then rescheduled to node w5. At F3, node w1 fails after having already performed the map task M(C1) and several reduce tasks: RF1,1, RF1,2 and RF1,5. The notation RFp,k refers to the reduce task which takes as input the intermediate

Figure: Execution time of the Broadcast, Scatter and Gather collectives for a 2.7 GB file and a varying number of nodes.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 79 / 84

Page 88: Mapreduce Runtime Environments: Design, Performance, Optimizations

MapReduce Evaluation

!

"!!

#!!!

#"!!

$!!!

$"!!

%!!!

# & ' #$ #( $& %$ &' (& '! )( #$'

*+,-./

012.3/4

56,7-87/9:012.:3;0<4

56,7-87/9:012.:35190,66.+94

=8799.6:012.

>79?.6:012.

Figure 5. Execution time of the Broadcast, Scatter, Gather collective for a 2.7GBfile to a varying number of nodes.

The next experiment evaluates the performance of the collectivefile operation. Considering that we need enough chunks to measurethe Scatter/Gather collective, we selected 16MB as the chunks sizefor Scatter/Gather measurement and 48MB for the broadcast. Thebenchmark consists of the following: we prepare a 2.7GB large fileinto chunks and measure the time to transfer the chunks to all of thenodes according to the collective scheme. Broadcast measurementis conducted using both FTP and BitTorrent, while FTP protocol isused for Scatter/Gather measurement. The broadcast, scatter, gatherresults are presented respectively in the Figure 5. The time to broad-cast the data linearly increases with the number of nodes becausenodes compete for network bandwidth, and using BitTorrent obtainsa better efficiency than using FTP in the same number of nodes. Withrespect to the gather/scatter, each node only downloads/uploads partof all chunks, and the time remain constant because the amount ofdata transferred does not increase with the number of nodes.

B. MapReduce Evaluation

We evaluate now the performance of our implementation ofMapReduce. The benchmark used is the WordCount application,which is a representative example of MapReduce application,adapted from the Hadoop distribution. WordCount counts the num-ber of occurrences of each word in a large collection of documents.The file transfer protocol used is HTTP protocol.

The first experiment evaluates the scalability of our implementation when the number of nodes increases. Each node has a different 5GB file to process, split into 50 local chunks. Thus, when the number of nodes doubles, the size of the whole document collection doubles too. For 512 nodes, the benchmark processes 2.5TB of data and executes 50,000 Map and Reduce tasks. Figure 6 presents the throughput of the WordCount benchmark in MB/s versus the number of worker nodes. This result shows the scalability of our approach and illustrates the potential of using Desktop Grid resources to process a vast amount of data.

The next experiment evaluates the impact of a varying number of mappers and reducers. Table II presents the time spent in the Map function, the time spent in the Reduce function and the total makespan of WordCount for a number of mappers varying from 4 to 32 and a number of reducers varying from 1 to 16. As expected, the Map and Reduce times decrease when the number of mappers and reducers increases. The difference between the makespan and

GdX has 356 dual-core nodes, so to measure the performance on 512 nodes we run two workers per node on 256 nodes.

[Plot: WordCount throughput (MB/s) versus the number of nodes.]

Figure 6. Scalability evaluation on the WordCount application: the y axis presentsthe throughput in MB/s and the x axis the number of nodes varying from 1 to 512.

TABLE II. EVALUATION OF THE PERFORMANCE ACCORDING TO THE NUMBER OF MAPPERS AND REDUCERS.

#Mappers            4      8      16     32     32     32     32
#Reducers           1      1      1      1      4      8      16
Map (sec.)          892    450    221    121    123    121    125
Reduce (sec.)       4.5    4.5    4.3    4.4    –      1      0.5
Makespan (sec.)     –      473    246    142    146    144    150

the Map plus Reduce time is explained by the communication and the time elapsed in waiting loops. Although the Reduce time seems very low compared to the Map time, this is typical of MapReduce applications. A survey [16] of scientific MapReduce applications at the Yahoo research cluster showed that more than 93% of the codes were Map-only or Map-mostly applications.

C. Desktop Grid Scenario

In this section, we emulate a Desktop Grid on the GdX cluster by confronting our prototype with scenarios involving host crashes, lagger hosts and slow network connections.

The first scenario tests whether our system is fault tolerant, that is, whether, when a fraction of the system fails, the remaining participants are able to terminate the MapReduce application. To demonstrate this capacity, we propose a scenario where workers crash at different times: the first fault (F1) is a worker node crash while downloading a map input file, the second fault (F2) occurs during the map phase and the third crash (F3) happens after the worker has performed its map and reduce tasks. We execute the scenario and emulate worker crashes by killing the worker process. In Figure 7 we report the events as they were measured during the execution of the scenario in a Gantt chart. We denote w1 to w5 the workers and m the master node. The execution of our experiment begins with the master node uploading and scheduling two reduce token files, Ut1 and Ut2. Worker w1 receives t1 and worker w2 receives t2. Then, the master node uploads and schedules the map input files (chunks), UC1 to UC5. Each worker downloads one such chunk, denoted DC1 to DC5. Node w4 fails (F1) while downloading map input chunk C4. As the BitDew scheduler periodically checks whether participants are still present, shortly after the failure node w4 is considered failed. This leads to the rescheduling of C4 to node w2. Node w3 fails (F2) while performing the map task M(C3). Then, chunk C3 is rescheduled to node w5. At F3, node w1 fails after having already performed the map task M(C1) and several reduce tasks: RF1,1, RF1,2 and RF1,5. The notation RFp,k refers to the reduce task which takes as input the intermediate
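The liveness check and rescheduling described above can be sketched as follows; the class and method names are hypothetical and do not correspond to the actual BitDew scheduler API, they only illustrate the heartbeat-timeout-then-reschedule logic.

import java.util.*;

// Illustrative sketch: workers send heartbeats; when a worker misses them for
// too long, the chunks it was assigned are put back in the scheduling queue.
class FailureHandlingSketch {
    static final long TIMEOUT_MS = 20_000;                        // liveness timeout

    final Map<String, Long> lastHeartbeat = new HashMap<>();      // worker -> last beat
    final Map<String, List<String>> assignment = new HashMap<>(); // worker -> chunks
    final Deque<String> pendingChunks = new ArrayDeque<>();       // chunks to (re)schedule

    void onHeartbeat(String worker) {
        lastHeartbeat.put(worker, System.currentTimeMillis());
    }

    // Called periodically by the master, like the BitDew scheduler's liveness check.
    void checkLiveness() {
        long now = System.currentTimeMillis();
        for (Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
             it.hasNext(); ) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() > TIMEOUT_MS) {                // worker considered failed
                List<String> lost = assignment.remove(e.getKey());
                if (lost != null) pendingChunks.addAll(lost);     // e.g. C4 back in the queue
                it.remove();
            }
        }
        // Pending chunks are then handed to live workers, as C4 was handed to w2.
    }
}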

Figure: Scalability evaluation on the WordCount application: the y axis presents the throughput in MB/s and the x axis the number of nodes varying from 1 to 512.

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 80 / 84

Page 89: Mapreduce Runtime Environments: Design, Performance, Optimizations

Fault-Tolerance Scenario

[Gantt chart: timeline of upload (U), download (D), map (M) and reduce (R) events on workers w1 to w5 and the master m, with the injected failures F1, F2 and F3 and the resulting chunk rescheduling.]

Figure: Fault-tolerance scenario (5 mappers, 2 reducers): crashes are injected during the download of a file (F1), the execution of the map task (F2) and during the execution of the reduce task on the last intermediate result (F3).

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 81 / 84

Page 90: Mapreduce Runtime Environments: Design, Performance, Optimizations

Hadoop vs MapReduce/BitDew on Extreme Scenarios

Objective

• Evaluate MapReduce/BitDew on Grid5K according to ten scenarios that mimic an execution on the Internet

• firewall, sabotage, host churn, massive failures, network failures, laggers, heterogeneity, etc.

Some results
• Host churn: we periodically kill the MapReduce worker process on one node and launch a new one on another node.


Evaluation: host churn

• We kill worker processes on one node and launch on another node to emulate volunteers joining and leaving the computation

• Hadoop setup (see the configuration sketch after this list):
  - HDFS chunk replica factor set to 3
  - DataNode heartbeat timeout set to 20 seconds

• Very small effect on BitDew
• Hadoop fails when the churn interval is < 30 seconds
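A sketch of the corresponding Hadoop-side configuration is given below; the property keys are the usual HDFS ones, but exact key names and the derived dead-node timeout vary across Hadoop versions, so treat the values as an illustration of the tuning rather than the exact settings used in the experiment.

import org.apache.hadoop.conf.Configuration;

// Sketch of the Hadoop setup for the churn scenario: raise the HDFS replication
// factor and shorten the interval after which a silent DataNode is declared dead.
public class ChurnSetup {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);          // keep 3 replicas of each chunk
        conf.setLong("dfs.heartbeat.interval", 1);  // DataNode heartbeat every second
        // The NameNode dead-node timeout is derived from the heartbeat interval and
        // the recheck interval (dfs.namenode.heartbeat.recheck-interval in recent
        // versions); both are lowered so the timeout is on the order of 20 seconds.
        conf.setInt("dfs.namenode.heartbeat.recheck-interval", 5_000); // milliseconds
        return conf;
    }
}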



Fig. 1: Scalability evaluation on the WordCount application: the y axis presents the throughput in MB/s and the x axis the number of nodes varying from 1 to 512.

When the number of nodes doubles, the size of the whole document collection doubles too. For 512 nodes, the benchmark processes 2.5TB of data and executes 50,000 Map and Reduce tasks. Figure 1 presents the throughput of the WordCount benchmark in MB/s versus the number of worker nodes. This result shows the scalability of our approach and illustrates the potential of using Desktop Grid resources to process a vast amount of data.

Fault Tolerance. In a typical desktop grid system such as BOINC [11] or Condor [1], in addition to user behavior such as shutting down the computer, running tasks are suspended when keyboard or mouse events are detected. CPU availability traces of participating nodes gathered from a real enterprise desktop grid [12] show that the single-node unavailability rate is about 40% on average and that up to 90% of the resources can be unavailable simultaneously, with catastrophic effects on running jobs. We emulate this kind of fault by killing worker processes on 25 worker nodes at different job progress points during the map phase.

We find that whenever we kill the running tasks on 25 nodes, the Hadoop JobTracker simply re-schedules the 50 killed map tasks, which prolongs the job makespan by about 6.5% compared to the normal case. In the second test, the JobTracker blindly re-executes all successfully completed and in-progress map tasks of the failed TaskTrackers. This means that the 25 chosen worker nodes do not contribute to the job's progress at all. On the other hand, BitDew-MapReduce avoids substantial unnecessary fault-tolerance work: because the intermediate outputs of completed map tasks are safely stored on the stable central storage server, the master does not re-execute the successfully completed map tasks of failed workers.

Host Churn. The independent arrival and departure of thousands or even millions of peer machines leads to host churn. We periodically kill the MapReduce worker process on one node and launch it on a new node to emulate the host churn effect. To increase the probability that the Hadoop job completes, we increase the HDFS chunk replica factor to 3 and set the DataNode heartbeat timeout to 20 seconds. Because the BitDew MapReduce runtime does not waste the work completed by failing workers, host churn has very little effect on the job completion time. On the other hand, as shown in Table 2, for host churn intervals of 5, 10 and 25 seconds, Hadoop jobs could only progress up to 80% of the map phase before failing.

Churn Interval (sec.)            5        10       25       30      50
Hadoop job makespan (sec.)       failed   failed   failed   2357    1752
BitDew-MR job makespan (sec.)    457      398      366      361     357

Fig. 2: Performance evaluation of host churn scenario

Network Connectivity. We set custom firewall and NAT rules on all the worker nodes to turn down some network links and observe how MapReduce jobs perform. In this test, Hadoop cannot even launch a job,

GdX has 356 dual-core nodes, so to measure the performance on 512 nodes we run two workers per node on 256 nodes.


→ Hadoop fails when the churn interval is < 30 seconds
• Network failure: the network is shut down for 25 seconds at different progress points


Evaluation: network fault tolerance

• Worker heartbeat set to 20 seconds on Hadoop and BitDew-MapReduce
• We induce 25-second offline window periods on 25 workers at different progress points

• The Hadoop JobTracker marks all temporarily disconnected nodes as dead
  - Their tasks must be re-executed

• BitDew MapReduce allows temporary offline periods



... and reschedules its tasks to other nodes, which significantly delays the total job execution time.

TABLE II. PERFORMANCE EVALUATION OF HOST CHURN SCENARIO

Churn Interval (sec.)            5        10       25       30      50
Hadoop job makespan (sec.)       failed   failed   failed   2357    1752
BitDew-MR job makespan (sec.)    457      398      366      361     357

As in the fault-tolerance scenario, the BitDew-MR runtime does not waste the work completed by worker nodes that eventually fail, so host churn has very little effect on job performance.

D. Network Connectivity
In this test, Hadoop could not even launch a job because HDFS needs inter-communication between DataNodes. On the other hand, BitDew-MR works properly and its performance is almost the same as the baseline obtained under normal network conditions.

E. CPU Heterogeneity and Straggler
As Figure 5 shows, Hadoop's dynamic task scheduling works very well when worker nodes have different classes of CPU: nodes from the fast group process 20 tasks on average while nodes from the slow group get about 11. Although BitDew-MR uses the same scheduling heuristic, it does not perform as well, as shown in Figure 6: the nodes from the two groups get almost the same number of tasks. The reason is that we maintain only one copy of each chunk on the worker nodes, because we assume there is no inter-worker communication or data transfer. Thus, although fast nodes spend half as much time processing their local chunks as slow nodes, they still need a long time to download new chunks before launching additional map tasks.

TABLE III. PERFORMANCE EVALUATION OF STRAGGLERS SCENARIO

Straggler Number                 1      2      5      10
Hadoop job makespan (sec.)       477    481    487    490
BitDew-MR job makespan (sec.)    211    245    267    298

Figure 5. Hadoop map task distribution over 50 workers.

Figure 6. BitDew-MapReduce map task distribution over 50 workers.

F. Network Failure Tolerance
In case of network failures, as shown in Table IV, the Hadoop JobTracker simply marks all temporarily disconnected nodes as dead, although they are still running tasks, and blindly removes all the tasks done by these nodes from the successful task list. Re-executing these tasks significantly prolongs the job makespan. Meanwhile, BitDew-MR clearly allows workers to go temporarily offline without any performance penalty.

The key idea behind map-task re-execution avoidance, which makes BitDew-MR outperform Hadoop under machine and network failure scenarios, is to allow reduce tasks to re-download map outputs that have already been uploaded to the central stable storage rather than re-generate them. Another benefit of this method is that it does not introduce any extra overhead, since all the intermediate data are always transferred through the central storage whether or not the fault-tolerance strategy is used. However, the overhead of data transfer to the central stable storage makes desktop grids more suitable for applications which generate little intermediate data, such as Word Count and Distributed Grep. To mitigate the data transfer overhead, we can use a storage cluster with large aggregated bandwidth. Emerging public cloud storage services also provide an alternative solution, which can be included in our future work.
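The recovery rule can be sketched as follows; the interfaces and identifiers are hypothetical (not the BitDew-MapReduce API) and only illustrate the policy of downloading an already uploaded map output instead of re-executing the map task.

// Illustrative sketch: a reduce task first looks for the map output on the
// central stable storage and only asks for a map re-execution if the output
// was never uploaded, so completed map work is never thrown away.
class MapOutputRecoverySketch {
    interface StableStorage {                       // central storage server
        boolean has(String mapOutputId);
        byte[] download(String mapOutputId);
    }
    interface Scheduler {                           // master-side rescheduling
        byte[] reExecuteMap(String chunkId);        // fallback: rerun the map task
    }

    byte[] fetchMapOutput(String chunkId, int reducerId,
                          StableStorage storage, Scheduler scheduler) {
        String id = "map-" + chunkId + "-part-" + reducerId;
        if (storage.has(id)) {
            return storage.download(id);            // worker failed, but its output survives
        }
        return scheduler.reExecuteMap(chunkId);     // output lost: recompute the map task
    }
}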

VI. RELATED WORK
There have been many studies on improving MapReduce performance [17, 18, 19, 28] and exploring ways to support MapReduce on different architectures [16, 22, 23, 26]. A closely related work is MOON [23], which stands for MapReduce On Opportunistic eNvironments. Unlike our work, MOON limits the system scale to a campus area and assumes the underlying resources are hybrid, organized by provisioning a fixed fraction of dedicated stable computers to supplement the volatile personal computers, which is much more difficult to achieve in Internet-scale desktop grids. The main idea of MOON is to prioritize new tasks and important data blocks and assign them to the dedicated stable machines so that jobs keep progressing smoothly even when many volatile PCs join and leave the system dynamically. MOON also makes some tricky

(a) Hadoop distribution


(b) BitDew-MapReduce distribution

Fig. 3: Map task distribution over 50 workers divided in two groups according to their CPU clock rate

because it needs inter-communication between worker nodes, while BitDew MapReduce is almost independent of such network conditions.

CPU Heterogeneity and Straggler. We emulate CPU heterogeneity by reducing the CPU frequency of half the worker nodes to 50% with wrekavoc. We emulate a straggler (a machine that takes an unusually long time to complete one of the last few tasks in the computation) by reducing the target nodes' CPU frequency to 10%. Both Hadoop and BitDew MapReduce can launch backup tasks on idle machines when a MapReduce operation is close to completion to alleviate the straggler problem. However, as shown in Figure 3, BitDew MapReduce is slower than Hadoop in this process because only one copy of a chunk is available on the worker nodes at a given time. Hence, the node performing the backup task first has to download the chunk before launching the map task.
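The backup-task heuristic both runtimes rely on can be sketched as follows; the names and the 90% / 2x thresholds are illustrative assumptions, not values taken from Hadoop or BitDew-MapReduce.

import java.util.*;

// Illustrative sketch: near the end of the map phase, duplicate the tasks whose
// estimated remaining time is much larger than average. In BitDew-MapReduce the
// idle node must first download the chunk, which is the extra cost discussed above.
class BackupTaskSketch {
    static class RunningTask {
        String chunkId;
        double progress;        // 0.0 .. 1.0, reported by the worker
        long startMillis;
    }

    // Returns the chunks whose running tasks should get a backup copy.
    static List<String> pickBackupCandidates(List<RunningTask> running,
                                             double phaseProgress) {
        List<String> candidates = new ArrayList<>();
        if (phaseProgress < 0.9) return candidates;          // only near completion
        long now = System.currentTimeMillis();
        for (RunningTask t : running) {
            double rate = t.progress / Math.max(1, now - t.startMillis);
            double estRemaining = (1.0 - t.progress) / Math.max(rate, 1e-12);
            if (estRemaining > 2.0 * averageRemaining(running, now)) {
                candidates.add(t.chunkId);                   // likely straggler
            }
        }
        return candidates;
    }

    static double averageRemaining(List<RunningTask> running, long now) {
        double sum = 0;
        for (RunningTask t : running) {
            double rate = t.progress / Math.max(1, now - t.startMillis);
            sum += (1.0 - t.progress) / Math.max(rate, 1e-12);
        }
        return sum / Math.max(1, running.size());
    }
}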

Network Fault Tolerance. To work in a Desktop Grid environment, the runtime system must be resilient to temporary network failures. We set the worker heartbeat timeout to 20 seconds for both Hadoop and BitDew MapReduce, and inject a 25-second offline window on 25 worker nodes at different job progress points of the map phase. In this test, the Hadoop JobTracker simply marks all temporarily disconnected nodes as "dead" while they are still running tasks. The tasks performed by these nodes are blindly removed from the successful task list and must be re-executed, significantly prolonging the job makespan, as shown in Table 1. Meanwhile, BitDew-MapReduce clearly allows workers to go temporarily offline without any performance penalty.

Job progress at the crash point     12.5%   25%    37.5%   50%    62.5%   75%    87.5%   100%
Hadoop: re-executed map tasks       50      100    150     200    250     300    350     400
Hadoop: job makespan (sec.)         425     468    479     512    536     572    589     601
BitDew-MR: re-executed map tasks    0       0      0       0      0       0      0       0
BitDew-MR: job makespan (sec.)      246     249    243     239    254     257    274     256

Table 1: Performance evaluation of the network fault tolerance scenario

4.2 Evaluation of the incremental BitDew MapReduce

To evaluate the performance of incremental MapReduce, we compare the time to process the full data set with the time to update the result after modifying a part of the dataset. The experiment is configured as follows: the benchmark is the WordCount application running with 10 mappers and 5 reducers, and the dataset is 3.2 GB split into 200 chunks. Table 2 presents the time to update the result, relative to the original computation time, when a varying fraction of the dataset is modified. As expected, the less the dataset is modified, the less time it takes to update the result: it takes 27% of the original computation time to update the result when 20% of the data chunks are modified. However, there is an overhead due to the fact that the
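The incremental update idea evaluated here can be sketched as follows; the class and the content-hash test are illustrative assumptions, not the actual implementation: cache each chunk's map output, re-run Map only for chunks whose content changed, and re-run Reduce over the mix of cached and fresh outputs.

import java.util.*;

// Illustrative sketch of incremental MapReduce over a chunked dataset.
class IncrementalSketch<K, V> {
    interface ChunkMapper<MK, MV> { Map<MK, MV> map(byte[] chunk); }

    private final Map<String, Map<K, V>> cache = new HashMap<>(); // chunkId -> map output
    private final Map<String, String> lastHash = new HashMap<>(); // chunkId -> content hash

    List<Map<K, V>> update(Map<String, byte[]> chunks, ChunkMapper<K, V> mapper) {
        List<Map<K, V>> outputs = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : chunks.entrySet()) {
            String hash = Integer.toHexString(Arrays.hashCode(e.getValue()));
            if (!hash.equals(lastHash.get(e.getKey()))) {     // chunk was modified
                cache.put(e.getKey(), mapper.map(e.getValue()));
                lastHash.put(e.getKey(), hash);
            }
            outputs.add(cache.get(e.getKey()));               // cached or freshly computed
        }
        return outputs;                                       // fed to the Reduce phase
    }
}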

• Hadoop marks all temporarily disconnected nodes as failed → tasks are re-executed

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 82 / 84

Page 91: Mapreduce Runtime Environments: Design, Performance, Optimizations

Section 5

Conclusion

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 83 / 84

Page 92: Mapreduce Runtime Environments: Design, Performance, Optimizations

Conclusion

• MapReduce Runtimes
  • Attractive programming model for data-intensive computing
  • Many research issues concerning execution runtimes (heterogeneity, stragglers, scheduling)
  • Many others: security, QoS, iterative MapReduce, ...
• MapReduce beyond the Data Center:
  • enable large-scale data-intensive computing: Internet MapReduce
• Want to learn more?
  • Home page for the Mini-Curso: http://graal.ens-lyon.fr/~gfedak/pmwiki-lri/pmwiki.php/Main/MarpReduceClass
  • the MapReduce Workshop@HPDC (http://graal.ens-lyon.fr/mapreduce)
  • Desktop Grid Computing ed. C. Cerin and G. Fedak – CRC Press
  • our websites: http://www.bitdew.net and http://www.xtremweb-hep.org

G. Fedak (INRIA, France) MapReduce Salvador da Bahia, Brazil, 22/10/2013 84 / 84