Page 1: Only Aggressive Elephants are Fast Elephants - WPIweb.cs.wpi.edu/~cs525/s13-MYE/lectures/presentations... · Only Aggressive Elephants are Fast Elephants Jun Fan, Vijay Sukhadeve

Only Aggressive Elephants are Fast Elephants

Jun Fan, Vijay Sukhadeve

Computer Science Dept.

Worcester Polytechnic Institute (WPI)

Introduction/Motivation

A typical analyst's problem: analyzing web logs

Source IP

Visit date

Ad revenue

MapReduce jobs must be modified to select the records of interest

Different filter criteria

Introduction/Motivation

The approach: modify the job as we go along

Slow query run times: not acceptable

Introduction/Motivation

Use Trojan Indexes (Hadoop++)

Expensive index creation

The same old question: which attribute to index?

Note: a Trojan Index is a covering index consisting of a sparse directory over the sorted split data.
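
To make the note concrete, here is a minimal Java sketch of a sparse directory over sorted data. It is not Hadoop++ code: the class name SparseDirectorySketch, the integer keys, and the partition size are assumptions chosen for illustration. The directory keeps only the first key of each partition, so a range lookup searches the small directory first and then scans the data from the chosen partition onward.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a sparse directory holding the first key of each partition of sorted data. */
final class SparseDirectorySketch {
    private final int[] sortedKeys;   // sorted data, one key per record
    private final int[] firstKeys;    // directory: first key of each partition
    private final int partitionSize;  // records per partition (assumed parameter)

    SparseDirectorySketch(int[] sortedKeys, int partitionSize) {
        this.sortedKeys = sortedKeys;
        this.partitionSize = partitionSize;
        int partitions = (sortedKeys.length + partitionSize - 1) / partitionSize;
        this.firstKeys = new int[partitions];
        for (int p = 0; p < partitions; p++) {
            firstKeys[p] = sortedKeys[p * partitionSize];
        }
    }

    /** Last partition whose first key is strictly below low (or 0): the scan must start there. */
    int startPartition(int low) {
        int lo = 0, hi = firstKeys.length - 1, start = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (firstKeys[mid] < low) {
                start = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return start;
    }

    /** Returns all keys in [low, high], reading only from the start partition onward. */
    List<Integer> rangeScan(int low, int high) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = startPartition(low) * partitionSize; i < sortedKeys.length; i++) {
            if (sortedKeys[i] > high) break;  // sorted data: nothing further can match
            if (sortedKeys[i] >= low) result.add(sortedKeys[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        SparseDirectorySketch index =
                new SparseDirectorySketch(new int[] {1, 3, 5, 5, 5, 8, 9, 12}, 3);
        System.out.println(index.rangeScan(5, 8));
    }
}
```

The directory search starts at the last partition whose first key is strictly below the lower bound, because duplicates of that bound may begin in the preceding partition.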

Idea

HDFS keeps copies (3 by default) of the data

Failover

All copies have the same physical layout

Instead: use a different sort order and a different index in each replica. How? (Use a different vertical layout)

Obviously: use configurations

Physical design algorithm (intelligent)

HAIL

Changes the upload pipeline of HDFS

Creates a different clustered index on each data block replica

Sorting and indexing are done in memory

One replica: clustered index on sourceIP

Another replica: clustered index on visitDate
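
The idea of giving each replica its own sort order can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not HAIL's upload code: the class ReplicaSortSketch, the use of string arrays for records, and the zero-padded dates (so that string order matches date order) are all hypothetical choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: one copy of a block per replica, each sorted on a different attribute. */
final class ReplicaSortSketch {
    /**
     * Returns one sorted copy of the block per entry in sortColumns.
     * Attributes are compared as strings, which orders zero-padded ISO dates correctly.
     */
    static List<List<String[]>> buildReplicas(List<String[]> block, int[] sortColumns) {
        List<List<String[]>> replicas = new ArrayList<List<String[]>>();
        for (final int column : sortColumns) {
            List<String[]> replica = new ArrayList<String[]>(block);  // in-memory copy of the block
            replica.sort(Comparator.comparing((String[] row) -> row[column]));
            replicas.add(replica);
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<String[]> block = Arrays.asList(
                new String[] {"170.122.19.7", "1994-04-27"},
                new String[] {"161.110.31.8", "2003-07-16"},
                new String[] {"166.103.36.25", "1991-02-14"});
        // Replica 0 is sorted on sourceIP (column 0), replica 1 on visitDate (column 1).
        List<List<String[]>> replicas = buildReplicas(block, new int[] {0, 1});
        System.out.println(replicas.get(0).get(0)[0]);
        System.out.println(replicas.get(1).get(0)[1]);
    }
}
```

Every replica still holds the same records, so failover is unaffected; only the order within the block differs.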


Figure 1: The HAIL upload pipeline


Figure 2: HAIL data column index

HAIL

Considerations

Never split a row across two blocks (unlike HDFS)

Convert each data block into PAX (Partition Attributes Across) layout

Create a sparse clustered B+-tree with a single root directory (no multi-level tree), because block sizes are less than 1 GB

Do not flush data and checksums to disk as they arrive; sorting and indexing are done in memory first
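
As a rough illustration of the PAX conversion mentioned above, the following Java sketch transposes the rows of one block into per-attribute arrays and rebuilds selected attributes of a record on demand. The class PaxSketch and the string-array representation are assumptions for this example; HAIL's actual binary block format is not shown in the slides.

```java
import java.util.Arrays;

/** Illustrative only: row layout to a PAX-style column layout within one block, and back. */
final class PaxSketch {
    /** Stores each attribute of the block contiguously: columns[a][r] = rows[r][a]. */
    static String[][] toPax(String[][] rows, int attributeCount) {
        String[][] columns = new String[attributeCount][rows.length];
        for (int r = 0; r < rows.length; r++) {
            for (int a = 0; a < attributeCount; a++) {
                columns[a][r] = rows[r][a];
            }
        }
        return columns;
    }

    /** Rebuilds only the projected attributes of record rowId in row layout. */
    static String[] project(String[][] columns, int rowId, int[] projection) {
        String[] tuple = new String[projection.length];
        for (int i = 0; i < projection.length; i++) {
            tuple[i] = columns[projection[i]][rowId];
        }
        return tuple;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"170.122.19.7", "0eqt.html", "1994-4-27"},
            {"161.110.31.8", "0gzhwrmzfd.html", "2003-7-16"}
        };
        String[][] columns = toPax(rows, 3);
        System.out.println(Arrays.toString(columns[2]));
        System.out.println(Arrays.toString(project(columns, 1, new int[] {0})));
    }
}
```

Because all records of a block stay inside that block, a projection touches only the columns it needs while the block remains a self-contained unit for replication.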

CONCLUSION

The paper presents HAIL (Hadoop Aggressive Indexing Library). HAIL improves the upload pipeline of HDFS to create different clustered indexes on each replica. As a consequence, each HDFS block is available in at least three different sort orders, each with a different index. HAIL also works for a larger number of replicas.

A major advantage of HAIL is that the long upload and indexing times that previous systems required are no longer needed. Hadoop++ created indexes per logical HDFS block, whereas HAIL creates a different index for each physical replica.

HAIL was compared experimentally with Hadoop and Hadoop++ using different datasets and a number of different clusters. The results demonstrate HAIL's high efficiency: users can upload their datasets up to 1.6x faster than with Hadoop and run jobs up to 68x faster than with Hadoop.

HAIL

Research Challenges

Create indexes during upload

Modify MapReduce to exploit the sort orders and indexes at run time


QUESTIONS?


Bob’s Perspective

Hadoop’s Perspective

HAIL’s Perspective

Query Pipeline

Bob’s Perspective

Bob writes a MapReduce job, including a job configuration class, a map function, and a reduce function.

He specifies the HailInputFormat to enable his MapReduce job to read HAIL blocks.

He annotates his map function to specify the selection predicate and the projected attributes required by his MapReduce job.

Bob’s Perspective

SELECT sourceIP FROM UserVisits WHERE visitDate BETWEEN '1999-01-01' AND '2000-01-01';

@HailQuery(filter="@3 between(1999-01-01,2000-01-01)", projection={@1})
void map(Text key, Text v) { ... }

Here @3 = visitDate and @1 = sourceIP: each number is the position of the attribute in UserVisits.
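
How a framework could read such an annotation at run time can be sketched with plain Java reflection. This is a hypothetical stand-in, not HAIL's implementation: the annotation type HailQuerySketch, its int[] projection (the slide's {@1} notation is not valid Java as written), and the helper findQuery are all assumptions made for this example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Arrays;

/** Hypothetical stand-in for the slide's annotation; attribute positions are 1-based. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface HailQuerySketch {
    String filter();
    int[] projection();
}

/** Illustrative only: a job class whose map method carries the filter and projection. */
final class AnnotatedJobSketch {
    @HailQuerySketch(filter = "@3 between(1999-01-01,2000-01-01)", projection = {1})
    void map(String key, String value) {
        // The body would see only qualifying records and projected attributes.
    }

    /** Returns the annotation of the first annotated method of jobClass, or null if none. */
    static HailQuerySketch findQuery(Class<?> jobClass) {
        for (Method method : jobClass.getDeclaredMethods()) {
            HailQuerySketch query = method.getAnnotation(HailQuerySketch.class);
            if (query != null) {
                return query;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        HailQuerySketch query = findQuery(AnnotatedJobSketch.class);
        System.out.println(query.filter() + " -> " + Arrays.toString(query.projection()));
    }
}
```

The runtime retention policy is what makes the annotation visible to reflection; with the default class retention, findQuery would return null.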

Bob’s Perspective

If Bob does not specify filter predicates, MapReduce performs a full scan on the HAIL blocks, as in standard Hadoop.

At query time, HAIL checks whether an index exists on the filter attribute using the Index Metadata of a data block.

Bob’s Perspective

Bob uses a HailRecord object as the input value in the map function to read the projected attributes directly, without splitting the record into attributes.

He does not have to filter out incoming records, because HAIL handles this automatically via the HailQuery annotation.

Bob’s Perspective

Standard Hadoop map function:

void map(Text key, Text v) {
    // split the record into attributes, then filter on visitDate by hand
    String[] attributes = v.toString().split(",");
    if (DateUtils.isBetween(attributes[2], "1999-01-01", "2000-01-01"))
        output(attributes[0], null);
}

HAIL map function:

void map(Text key, HailRecord v) {
    // the HailQuery annotation has already applied the filter and the projection
    output(v.getInt(1), null);
}

Hadoop’s Perspective

JobClient copies all the resources needed to run the MapReduce job.

JobClient breaks the input into smaller pieces called input splits using the InputFormat; each input split maps to a distinct HDFS block.

JobClient retrieves, for each input split, all datanode locations having a replica of that block.

JobClient submits the job to the JobTracker with a set of splits to process.

Hadoop’s Perspective

JobTracker creates a map task for each input split and allocates the map task to a TaskTracker.

The map task uses a RecordReader UDF to read its input data block from the closest datanode.

The RecordReader breaks the block into records and calls the map function for each record; the output is written back to HDFS.

HAIL’s Perspective

HailInputFormat uses a more elaborate splitting policy, called HailSplitting.

HailSplitting maps one input split to several data blocks whenever a MapReduce job performs an index scan over its input.

This allows HAIL to reduce the number of map tasks to process and hence the aggregate cost of initializing and finalizing map tasks.

HAIL’s Perspective

To improve data locality, HailSplitting first clusters the blocks of the input of an incoming MapReduce job by locality.

HailSplitting produces as many collections of blocks as there are datanodes storing at least one block of the given input.

For each collection of blocks, HailSplitting creates as many input splits as each TaskTracker has map slots.
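
The splitting policy described above can be approximated by the following Java sketch. It is a simplified illustration, not the HailSplitting source: the class HailSplittingSketch, the block-to-datanode map (one chosen datanode per block), and the round-robin distribution within a datanode are assumptions made here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: group blocks by datanode, then cut each group into at most mapSlots splits. */
final class HailSplittingSketch {
    static List<List<Integer>> buildSplits(Map<Integer, String> blockToNode, int mapSlotsPerNode) {
        if (mapSlotsPerNode < 1) {
            throw new IllegalArgumentException("mapSlotsPerNode must be at least 1");
        }
        // Step 1: cluster block ids by the datanode that stores them (sorted for determinism).
        Map<Integer, String> ordered = new TreeMap<Integer, String>(blockToNode);
        Map<String, List<Integer>> blocksByNode = new TreeMap<String, List<Integer>>();
        for (Map.Entry<Integer, String> entry : ordered.entrySet()) {
            blocksByNode.computeIfAbsent(entry.getValue(), node -> new ArrayList<Integer>())
                        .add(entry.getKey());
        }
        // Step 2: per datanode, spread its blocks round-robin over at most mapSlotsPerNode splits.
        List<List<Integer>> splits = new ArrayList<List<Integer>>();
        for (List<Integer> blocks : blocksByNode.values()) {
            int splitCount = Math.min(mapSlotsPerNode, blocks.size());
            List<List<Integer>> nodeSplits = new ArrayList<List<Integer>>();
            for (int s = 0; s < splitCount; s++) {
                nodeSplits.add(new ArrayList<Integer>());
            }
            for (int i = 0; i < blocks.size(); i++) {
                nodeSplits.get(i % splitCount).add(blocks.get(i));
            }
            splits.addAll(nodeSplits);
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<Integer, String> blockToNode = new TreeMap<Integer, String>();
        for (int block = 0; block < 7; block++) {
            blockToNode.put(block, "DN1");
        }
        System.out.println(buildSplits(blockToNode, 1).size());
        System.out.println(buildSplits(blockToNode, 2));
    }
}
```

With seven blocks on one datanode and a single map slot, the sketch yields one split covering all seven blocks, so the per-task start-up cost is paid once instead of seven times.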

HAIL’s Perspective

HAIL creates a map task per resulting input split and schedules these map tasks to the replicas having the matching index.

If no relevant index exists, HAIL scheduling falls back to standard Hadoop scheduling, optimizing for data locality only.
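
A minimal Java sketch of this scheduling preference follows. The class ReplicaChoiceSketch, the map from datanode name to indexed attribute position, and the tie-breaking rules are assumptions for illustration; the slides do not describe HAIL's scheduler at this level of detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: pick the datanode to read from, preferring a replica with a matching index. */
final class ReplicaChoiceSketch {
    /**
     * replicaIndexes maps each datanode holding a replica (at least one is assumed) to the
     * attribute position its clustered index covers. A host whose index matches filterAttribute
     * wins, with localHost preferred among matches. If no index matches, the choice falls back
     * to locality only: localHost if it holds a replica, otherwise the first replica listed.
     */
    static String chooseHost(Map<String, Integer> replicaIndexes, int filterAttribute, String localHost) {
        String firstMatch = null;
        for (Map.Entry<String, Integer> replica : replicaIndexes.entrySet()) {
            if (replica.getValue() == filterAttribute) {
                if (replica.getKey().equals(localHost)) {
                    return localHost;
                }
                if (firstMatch == null) {
                    firstMatch = replica.getKey();
                }
            }
        }
        if (firstMatch != null) {
            return firstMatch;
        }
        if (replicaIndexes.containsKey(localHost)) {
            return localHost;
        }
        return replicaIndexes.keySet().iterator().next();
    }

    public static void main(String[] args) {
        Map<String, Integer> replicas = new LinkedHashMap<String, Integer>();
        replicas.put("DN3", 3);  // replica indexed on attribute @3
        replicas.put("DN5", 1);  // replica indexed on attribute @1
        replicas.put("DN7", 2);  // replica indexed on attribute @2
        System.out.println(chooseHost(replicas, 3, "DN5"));
        System.out.println(chooseHost(replicas, 4, "DN5"));
    }
}
```

In the first call the index on @3 outweighs locality, so the remote replica is chosen; in the second no index matches, so the local replica is read as in standard Hadoop.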

HAIL’s Perspective

The HailRecordReader is responsible for retrieving the records that satisfy the selection predicate of a MapReduce job. Those records are then passed to the map function.

For example, suppose the job needs all records with visitDate between(1999-01-01, 2000-01-01).

HAIL’s Perspective

First, open an input stream to the block having the required index, using the getHostsWithIndex method of each BlockLocation to choose the closest datanode with the required index.

Read the index entirely into main memory (typically a few KB) to perform an index lookup.

HAIL’s Perspective

Reconstruct the projected attributes of qualifying tuples from PAX to row layout.

Call the map function for each qualifying tuple.
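
The record reader steps (index lookup, reading the qualifying range, rebuilding projected attributes, calling map) can be sketched as follows. This is an illustration under simplifying assumptions, not the HailRecordReader: the class RecordReaderSketch is hypothetical, a binary search over the sorted filter column stands in for the lookup in the sparse index, and attributes are plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative only: index scan over a PAX block sorted on the filter attribute. */
final class RecordReaderSketch {
    /**
     * columns[a][r] holds attribute a of record r; columns[filterColumn] is sorted ascending.
     * Calls map once per record whose filter attribute lies in [low, high], passing only
     * the projected attributes. Returns the number of qualifying records.
     */
    static int scan(String[][] columns, int filterColumn, String low, String high,
                    int[] projection, Consumer<String[]> map) {
        String[] keys = columns[filterColumn];
        // Lookup: binary search for the first record with key >= low.
        int lo = 0, hi = keys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[mid].compareTo(low) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        int qualifying = 0;
        // Scan forward while the key stays within the upper bound.
        for (int r = lo; r < keys.length && keys[r].compareTo(high) <= 0; r++) {
            // Reconstruct the projected attributes in row layout, then call map.
            String[] tuple = new String[projection.length];
            for (int i = 0; i < projection.length; i++) {
                tuple[i] = columns[projection[i]][r];
            }
            map.accept(tuple);
            qualifying++;
        }
        return qualifying;
    }

    public static void main(String[] args) {
        String[][] columns = {
            {"10.0.0.3", "10.0.0.1", "10.0.0.2", "10.0.0.4"},          // sourceIP
            {"1998-05-01", "1999-03-10", "1999-11-30", "2001-07-04"}   // visitDate, sorted
        };
        List<String> output = new ArrayList<String>();
        int count = scan(columns, 1, "1999-01-01", "2000-01-01", new int[] {0},
                tuple -> output.add(tuple[0]));
        System.out.println(count + " " + output);
    }
}
```

Only the records inside the date range reach the map callback, and each arrives with just the projected sourceIP attribute.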


Query Pipeline

Experiments

Six different clusters. Each one is a physical 10-node cluster. Each node has one 2.66 GHz quad-core Xeon processor running 64-bit openSUSE Linux 11.1, 4x4 GB of main memory, and 6x750 GB SATA hard disks.

Systems compared: (1) Hadoop, (2) Hadoop++, and (3) HAIL.


Experiments

The Synthetic dataset is well suited for binary representation.


Experiments

$T_{\text{ideal}} = \frac{\#\text{MapTasks}}{\#\text{ParallelMapTasks}} \times \text{Avg}(T_{\text{RecordReader}})$

$T_{\text{overhead}} = T_{\text{end-to-end}} - T_{\text{ideal}}$
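
A small worked example may help with the two quantities above. It reads the ideal time as the number of map-task waves (#MapTasks divided by #ParallelMapTasks) multiplied by the average RecordReader time, which is the reading that yields a time. The class OverheadSketch and all input numbers are hypothetical, not measurements from the experiments.

```java
/** Illustrative only: the ideal-time and overhead formulas with made-up inputs. */
final class OverheadSketch {
    /** T_ideal = (#MapTasks / #ParallelMapTasks) * Avg(T_RecordReader). */
    static double idealTime(int mapTasks, int parallelMapTasks, double avgRecordReaderSec) {
        return ((double) mapTasks / parallelMapTasks) * avgRecordReaderSec;
    }

    /** T_overhead = T_end-to-end - T_ideal. */
    static double overhead(double endToEndSec, double idealSec) {
        return endToEndSec - idealSec;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 3200 map tasks, 40 running in parallel, 0.5 s per RecordReader.
        double ideal = idealTime(3200, 40, 0.5);
        System.out.println(ideal);                   // 80 waves * 0.5 s = 40.0
        System.out.println(overhead(600.0, ideal));  // 600.0 - 40.0 = 560.0
    }
}
```

In this made-up case almost all of the 600 seconds is overhead, which is the kind of gap that motivates scheduling fewer, larger map tasks.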

Experiments

HAIL-1Idx creates an index on the same attribute for all three replicas.


THANK YOU

Questions!

Only Aggressive Elephants are Fast Elephants

Vijay Sukhadeve and Fan Jun


Motivation(...))

fast querying

One Day in Bob's Life

bob$ head UserVisits.dat

170.122.19.7|0eqt.html|1994-4-27|336.43|NetAnts/1.2|MLI|MLI-HE|Hertzsprung-Russell|7
161.110.31.8|0gzhwrmzfd.html|2003-7-16|176.15|MojeekBot|GUY|GUY-ZQ|astrochemistry|7
166.103.36.25|0fwrbsuzn.html|1991-2-14|182.915|KAIST AITrc|CAF|CAF-AZ|molecules|1
175.101.2.40|0nfxnbnuu.html|1988-3-24|130.32|Mozilla/5.0|MKD|MKD-RQ|A&A|7
156.100.40.7|0iczechptml|1983-12-5|39.906|Mozilla/5.0|MOZ|MOZ-RF|cosmology:|5
174.118.37.24|0setzifibmk.html|1976-10-11|71.28|Robozilla/1.|MLT|MLT-HT|GALAXIES|5
173.125.36.22|0hhzmry.html|1974-8-27|15.58|Incutio HttpClient|MTQ|MTQ-RE|microwave|2
164.110.6.28|0vpkgsvohgkxdz.html|2004-5-1|223.571|Pagestacker|ASM|ASM-WN|function|4
161.107.40.29|0ctxwdzsbm.html|1975-1-20|273.854|Privoxy/3.0|BHR|BHR-MD|individual|6
158.118.25.1|0sfwtqxjwaavk.html|1977-5-23|160.077|Toutatis x.x-|JOR|JOR-TW|nebulae|5
169.113.12.45|0vxwzb.html|2001-1-20|351.129|Wavefire/0.8-dev|TLS|TLS-PW|spots|2
155.106.15.8|0vlmwypbhhlucr.html|1975-1-19|239.101|OutfoxBot/0.x|MDG|MDG-MG|radial|5
150.101.22.8|0aoclhkqqzcas.html|1973-4-30|220.202|MSIE 4.0|EGY|EGY-KB|occultations|8
175.109.8.39|0jsup.html|1983-9-8|352.6177|RSSOwl/1.2.3|ATA|ATA-BV|interferometers|7
163.111.9.17|0vefhgquwbzgh.html|1976-6-18|268.3|Mozilla/4.05|SOM|SOM-IV|surveys|8
160.122.23.10|0dnbzhr.html|1990-5-24|237.3636|NextGenSearchBot|VUT|VUT-ZL|infrared|6
160.106.6.40|0rwjyocdra.html|1981-10-29|94.965|libwww-perl/5.6|VIR|VIR-FI|Scuti|5
155.110.37.21|0az567.html|1979-1-27|375.101|Jakarta-HttpClient|LBN|LBN-XS|MEDIUM|6

[Callout labels on the slide: SOURCE IP, URL, DURATION]

Uploading UserVisits.dat (2.6 TB) to HDFS...

30 minutes later...

Uploading 2.6 TB of UserVisits.dat completed

[Animation: MapReduce job progress bars (map progress, reduce progress)]

MapReduce Job 4711 canceled

Creating Trojan Index on Attribute 0...

[Figure: Trojan Index Query Time. Job runtime [s] on 100 nodes, Hadoop vs. Hadoop++ (Trojan Index). Source: J. Dittrich, J.-A. Quiane-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without it Even Noticing). PVLDB 2010.]

MapReduce Job 1337 canceled

[Figure: Trojan Index Creation Time. Runtime [seconds] on 10, 50, and 100 nodes for Hadoop, HadoopDB, Hadoop++ (256MB), and Hadoop++ (1GB), broken down into (I)ndex Creation, (C)o-Partitioning, and Data (L)oading. Source: J. Dittrich, J.-A. Quiane-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without it Even Noticing). PVLDB 2010.]

Hadoop Aggressive Indexing Library

[Figure sequence: HDFS + MapReduce. Labels on the slides: Bob; datanodes DN1 ... DNn; horizontal partitions; HDFS blocks 64MB (default); Failover; Map Phase with map tasks M1 ... M9; map(row) -> set of (ikey, value); map(docID, document) -> set of (term, docID).]

HDFS + MapReduce vs. HAIL + MapReduce

[Figure sequence: HAIL + MapReduce. Labels on the slides: Bob; datanodes DN1 ... DNn; horizontal partitions; HDFS blocks 64MB (default); Failover.]

Details

HAIL Upload Pipeline

[Figure: the HAIL upload pipeline. Components: Bob, HAIL Client, Datanodes DN1 and DN3, HDFS Namenode (block directory, HAIL replica directory), Network. Step labels: preprocess, get location, convert, PAX Block build, upload, check, acknowledge, forward, reassemble, append, register, notify. Data labels: packets PCK 1 and PCK 2, ACK 1 and ACK 2, Block Metadata, Index Metadata, Index A, Index C, HAIL Block.]

HAIL Query Pipeline

[Figure: the HAIL query pipeline, shown against the Hadoop MapReduce pipeline. Components: Job Client (Split Phase), Job Tracker (Scheduler), Task Tracker (Map Phase), and datanodes holding replicas of block42 with indexes A, B, and C. Steps: (1) write job, (2) run job, (3) send splits[], (4) choose computing node, (5) allocate map task, (6) read block42, (7) store output.]

Bob's Perspective: HAIL Annotation

@HailQuery(filter="@3 between(1999-01-01, 2000-01-01)", projection={@1})
void map(Text k, HailRecord v) {
    output(v.getInt(1), null);
}

System's Perspective

Split Phase (Job Client):

for each block_i in input {
    locations = block_i.getHostWithIndex(@3);
    splitBuilder.add(locations, block_i);
}
splits[] = splitBuilder.result;

Scheduler (Job Tracker):

for each split_i in splits {
    allocate split_i to closest DataNode storing block_i
}

Map Phase (Task Tracker), Record Reader:

- Perform Index Scan
- Perform Post-Filtering
- for each Record invoke map(HailRecord)

Experiments

Upload Times

[Figure: Upload Time. Upload time [sec] vs. number of created indexes (0 to 3) for Hadoop, Hadoop++, and HAIL, all with 3 replicas.]

[Figure: Replica Scalability. Upload time [sec] vs. number of created replicas (3 (default), 5, 6, 7, 10) for Hadoop and HAIL, with the Hadoop upload time with 3 replicas marked for reference.]

[Figure: Scale-Out. Upload time [sec] for the Syn and UV datasets on 10, 50, and 100 nodes, Hadoop vs. HAIL.]

Query Times

[Figure: Individual Jobs: Weblog, RecordReader. RecordReader runtime [ms] for MapReduce jobs Bob-Q1 to Bob-Q5 on Hadoop, Hadoop++, and HAIL.]

[Figure: Individual Jobs: Weblog, Job. Job runtime [sec] for MapReduce jobs Bob-Q1 to Bob-Q5 on Hadoop, Hadoop++, and HAIL.]

[Figure: Scheduling Overhead. Job runtime [sec] for MapReduce jobs Bob-Q1 to Bob-Q5 on Hadoop, Hadoop++, and HAIL, with the overhead portion marked.]


Summary(Summary(...))

fast querying

[Figure: Individual Jobs: Weblog. Job runtime [sec] for MapReduce jobs Bob-Q1 to Bob-Q5 on Hadoop, Hadoop++, and HAIL.]

Failover

[Figure: Failover. Job runtime [sec] and slowdown for Hadoop, HAIL, and HAIL-1Idx on 10 nodes with one node killed. Slowdown labels on the slide: 10.3 %, 10.5 %, and 5.5 %.]

Hadoop Scheduling

[Figure: Hadoop scheduling of the blocks stored on DN1, shown in their sort order; map(row) -> set of (ikey, value).]

7 map tasks (aka waves)

7 times scheduling overhead

HAIL Scheduling

[Figure: HAIL scheduling of one HAIL split on DN1 covering the same blocks; map(row) -> set of (ikey, value).]

1 map task (aka wave)

1 times scheduling overhead

Query Times with HAIL Scheduling
