The Performance of MapReduce: An In-depth Study
Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu (School of Computing, NUS)
Presented by Tang Kai
◦ Introduction
◦ Factors affecting the performance of MR
◦ Pruning the search space
◦ Implementation
◦ Benchmark
Index
MapReduce-based systems are increasingly widely used
◦ Simple yet impressive interface: Map() and Reduce()
◦ Flexible: independence from the storage system
◦ Scalable
◦ Fine-grained fault tolerance
Introduction
Previous studies
◦ Fundamental differences: schema support, data access, fault tolerance
◦ Benchmarks: parallel DBs >> MR-based systems
Motivation
Is it really impossible to have a flexible, scalable, and efficient MapReduce-based system?
This work
◦ Identifies several performance bottlenecks
◦ Mitigates those bottlenecks and tunes performance using well-known engineering and database techniques
Conclusion
◦ 2.5x-3.5x overall performance improvement
Objective
◦ Introduction
◦ Factors affecting the performance of MR
◦ Pruning the search space
◦ Implementation
◦ Benchmark
Index
7 steps of a MapReduce job
Factors affecting Performance of MR
1) Map
2) Parse
3) Process
4) Sort
5) Shuffle
6) Merge
7) Reduce
Four factors: I/O mode, Indexing, Parsing, Sorting
Factors affecting Performance of MR
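The steps above can be sketched as a toy single-process pipeline (my own illustration, not the paper's code; function names are invented), showing where parsing and sorting sit between map and reduce:

```python
from itertools import groupby
from operator import itemgetter

def run_job(raw_lines, parse, map_fn, reduce_fn):
    """Toy single-node pipeline: parse -> map -> sort -> merge/group -> reduce."""
    pairs = []
    for line in raw_lines:
        key, value = parse(line)          # 2) parse raw input into a <k, v> pair
        pairs.extend(map_fn(key, value))  # 1)/3) map and process
    pairs.sort(key=itemgetter(0))         # 4) sort intermediate records by key
    # (5/6: shuffle and merge are trivial on a single node)
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.append(reduce_fn(key, [v for _, v in group]))  # 7) reduce
    return out

# Word count as the classic example.
result = run_job(
    ["a b a", "b c"],
    parse=lambda line: (None, line),
    map_fn=lambda _, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda word, counts: (word, sum(counts)),
)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

The four factors studied in the paper each attack one stage of this pipeline: I/O mode and indexing affect the read step, parsing affects the decode step, and the sort scheme affects step 4.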
Direct I/O
◦ Reads data from the local disk directly
◦ Local reads only
Streaming I/O
◦ Streams data from the storage system via an inter-process communication scheme, such as TCP/IP or JDBC
◦ Works for both local and remote reads
Direct I/O outperforms streaming I/O by 10%-15%
I/O mode
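The contrast can be sketched in a few lines (my own toy model, not Hadoop's data node code): direct I/O reads the local file itself, while streaming I/O pulls the same bytes through an inter-process channel and pays the IPC cost.

```python
import os, socket, tempfile, threading

def read_direct(path):
    """Direct I/O: open the local file and read it ourselves."""
    with open(path, "rb") as f:
        return f.read()

def read_streaming(path):
    """Streaming I/O: a 'storage server' thread pushes the bytes over a socket."""
    server, client = socket.socketpair()
    def serve():
        with open(path, "rb") as f:
            server.sendall(f.read())  # same bytes, but copied through the kernel
        server.close()
    threading.Thread(target=serve).start()
    chunks = []
    while True:
        chunk = client.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    client.close()
    return b"".join(chunks)

fd, path = tempfile.mkstemp()
os.write(fd, b"some chunk of input data")
os.close(fd)
assert read_direct(path) == read_streaming(path)  # identical data, different cost
os.remove(path)
```

Both modes return the same data; the 10%-15% gap reported above comes purely from avoiding the extra copy and the IPC round trips.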
Input of a MapReduce job, and the index type that applies:
◦ A set of files stored in a distributed file system (e.g. HDFS): range indexes
◦ HDFS input files that are not sorted, but where each data chunk is indexed by keys: block-level indexes
◦ Tables stored in database servers: database indexed tables
Indexing
Indexing boosts the selection task by 2x-10x, depending on the selectivity.
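A block-level index can be illustrated roughly like this (my own sketch, not the paper's implementation): each chunk records the key range it contains, so a selection scans only the chunks whose range overlaps the predicate.

```python
# Each chunk of the input file carries (min_key, max_key) metadata.
chunks = [
    {"range": (0, 99),    "records": [(5, "a"), (42, "b")]},
    {"range": (100, 199), "records": [(150, "c")]},
    {"range": (200, 299), "records": [(201, "d"), (250, "e")]},
]

def select(chunks, lo, hi):
    """Return values whose key falls in [lo, hi], skipping non-overlapping chunks."""
    out = []
    for chunk in chunks:
        cmin, cmax = chunk["range"]
        if cmax < lo or cmin > hi:
            continue  # the index lets us skip this chunk entirely
        out.extend(v for k, v in chunk["records"] if lo <= k <= hi)
    return out

print(select(chunks, 140, 210))  # ['c', 'd']
```

The more selective the predicate, the more chunks are skipped, which is why the speedup varies from 2x to 10x with selectivity.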
Raw data -> <k,v> pairs
Immutable decoding
◦ Creates a new read-only record per input (set once)
Mutable decoding
◦ Reuses a single record object across inputs
The mutable decoder is 10x faster
◦ Boosts the selection task by 2x overall
Parsing
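The difference can be sketched in a few lines (a toy model, not Hadoop's decoders): the immutable decoder allocates a fresh record per input line, while the mutable decoder overwrites one reused record, avoiding per-record allocation and garbage-collection pressure.

```python
class Record:
    __slots__ = ("key", "value")
    def set(self, key, value):
        self.key, self.value = key, value
        return self

def decode_immutable(lines):
    """One new read-only record per line (set once)."""
    for line in lines:
        k, v = line.split(",", 1)
        r = Record()           # fresh allocation for every record
        yield r.set(k, v)

def decode_mutable(lines):
    """A single record object, overwritten for each line."""
    r = Record()               # allocated once, reused for every record
    for line in lines:
        k, v = line.split(",", 1)
        yield r.set(k, v)      # caller must consume r before the next line

lines = ["1,foo", "2,bar"]
assert [r.value for r in decode_immutable(lines)] == ["foo", "bar"]
assert [r.value for r in decode_mutable(lines)] == ["foo", "bar"]
```

The trade-off, as the comment notes, is that a mutable record is only valid until the next input is decoded; that restriction is what buys the 10x decoding speedup.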
Map-side sorting affects the performance of aggregation
◦ The cost of key comparison is non-trivial
Example
◦ Group by sourceIP in the UserVisits table
◦ Intermediate records are sorted by sourceIP, a variable-length string
Two schemes: string compare (byte-to-byte) vs. fingerprint compare (integer)
◦ Fingerprint-based comparison is 4x-5x faster, a 20%-25% overall improvement
Sorting
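The fingerprint idea can be sketched as follows (my own illustration of the technique): hash each variable-length key to a fixed-size integer once, compare integers first, and fall back to a byte-to-byte comparison only on fingerprint ties.

```python
import zlib

def fingerprint(key: bytes) -> int:
    """Fixed-size integer fingerprint of a variable-length key (computed once)."""
    return zlib.crc32(key)

def sort_by_fingerprint(keys):
    # Comparing (int, bytes) tuples means most comparisons resolve on the
    # cheap integer; the byte-to-byte compare only breaks fingerprint ties.
    tagged = [(fingerprint(k), k) for k in keys]
    tagged.sort()
    return [k for _, k in tagged]

ips = [b"156.11.3.9", b"10.0.0.1", b"156.11.3.9", b"192.168.0.7"]
out = sort_by_fingerprint(ips)
# The order is by fingerprint rather than lexicographic, but equal keys
# still end up adjacent -- which is all the grouping step of an
# aggregation needs.
i = out.index(b"156.11.3.9")
assert out[i] == out[i + 1] == b"156.11.3.9"
```

This only works for grouping because aggregation needs equal keys to be adjacent, not globally sorted output; a query requiring lexicographic order would still need the full string compare.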
Why prune?
◦ 4 factors result in a large search space (2*2*3*2 = 24 combinations)
◦ Budget limit on Amazon EC2
◦ So: a greedy strategy
Pruning search space
Greedy strategy
Pruning search space
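The greedy idea can be sketched as follows (my own sketch; the cost numbers are purely illustrative): instead of benchmarking all 2*2*3*2 = 24 combinations, tune one factor at a time while holding the others fixed, which needs only 2+2+3+2 = 9 benchmark runs.

```python
factors = {
    "io":     ["streaming", "direct"],
    "index":  ["off", "on"],
    "parser": ["writable", "protobuf", "bdb"],
    "sort":   ["string", "fingerprint"],
}

def runtime(cfg):
    """Stand-in for an EC2 benchmark run (made-up additive cost model)."""
    cost = 100
    cost -= 12 if cfg["io"] == "direct" else 0
    cost -= 30 if cfg["index"] == "on" else 0
    cost -= {"writable": 5, "protobuf": 8, "bdb": 2}[cfg["parser"]]
    cost -= 20 if cfg["sort"] == "fingerprint" else 0
    return cost

def greedy_tune(factors):
    cfg = {f: opts[0] for f, opts in factors.items()}  # start from defaults
    runs = 0
    for f, opts in factors.items():
        # Benchmark every option of this one factor, others held fixed.
        cfg[f] = min(opts, key=lambda o: runtime({**cfg, f: o}))
        runs += len(opts)
    return cfg, runs

cfg, runs = greedy_tune(factors)
assert runs == 9  # 2 + 2 + 3 + 2 instead of 24 exhaustive runs
print(cfg)        # each factor set to its locally best option
```

The greedy result matches the exhaustive optimum whenever the factors contribute independently to runtime, which is the implicit assumption behind pruning the space this way.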
Benchmarked dimensions:
◦ I/O mode: direct I/O vs. streaming I/O
◦ Parser: Hadoop Writable, Google's Protocol Buffers, Berkeley DB
◦ Different sort schemes in various architectures
◦ 3 datasets
◦ 4 queries
Benchmark
◦ Introduction
◦ Factors affecting the performance of MR
◦ Pruning the search space
◦ Implementation
◦ Benchmark
Index
Hadoop 0.19.2 as the code base
Direct I/O
◦ Modification of the data node implementation
Text decoder
◦ Immutable: the same as DeWitt et al.'s
◦ Mutable: implemented by ourselves
Binary decoders
◦ Hadoop: the immutable Writable decoder; a mutable decoder built on the Hadoop API by ourselves
◦ Google Protocol Buffers: the built-in compiler generates a mutable decoder; an immutable one by ourselves
◦ Berkeley DB: the BDB binding API (mutable)
Implementation details
Amazon EC2 (Elastic Compute Cloud)
◦ 7.5GB memory
◦ 2 virtual cores
◦ 64-bit Fedora 8
Tuned EC2 disk I/O by shifting peak times
Hadoop settings
◦ HDFS block size: 512MB
◦ JVM heap size: 1024MB
Environment details
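The two Hadoop settings above would be expressed in hadoop-site.xml roughly as follows (a sketch using Hadoop 0.19-era property names; verify against your version's documentation):

```xml
<configuration>
  <!-- HDFS block size: 512MB (536870912 bytes) -->
  <property>
    <name>dfs.block.size</name>
    <value>536870912</value>
  </property>
  <!-- JVM heap for task processes: 1024MB -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
</configuration>
```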
◦ Introduction
◦ Factors affecting the performance of MR
◦ Pruning the search space
◦ Implementation
◦ Benchmark
Index
Results for different I/O modes
◦ Single node
◦ No-op job: a map phase but no reduce phase
Benchmark for I/O
Results for record parsing
◦ Run in a plain Java process instead of a MapReduce job
◦ Timing starts after the data is loaded into memory
Mutable decoding outperforms immutable decoding
◦ The mutable text decoder also outperforms the mutable binary decoders
Benchmark for parsing
Among Hadoop-based systems
◦ Caching is the differentiating factor
Between Hadoop-based systems and parallel DBs
◦ Performance is close
Benchmark for Grep Task
Selection task: from full scan to index-based access
◦ Key techniques: caching and indexing
Benchmark for Selection Task
◦ Parsing: 2x faster
◦ Sorting: 20%-25% faster
◦ The gains are not significant in the small aggregation task
Benchmark for Aggregation Task
Large: SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
Small: SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
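The two queries differ only in the cardinality of the group key; sketched as a toy map/reduce aggregation (my own model, not the benchmark code):

```python
from collections import defaultdict

def aggregate(rows, key_fn):
    """GROUP BY key_fn(sourceIP), SUM(adRevenue) as a map + reduce pair."""
    # Map: emit a (group key, adRevenue) pair per row.
    pairs = [(key_fn(ip), rev) for ip, rev in rows]
    # Shuffle + reduce: sum revenues per key.
    totals = defaultdict(float)
    for k, v in pairs:
        totals[k] += v
    return dict(totals)

rows = [("156.11.3.9", 1.0), ("156.11.4.2", 2.0), ("156.11.3.9", 0.5)]
# Large aggregation: one group per distinct sourceIP (many intermediate keys).
assert aggregate(rows, lambda ip: ip) == {"156.11.3.9": 1.5, "156.11.4.2": 2.0}
# Small aggregation: one group per 7-char prefix, i.e. SUBSTR(sourceIP, 1, 7).
assert aggregate(rows, lambda ip: ip[:7]) == {"156.11.": 3.5}
```

With few distinct keys (the small query), little data survives the map side, so parsing and sorting improvements matter less, which is the pattern the benchmark reports.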
◦ Effect of the decoding scheme
◦ Comparison of the tuned MR-based system and a parallel DB
Benchmark for Join Task
Cons
◦ Changes need to be committed to (or forked from) the Hadoop source code tree
◦ A complete framework is needed instead of miscellaneous patches
◦ Various APIs should be supported: CLI and Web, rather than Java only
Future work
◦ Provide a query parser, optimizer, etc. to build a complete solution
◦ Elastic, power-aware, data-intensive Cloud
Source code: http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz
Cons & Future work
Tenzing: A SQL Implementation On The MapReduce Framework