The Performance of MapReduce: An In-depth Study
Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu (School of Computing, NUS)
Presented by Tang Kai
◦ Introduction
◦ Factors affecting the performance of MR
◦ Pruning the search space
◦ Implementation
◦ Benchmark
Index
MapReduce-based systems are increasingly widely used
◦ Simple yet impressive interface: Map() and Reduce()
◦ Flexible: independence from the storage system
◦ Scalable
◦ Fine-grained fault tolerance
Introduction
Previous studies
◦ Fundamental differences: schema support, data access, fault tolerance
◦ Benchmarks: parallel DBs >> MR-based systems
Motivation
Is it really impossible to have a flexible, scalable, and efficient MapReduce-based system?
This work
◦ Identifies several performance bottlenecks
◦ Mitigates those bottlenecks and tunes performance using well-known engineering and database techniques
Conclusion
◦ 2.5x-3.5x overall performance improvement
Objective
◦ Introduction
◦ Factors affecting the performance of MR
◦ Pruning the search space
◦ Implementation
◦ Benchmark
Index
7 steps of a MapReduce job
Factors affecting Performance of MR
1) Map
2) Parse
3) Process
4) Sort
5) Shuffle
6) Merge
7) Reduce
Four factors: I/O mode, Indexing, Parsing, Sorting
Factors affecting Performance of MR
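The steps above can be sketched as a toy single-process pipeline (my own illustration, not the paper's code; function names are invented), showing where parsing and sorting sit between map and reduce:

```python
from itertools import groupby
from operator import itemgetter

def run_job(raw_lines, parse, map_fn, reduce_fn):
    """Toy single-node pipeline: parse -> map -> sort -> merge/group -> reduce."""
    pairs = []
    for line in raw_lines:
        key, value = parse(line)          # 2) parse raw input into a <k, v> pair
        pairs.extend(map_fn(key, value))  # 1)/3) map and process
    pairs.sort(key=itemgetter(0))         # 4) sort intermediate records by key
    # (5/6: shuffle and merge are trivial on a single node)
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.append(reduce_fn(key, [v for _, v in group]))  # 7) reduce
    return out

# Word count as the classic example.
result = run_job(
    ["a b a", "b c"],
    parse=lambda line: (None, line),
    map_fn=lambda _, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda word, counts: (word, sum(counts)),
)
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```

The four factors studied in the paper each attack one stage of this pipeline: I/O mode and indexing affect the read step, parsing affects the decode step, and the sort scheme affects step 4.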
Direct I/O
◦ Reads data from the local disk directly
◦ Local reads only
Streaming I/O
◦ Streams data from the storage system via an inter-process communication scheme, such as TCP/IP or JDBC
◦ Works for both local and remote reads
Direct I/O outperforms streaming I/O by 10%-15%
I/O mode
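The contrast can be sketched in a few lines (my own toy model, not Hadoop's data node code): direct I/O reads the local file itself, while streaming I/O pulls the same bytes through an inter-process channel and pays the IPC cost.

```python
import os, socket, tempfile, threading

def read_direct(path):
    """Direct I/O: open the local file and read it ourselves."""
    with open(path, "rb") as f:
        return f.read()

def read_streaming(path):
    """Streaming I/O: a 'storage server' thread pushes the bytes over a socket."""
    server, client = socket.socketpair()
    def serve():
        with open(path, "rb") as f:
            server.sendall(f.read())  # same bytes, but copied through the kernel
        server.close()
    threading.Thread(target=serve).start()
    chunks = []
    while True:
        chunk = client.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    client.close()
    return b"".join(chunks)

fd, path = tempfile.mkstemp()
os.write(fd, b"some chunk of input data")
os.close(fd)
assert read_direct(path) == read_streaming(path)  # identical data, different cost
os.remove(path)
```

Both modes return the same data; the 10%-15% gap reported above comes purely from avoiding the extra copy and the IPC round trips.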
Input of a MapReduce job, and the index type that applies:
◦ A set of files stored in a distributed file system (e.g. HDFS): range indexes
◦ HDFS input files that are not sorted, but where each data chunk is indexed by keys: block-level indexes
◦ Tables stored in database servers: database indexed tables
Indexing
Indexing boosts the selection task by 2x-10x, depending on the selectivity.
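A block-level index can be illustrated roughly like this (my own sketch, not the paper's implementation): each chunk records the key range it contains, so a selection scans only the chunks whose range overlaps the predicate.

```python
# Each chunk of the input file carries (min_key, max_key) metadata.
chunks = [
    {"range": (0, 99),    "records": [(5, "a"), (42, "b")]},
    {"range": (100, 199), "records": [(150, "c")]},
    {"range": (200, 299), "records": [(201, "d"), (250, "e")]},
]

def select(chunks, lo, hi):
    """Return values whose key falls in [lo, hi], skipping non-overlapping chunks."""
    out = []
    for chunk in chunks:
        cmin, cmax = chunk["range"]
        if cmax < lo or cmin > hi:
            continue  # the index lets us skip this chunk entirely
        out.extend(v for k, v in chunk["records"] if lo <= k <= hi)
    return out

print(select(chunks, 140, 210))  # ['c', 'd']
```

The more selective the predicate, the more chunks are skipped, which is why the speedup varies from 2x to 10x with selectivity.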
Raw data -> <k,v> pairs
Immutable decoding
◦ Creates a new read-only record per input (set once)
Mutable decoding
◦ Reuses a single record object across inputs
The mutable decoder is 10x faster
◦ Boosts the selection task by 2x overall
Parsing
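The difference can be sketched in a few lines (a toy model, not Hadoop's decoders): the immutable decoder allocates a fresh record per input line, while the mutable decoder overwrites one reused record, avoiding per-record allocation and garbage-collection pressure.

```python
class Record:
    __slots__ = ("key", "value")
    def set(self, key, value):
        self.key, self.value = key, value
        return self

def decode_immutable(lines):
    """One new read-only record per line (set once)."""
    for line in lines:
        k, v = line.split(",", 1)
        r = Record()           # fresh allocation for every record
        yield r.set(k, v)

def decode_mutable(lines):
    """A single record object, overwritten for each line."""
    r = Record()               # allocated once, reused for every record
    for line in lines:
        k, v = line.split(",", 1)
        yield r.set(k, v)      # caller must consume r before the next line

lines = ["1,foo", "2,bar"]
assert [r.value for r in decode_immutable(lines)] == ["foo", "bar"]
assert [r.value for r in decode_mutable(lines)] == ["foo", "bar"]
```

The trade-off, as the comment notes, is that a mutable record is only valid until the next input is decoded; that restriction is what buys the 10x decoding speedup.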
Map-side sorting affects the performance of aggregation
◦ The cost of key comparison is non-trivial
Example
◦ Group by sourceIP in the UserVisits table
◦ Intermediate records are sorted by sourceIP, a variable-length string
Two schemes: string compare (byte-to-byte) vs. fingerprint compare (integer)
◦ Fingerprint-based comparison is 4x-5x faster, a 20%-25% overall improvement
Sorting
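The fingerprint idea can be sketched as follows (my own illustration of the technique): hash each variable-length key to a fixed-size integer once, compare integers first, and fall back to a byte-to-byte comparison only on fingerprint ties.

```python
import zlib

def fingerprint(key: bytes) -> int:
    """Fixed-size integer fingerprint of a variable-length key (computed once)."""
    return zlib.crc32(key)

def sort_by_fingerprint(keys):
    # Comparing (int, bytes) tuples means most comparisons resolve on the
    # cheap integer; the byte-to-byte compare only breaks fingerprint ties.
    tagged = [(fingerprint(k), k) for k in keys]
    tagged.sort()
    return [k for _, k in tagged]

ips = [b"156.11.3.9", b"10.0.0.1", b"156.11.3.9", b"192.168.0.7"]
out = sort_by_fingerprint(ips)
# The order is by fingerprint rather than lexicographic, but equal keys
# still end up adjacent -- which is all the grouping step of an
# aggregation needs.
i = out.index(b"156.11.3.9")
assert out[i] == out[i + 1] == b"156.11.3.9"
```

This only works for grouping because aggregation needs equal keys to be adjacent, not globally sorted output; a query requiring lexicographic order would still need the full string compare.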
Why prune?
◦ 4 factors result in a large search space (2*2*3*2 = 24 combinations)
◦ Budget limit on Amazon EC2
◦ So: a greedy strategy
Pruning search space
Greedy strategy
Pruning search space
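The greedy idea can be sketched as follows (my own sketch; the cost numbers are purely illustrative): instead of benchmarking all 2*2*3*2 = 24 combinations, tune one factor at a time while holding the others fixed, which needs only 2+2+3+2 = 9 benchmark runs.

```python
factors = {
    "io":     ["streaming", "direct"],
    "index":  ["off", "on"],
    "parser": ["writable", "protobuf", "bdb"],
    "sort":   ["string", "fingerprint"],
}

def runtime(cfg):
    """Stand-in for an EC2 benchmark run (made-up additive cost model)."""
    cost = 100
    cost -= 12 if cfg["io"] == "direct" else 0
    cost -= 30 if cfg["index"] == "on" else 0
    cost -= {"writable": 5, "protobuf": 8, "bdb": 2}[cfg["parser"]]
    cost -= 20 if cfg["sort"] == "fingerprint" else 0
    return cost

def greedy_tune(factors):
    cfg = {f: opts[0] for f, opts in factors.items()}  # start from defaults
    runs = 0
    for f, opts in factors.items():
        # Benchmark every option of this one factor, others held fixed.
        cfg[f] = min(opts, key=lambda o: runtime({**cfg, f: o}))
        runs += len(opts)
    return cfg, runs

cfg, runs = greedy_tune(factors)
assert runs == 9  # 2 + 2 + 3 + 2 instead of 24 exhaustive runs
print(cfg)        # each factor set to its locally best option
```

The greedy result matches the exhaustive optimum whenever the factors contribute independently to runtime, which is the implicit assumption behind pruning the space this way.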
Benchmarked dimensions:
◦ I/O mode: direct I/O vs. streaming I/O
◦ Parser: Hadoop Writable, Google's Protocol Buffers, Berkeley DB
◦ Different sort schemes in various architectures
◦ 3 datasets
◦ 4 queries
Benchmark
◦ Introduction
◦ Factors affecting the performance of MR
◦ Pruning the search space
◦ Implementation
◦ Benchmark
Index
Hadoop 0.19.2 as the code base
Direct I/O
◦ Modification of the data node implementation
Text decoder
◦ Immutable: the same as DeWitt et al.'s
◦ Mutable: implemented by ourselves
Binary decoders
◦ Hadoop: the immutable Writable decoder; a mutable decoder built on the Hadoop API by ourselves
◦ Google Protocol Buffers: the built-in compiler generates a mutable decoder; an immutable one by ourselves
◦ Berkeley DB: the BDB binding API (mutable)
Implementation details
Amazon EC2 (Elastic Compute Cloud)
◦ 7.5GB memory
◦ 2 virtual cores
◦ 64-bit Fedora 8
Tuned EC2 disk I/O by shifting peak times
Hadoop settings
◦ HDFS block size: 512MB
◦ JVM heap size: 1024MB
Environment details
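The two Hadoop settings above would be expressed in hadoop-site.xml roughly as follows (a sketch using Hadoop 0.19-era property names; verify against your version's documentation):

```xml
<configuration>
  <!-- HDFS block size: 512MB (536870912 bytes) -->
  <property>
    <name>dfs.block.size</name>
    <value>536870912</value>
  </property>
  <!-- JVM heap for task processes: 1024MB -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
</configuration>
```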
◦ Introduction
◦ Factors affecting the performance of MR
◦ Pruning the search space
◦ Implementation
◦ Benchmark
Index
Results for different I/O modes
◦ Single node
◦ No-op job: a map phase but no reduce phase
Benchmark for I/O
Results for record parsing
◦ Run in a plain Java process instead of a MapReduce job
◦ Timing starts after the data is loaded into memory
Mutable decoding outperforms immutable decoding
◦ The mutable text decoder also outperforms the mutable binary decoders
Benchmark for parsing
Among Hadoop-based systems
◦ Caching is the differentiating factor
Between Hadoop-based systems and parallel DBs
◦ Performance is close
Benchmark for Grep Task
Selection task: from full scan to index-based access
◦ Key techniques: caching and indexing
Benchmark for Selection Task
◦ Parsing: 2x faster
◦ Sorting: 20%-25% faster
◦ The gains are not significant in the small aggregation task
Benchmark for Aggregation Task
Large: SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
Small: SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
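The two queries differ only in the cardinality of the group key; sketched as a toy map/reduce aggregation (my own model, not the benchmark code):

```python
from collections import defaultdict

def aggregate(rows, key_fn):
    """GROUP BY key_fn(sourceIP), SUM(adRevenue) as a map + reduce pair."""
    # Map: emit a (group key, adRevenue) pair per row.
    pairs = [(key_fn(ip), rev) for ip, rev in rows]
    # Shuffle + reduce: sum revenues per key.
    totals = defaultdict(float)
    for k, v in pairs:
        totals[k] += v
    return dict(totals)

rows = [("156.11.3.9", 1.0), ("156.11.4.2", 2.0), ("156.11.3.9", 0.5)]
# Large aggregation: one group per distinct sourceIP (many intermediate keys).
assert aggregate(rows, lambda ip: ip) == {"156.11.3.9": 1.5, "156.11.4.2": 2.0}
# Small aggregation: one group per 7-char prefix, i.e. SUBSTR(sourceIP, 1, 7).
assert aggregate(rows, lambda ip: ip[:7]) == {"156.11.": 3.5}
```

With few distinct keys (the small query), little data survives the map side, so parsing and sorting improvements matter less, which is the pattern the benchmark reports.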
◦ Effect of the decoding scheme
◦ Comparison of the tuned MR-based system and a parallel DB
Benchmark for Join Task
Cons
◦ Changes need to be committed to (or forked from) the Hadoop source code tree
◦ A complete framework is needed instead of miscellaneous patches
◦ Various APIs should be supported: CLI and Web, rather than Java only
Future work
◦ Provide a query parser, optimizer, etc. to build a complete solution
◦ Elastic, power-aware, data-intensive Cloud
Source code: http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz
Cons & Future work
Tenzing: A SQL Implementation On The MapReduce Framework