Hadoop distributions: bottlenecks and tuning
DESCRIPTION
This presentation by Alexey Diomin, R&D Engineer at Altoros, explains how to spot performance bottlenecks in Hadoop and outlines five approaches to eliminating them.
TRANSCRIPT
Hadoop distributions. Bottlenecks and tuning.
Diomin Aliaksey, R&D
2014, Minsk
3
Hadoop Matrix
Distribution     Open Source   Monitoring   Target Group
Apache Hadoop    Yes           X            Developers
Cloudera         Yes           Good         All
Hortonworks      Yes           Good         All
MapR             No            Bad          Enterprise
PivotalHD        No            Bad          Enterprise
4
How to find the bottleneck?
5
Monitoring & Logs
6
Brain
All stages
8
Map stage
9
Fetch stage
10
Merge stage
11
All stages
12
All stages
13
The most popular approaches
1. Increase size of cluster
2. Increase input block size
3. Increase buffer size
15
Small cluster, slow tasks
16
We need more gold...
17
Large cluster, slow tasks
19
Increase input block size
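The effect of a larger block size on the number of map tasks can be estimated with simple arithmetic. The sketch below assumes one map task per block-sized split of a single large, splittable input; the 100 GB input size and the helper name `map_task_count` are hypothetical, and real split computation also depends on the input format and job settings.

```python
def map_task_count(input_bytes, block_bytes):
    """Estimate map tasks as ceil(input size / block size), using integer math."""
    return -(-input_bytes // block_bytes)

MB = 1024 ** 2
GB = 1024 ** 3

input_size = 100 * GB  # hypothetical input size, for illustration only

for block_mb in (64, 128, 256):
    tasks = map_task_count(input_size, block_mb * MB)
    print(f"block size {block_mb} MB -> about {tasks} map tasks")
```

Under these assumptions, doubling the block size halves the number of map tasks, which reduces per-task startup overhead but also reduces the available parallelism.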
22
Other techniques
1. Compression
2. Combiner
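As a rough illustration of why compression helps, the sketch below compresses a batch of repetitive key/value records of the kind a map task might emit. It uses Python's `zlib` only as a stand-in; a Hadoop job would be configured with one of its own codecs rather than calling `zlib` directly, and the record layout here is an assumption made for the example.

```python
import zlib

# Hypothetical intermediate records: tab-separated <word, 1> pairs, one per line.
records = "".join(f"word{i % 50}\t1\n" for i in range(10_000)).encode("utf-8")

compressed = zlib.compress(records)

# Compression is lossless: the original records are fully recoverable.
restored = zlib.decompress(compressed)

print(f"raw bytes: {len(records)}, compressed bytes: {len(compressed)}")
```

Fewer bytes written to disk and sent over the network during the shuffle is the benefit; the cost is extra CPU time for compressing and decompressing.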
23
Combiner
Wordcount: reduce function reused as the combiner
combine 1: <a, 1> <b, 1> <a, 1> => <a, 2> <b, 1>
combine 2: <a, 1> <b, 1> => <a, 1> <b, 1>
Reduce: <a, {1, 2}> <b, {1, 1}> => <a, 3> <b, 2>
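The wordcount data flow above can be simulated in plain Python. This is a sketch of the idea only, not Hadoop API code; the function name `sum_counts` is hypothetical.

```python
from collections import defaultdict

def sum_counts(pairs):
    """Sum the values per key. Used here as both the combiner and the reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Output of two map tasks, as on the slide.
map_output_1 = [("a", 1), ("b", 1), ("a", 1)]
map_output_2 = [("a", 1), ("b", 1)]

# The combiner runs locally on each map task's output.
combined_1 = sum_counts(map_output_1)  # [('a', 2), ('b', 1)]
combined_2 = sum_counts(map_output_2)  # [('a', 1), ('b', 1)]

# The reducer sees the combined pairs from all map tasks.
result = sum_counts(combined_1 + combined_2)
print(result)  # [('a', 3), ('b', 2)]
```

Reusing the reducer works here because addition is associative and commutative, so summing partial sums gives the same total as summing the raw values.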
25
Combiner
Mean: reduce function reused as the combiner (incorrect result)
combine 1: <k,40> <k,30> <k,20> => <k, 30>
combine 2: <k,2> <k,8> => <k, 5>
Reduce: <k, {30, 5}> => <k, 17.5>
Correct mean: (40 + 30 + 20 + 2 + 8)/5 = 20, not 17.5
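The mismatch is easy to reproduce. The following sketch, using the slide's values and a hypothetical helper `mean`, compares the mean of per-split means with the mean over all values.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Values seen by two map tasks, as on the slide.
split_1 = [40, 30, 20]
split_2 = [2, 8]

# Naive combiner: each split is reduced to its own mean.
partial_1 = mean(split_1)  # 30.0
partial_2 = mean(split_2)  # 5.0

# Naive reducer: the mean of the partial means.
mean_of_means = mean([partial_1, partial_2])  # 17.5

# The correct answer: the mean over all five values.
true_mean = mean(split_1 + split_2)  # 20.0

print(mean_of_means, true_mean)
```

The two results differ because the splits have different sizes, so each partial mean should carry a different weight; the naive combiner throws that information away.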
26
Combiner
Mean: combiner emits <sum, count> pairs
combine 1: <k,<40,1>> <k,<30,1>> <k,<20,1>> => <k,<90,3>>
combine 2: <k,<2,1>> <k,<8,1>> => <k,<10,2>>
Reduce: <k, {<90,3>, <10,2>}> => <k, 20>
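A minimal Python sketch of the <sum, count> approach, again simulating the data flow rather than using Hadoop APIs; the names `combine` and `reduce_mean` are hypothetical.

```python
def combine(pairs):
    """Add up (sum, count) pairs component-wise into a single (sum, count) pair."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return (total, count)

def reduce_mean(pairs):
    """Merge all (sum, count) pairs, then divide once at the very end."""
    total, count = combine(pairs)
    return total / count

# Each map task emits (value, 1) for every input value, as on the slide.
map_output_1 = [(40, 1), (30, 1), (20, 1)]
map_output_2 = [(2, 1), (8, 1)]

combined_1 = combine(map_output_1)  # (90, 3)
combined_2 = combine(map_output_2)  # (10, 2)

result = reduce_mean([combined_1, combined_2])
print(result)  # 20.0
```

Because `combine` only adds, it gives the same final answer whether it runs once, several times, or not at all on a given map output, which matters since the framework does not guarantee how many times a combiner is applied.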
27