Hadoop distributions: bottlenecks and tuning

Diomin Aliaksey, R&D

2014, Minsk

Hadoop Matrix

Distribution    Open Source   Monitoring   Target Group
Apache Hadoop   Yes           X            Developers
Cloudera        Yes           Good         All
Hortonworks     Yes           Good         All
MapR            No            Bad          Enterprise
PivotalHD       No            Bad          Enterprise

How to find the bottleneck?

Monitoring & Logs

Brain

All stages

Map stage

Fetch stage

Merge stage

All stages

All stages

The most popular approaches

1. Increase size of cluster

2. Increase input block size

3. Increase buffer size

Popular approach

Small cluster, slow tasks

We need more gold ……

Large cluster, slow tasks

Popular approach

Increase input block size
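
As a rough illustration of what a larger block size changes: with default settings, MapReduce usually creates about one map task per input split, and the split size typically follows the HDFS block size. The sketch below is plain Python, not Hadoop code; `num_map_tasks` is a hypothetical helper and the 10 GB input is an assumed figure.

```python
import math

MB = 1024 * 1024

def num_map_tasks(input_bytes: int, block_bytes: int) -> int:
    """Approximate map task count, assuming one task per block-sized split."""
    return math.ceil(input_bytes / block_bytes)

input_size = 10 * 1024 * MB  # assumed 10 GB input file

for block_mb in (64, 128, 256):
    tasks = num_map_tasks(input_size, block_mb * MB)
    print(block_mb, "MB blocks ->", tasks, "map tasks")
```

Fewer, larger splits reduce per-task startup overhead but also reduce parallelism, so whether this helps depends on the job and the cluster.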

Popular approach

Other techniques

1. Compression

2. Combiner

Combiner

Wordcount

Reduce function used as the combiner

combine 1: <a, 1> <b, 1> <a, 1> => <a, 2> <b, 1>

combine 2: <a, 1> <b, 1> => <a, 1> <b, 1>

Reduce: <a, {2, 1}> <b, {1, 1}> => <a, 3> <b, 2>
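
The wordcount case can be sketched in plain Python to show why the reduce function doubles as a combiner here: summing is associative, so pre-summing per map task does not change the final counts. This is a simulation, not Hadoop API code; `combine` and the `map_1`/`map_2` lists are illustrative names.

```python
from collections import defaultdict

def combine(pairs):
    """Sum counts per key; for wordcount this is the same logic as reduce."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Output of two map tasks, as on the slide
map_1 = [("a", 1), ("b", 1), ("a", 1)]
map_2 = [("a", 1), ("b", 1)]

combined_1 = combine(map_1)  # [("a", 2), ("b", 1)]
combined_2 = combine(map_2)  # [("a", 1), ("b", 1)]

# The reducer sees the combined pairs from both map tasks
result = combine(combined_1 + combined_2)
print(result)  # [('a', 3), ('b', 2)]
```

The combiner shrinks the first map task's output from three pairs to two before the shuffle, and the final result matches what reduce would produce on the raw map output.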

Combiner

Mean

combine 1: <k,40> <k,30> <k,20> => <k, 30>

combine 2: <k,2> <k,8> => <k, 5>

Reduce: <k, {30, 5}> => <k, 17.5>

True mean: (40 + 30 + 20 + 2 + 8)/5 = 20, not 17.5, so averaging the partial means gives the wrong result.

Combiner

Mean

combine 1: <k,<40,1>> <k,<30,1>> <k,<20,1>> => <k,<90,3>>

combine 2: <k,<2,1>> <k,<8,1>> => <k,<10,2>>

Reduce: <k, {<90,3>, <10,2>}> => <k, 20>
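
A minimal Python sketch of the two mean variants, using the slide's numbers. It is a simulation rather than Hadoop code, and `naive_mean` and `combine_sum_count` are illustrative names. The point it tries to show is that the combiner should emit (sum, count) pairs and leave the division to the reducer.

```python
def naive_mean(values):
    """Plain arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def combine_sum_count(pairs):
    """Merge (sum, count) pairs; associative, so usable as a combiner."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return (total, count)

split_1 = [40, 30, 20]
split_2 = [2, 8]

# Wrong: mean of the partial means
wrong = naive_mean([naive_mean(split_1), naive_mean(split_2)])

# Right: carry (sum, count) through the combiner, divide only in reduce
partial_1 = combine_sum_count([(v, 1) for v in split_1])  # (90, 3)
partial_2 = combine_sum_count([(v, 1) for v in split_2])  # (10, 2)
total, count = combine_sum_count([partial_1, partial_2])
right = total / count

print(wrong, right)  # 17.5 20.0
```

The mean of means is only correct when every split has the same number of values, which a combiner cannot assume.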
