
Hadoop Jungle

Alexey Zinovyev, Java/BigData Trainer in EPAM

About

I am a &lt;graph theory, machine learning, traffic jam prediction, BigData algorithms&gt; scientist

But I’m a &lt;Java, Scala, NoSQL, Hadoop, Spark&gt; programmer and trainer


Contacts

E-mail : [email protected]

Twitter : @zaleslaw @BigDataRussia

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs

Are you a Hadoop developer?

Let’s do THIS!

WHAT IS BIG DATA?

Joke about Excel

Every 60 seconds…

From Mobile Devices

We started to keep and handle stupid new things!

10^6 rows in MySQL

GB->TB->PB->?

Is BigData about PBs?

It’s hard to …

• .. store

• .. handle

• .. search in

• .. visualize

• .. send over the network


Just do it … in parallel

Scale-up vs scale-out

Scale-Up: grow one node (16 CPUs → 48 CPUs). Scale-Out: add more 16-CPU nodes.

Motivation: Fault tolerance

You need to write

• .. distributed on-disk storage

• .. in-memory storage (or shared memory buffer)

• .. a thread pool to run hundreds of threads

• .. synchronization for all components

• .. an API for reuse by other developers

We all love to reinvent the wheel, but…

HADOOP

Disks Performance

The main concept

Let’s read data in parallel

“Cheap” cluster

It must survive

Parallel Computing vs Distributed Computing

Hadoop and its workers: send me your little code

Hadoop Jobs

• Hadoop Commons

• Hadoop Clients

• HDFS

• YARN

• MapReduce

Main components


Hadoop frameworks

• Universal (MapReduce, Tez, Kudu, RDD in Spark)

• Abstract (Pig, Pipeline Spark)

• SQL - like (Hive, Impala, Spark SQL)

• Processing graph (Giraph, GraphX)

• Machine Learning (Mahout, MLlib)

• Stream processing (Spark Streaming, Storm)

Hadoop Architecture

• Automatic parallelization and distribution

• Fault-tolerance

• Data Locality

• Writing the Map and Reduce functions only

• Single-threaded model

Key features

HDFS DAEMONS

The main idea

'Time to transfer' > 'Time to seek'
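A back-of-the-envelope check of this idea (the 100 MB/s transfer rate and 10 ms seek time below are illustrative assumptions, not numbers from the slides): the bigger the block, the smaller the fraction of time wasted on seeking.

```java
public class SeekVsTransfer {
    // Illustrative disk numbers (assumptions): ~100 MB/s transfer, ~10 ms per seek.
    static final double TRANSFER_MB_PER_SEC = 100.0;
    static final double SEEK_MS = 10.0;

    // Total time to read one block after a single seek, in milliseconds.
    static double readTimeMs(double blockMb) {
        return SEEK_MS + blockMb / TRANSFER_MB_PER_SEC * 1000.0;
    }

    // Fraction of total read time spent seeking rather than transferring.
    static double seekFraction(double blockMb) {
        return SEEK_MS / readTimeMs(blockMb);
    }

    public static void main(String[] args) {
        // 4 KB database page vs 128 MB HDFS-style block
        System.out.printf("4 KB page:    %.1f%% of time is seek%n",
                100 * seekFraction(4.0 / 1024));
        System.out.printf("128 MB block: %.1f%% of time is seek%n",
                100 * seekFraction(128));
    }
}
```

With a tiny page almost all the time is seek; with a 128 MB block the seek cost drops below one percent, which is why HDFS reads big sequential blocks.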


Files in HDFS

• NameNode

• DataNode

• SecondaryNode (not for HA)

• StandbyNode

• Checkpoint Node

• Backup Node

HDFS node types

The main thought about HDFS

An HDFS node is a JVM daemon


• monitor with JMX

• use jmap, jps and so on…

• configure the NameNode Heap Size

• use the power of JVM flags

You can do all this with an HDFS node

MAPREDUCE THEORY

MapReduce in different languages

• WordCount

• Log handling

• Filtering

• Report preparation

MR Typical Tasks

Why should we use MapReduce?

We try to reduce ‘time to act’ but keep BigData

Classic Batch

Do you like batches?

Fixed Windows

Filter all elements

Yet another YARN lover

Pay attention!

Hadoop != MapReduce

Think in Key-Value style

map (k1, v1) → list(k2, v2)

reduce (k2, list(v2*) )→ list(k3, v3)
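The signatures above can be simulated with plain Java collections. A toy WordCount sketch (no Hadoop classes involved; the map/shuffle/reduce names mirror the phases, not the real API):

```java
import java.util.*;

public class MapReduceSketch {
    // map(k1, v1) -> list(k2, v2): here (offset, line) -> (word, 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // shuffle: group all values by key, keys delivered in sorted order
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce(k2, list(v2)) -> list(k3, v3): here (word, [1,1,...]) -> (word, sum)
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    // Drive all three phases over some input lines.
    static Map<String, Integer> wordCount(String... lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            pairs.addAll(map(offset, line));
            offset += line.length() + 1;
        }
        return reduce(shuffle(pairs));
    }
}
```

Running wordCount("cat dog", "dog dog") yields {cat=1, dog=3}: the mapper emits one (word, 1) pair per word, the shuffle groups them by word, and the reducer sums each group.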

• Map

• Shuffle

• Reduce

Main steps

The main performance idea

Reduce shuffle time & resources

Minimal Runner

public class MinimalMapReduce extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
        System.exit(exitCode);
    }
}

Job Config

job.setInputFormatClass(TextInputFormat.class);

job.setMapperClass(Mapper.class);

job.setMapOutputKeyClass(LongWritable.class);

job.setMapOutputValueClass(Text.class);

job.setPartitionerClass(HashPartitioner.class);

job.setNumReduceTasks(1);

job.setReducerClass(Reducer.class);

job.setOutputKeyClass(LongWritable.class);

job.setOutputValueClass(Text.class);

job.setOutputFormatClass(TextOutputFormat.class);

Customize MapReduce!

Hadoop club rules

MAPREDUCE ADVANCED

MapReduce for WordCount

WordCount Combiner

Combine

Combiner as a Local Reducer
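The local-reducer idea can be sketched in plain Java (an illustrative simulation, not the Hadoop Combiner API): the combiner pre-aggregates one mapper’s output, so far fewer pairs cross the network during shuffle.

```java
import java.util.*;

public class CombinerSketch {
    // A combiner is "a local reducer": it pre-sums the (word, 1) pairs
    // produced by a single mapper before they are shuffled, leaving one
    // (word, partialSum) pair per distinct word.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new HashMap<>();
        for (String key : mapOutputKeys) {
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("cat", "dog", "cat", "cat", "dog");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(combined.size() + " pairs shuffled instead of " + mapOutput.size());
    }
}
```

This only works because summing is associative and commutative; the real reducer then sums the partial sums and gets the same answer.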


Can we wrap something around the map() and reduce() calls?

Setup

/**
 * Called once at the start of the task.
 */
protected void setup(Context context) throws IOException, InterruptedException {
    // Prepare something for each Mapper or Reducer
    // Validate external sources
}

Cleanup

/**
 * Called once at the end of the task.
 */
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Finish something after each Mapper or Reducer
    // Handle specific exceptions
}

Full control

Run Mapper

/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
    } finally {
        cleanup(context);
    }
}

Run Reducer

/**
 * Advanced application writers can use the
 * {@link #run(*.Reducer.Context)} method to
 * control how the reduce task works.
 */
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
            // If a backup store is used, reset it
            Iterator<VALUEIN> iter = context.getValues().iterator();
            if (iter instanceof ReduceContext.ValueIterator) {
                ((ReduceContext.ValueIterator<VALUEIN>) iter).resetBackupStore();
            }
        }
    } finally {
        cleanup(context);
    }
}

MR APPROACHES

How many Reducers do we need?

Data Flow with single Reducer

• receives all keys in sorted order

• the output will be completely sorted

• may take a long time to run

• can throw OutOfMemoryError

One Reducer

Data Flow with multiple Reducers

How many Reducers should we use?

• Key will be the weekday

• Seven Reducers will be specified

• A Partitioner will be written which sends one key to each Reducer

Ideal Case: data grouped by day of week
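Such a weekday Partitioner can be sketched in plain Java (a simulation of getPartition(), not the Hadoop Partitioner API; the three-letter weekday keys are an assumed format):

```java
import java.util.*;

public class WeekdayPartitioner {
    // Assumed key format: three-letter weekday abbreviations.
    static final List<String> DAYS = Arrays.asList(
            "MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN");

    // One partition per weekday: with seven reducers, each reducer
    // receives exactly one day's records.
    static int getPartition(String weekdayKey, int numReduceTasks) {
        int day = DAYS.indexOf(weekdayKey.toUpperCase());
        if (day < 0) {
            throw new IllegalArgumentException("Unknown weekday: " + weekdayKey);
        }
        return day % numReduceTasks; // with numReduceTasks == 7 this is just 'day'
    }
}
```

Because the key space is known and small, the partition function is a direct lookup rather than a hash, guaranteeing an even one-day-per-reducer split.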

Could we skip the shuffle step?

Map-Only ‘MapReduce’ Jobs

HADOOP ADVANCED

The main concept of Distributed Cache

Share common data


• For ToolRunner in the command line with the –files option

• –archives

• –libjars

• In code: Job.addCacheFile(URI uriToFileInHdfs);

• Job.addCacheArchive(URI uriToArchiveFileInHdfs);

• Job.addArchiveToClassPath(Path pathToJarInHdfs);

How to implement?

How to get data from Cache?

To list the content of the Distributed Cache:

Path[] paths = context.getLocalCacheFiles();

Then you have to check each path to pick up your file:

for (Path p : paths) {
    if (p.getName().toUpperCase().contains("TARGET")) {
        ….
    }
}

Counters usage

Counters

Can we customize the data flow before shuffling?

HashPartitioner just does it…

public class HashPartitioner<K2, V2> extends Partitioner<K2, V2> {
    public int getPartition(K2 key, V2 value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
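Why the & Integer.MAX_VALUE mask instead of Math.abs()? hashCode() can be negative, and Math.abs(Integer.MIN_VALUE) is still negative, so masking the sign bit is the safe way to get a non-negative partition. A small standalone demo (the test string is a well-known example that hashes to Integer.MIN_VALUE):

```java
public class HashPartitionDemo {
    // Same formula as HashPartitioner: clear the sign bit so a negative
    // hashCode() can never produce a negative partition number.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String key = "polygenelubricants"; // hashes to Integer.MIN_VALUE
        System.out.println(key.hashCode());    // negative
        System.out.println(partition(key, 7)); // still in [0, 6]
        // Math.abs(key.hashCode()) % 7 would return a negative index here,
        // because Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE.
    }
}
```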

Partitioner’s Role in Shuffle and Sort

• You want to route records by another rule (e.g. hash the value)

• You need to know more about secondary sort

• Extend the Partitioner abstract class

• getPartition(key, value, number of Reducers)

• Return an int in [0, number of Reducers – 1]

• Don’t forget job.setPartitionerClass(MyPartitioner.class);

How to customize?

Full power

JOINS

Reduce-side JOIN

The main concept of JOIN

Let’s skip Reduce + Shuffle

Map-side join

Map-side join for large datasets

What about Really Large Tables?

SELECT Employees.Name, Employees.Age, Department.Name FROM Employees INNER JOIN Department ON Employees.Dept_Id=Department.Dept_Id

The main JOIN idea for large tables

Redis or Memcache cluster as Distributed Cache
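A plain-Java sketch of the map-side join idea (the Employees/Department fields follow the SQL above; loading the small table from the Distributed Cache or a Redis/Memcache cluster is left out, it simply arrives as an in-memory map):

```java
import java.util.*;

public class MapSideJoinSketch {
    // Map-side join: the small table (Department) sits in memory, and each
    // Employees record is joined inside map() -- no shuffle, no reduce.
    static String join(Map<Integer, String> departments,
                       String name, int age, int deptId) {
        String dept = departments.get(deptId);
        // Inner join semantics: records with no matching department are dropped.
        return dept == null ? null : name + "," + age + "," + dept;
    }

    public static void main(String[] args) {
        Map<Integer, String> departments = new HashMap<>();
        departments.put(1, "Sales");
        departments.put(2, "R&D");
        System.out.println(join(departments, "Alice", 30, 2)); // Alice,30,R&D
        System.out.println(join(departments, "Bob", 45, 1));   // Bob,45,Sales
    }
}
```

This only pays off when one side of the join fits in memory on every mapper; when both tables are huge, an external store such as Redis or Memcache plays the role of the in-memory map.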

PERFORMANCE

Dig to DAG

The main concept of DAG

Decompose your problem, build a DAG, skip HDFS

Let’s run on the JVM!

JVM SETTINGS

Typical mistakes

• Collections in memory

• Sorting all data in memory

• Logging each input key-value pair

• Many JARs

• JVM issues


Performance tips

• Correct data storage (on the JVM)

• Don’t forget about the combiner

• Use the appropriate Writable type

• Minimal required replication factor

• Tune your JVM

• Think in terms of Big-O

JVM tuning

• mapred.child.java.opts

• -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

• Low-latency GC collector: -XX:+UseConcMarkSweepGC, -XX:ParallelGCThreads

• Xmx == Xms (max and starting heap size)

JVM Reusing: Uber Task

TESTING & DEVELOPMENT

MRUnit idea

Do it separately

Simple Test

public class MRUnitHelloWorld {

    MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        WordMapper mapper = new WordMapper();
        mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
        mapDriver.setMapper(mapper);
    }

    @Test
    public void testMapper() {
        mapDriver.withInput(new LongWritable(1), new Text("cat dog"));
        mapDriver.withOutput(new Text("cat"), new IntWritable(1));
        mapDriver.withOutput(new Text("dog"), new IntWritable(1));
        mapDriver.runTest();
    }
}

• First develop/test in local mode using a small amount of data

• Test in pseudo-distributed mode with more data

• Test in fully distributed mode with even more data

• Final execution: fully distributed mode & all data

Testing strategies

And we can DO IT!

Real-Time Data-Marts

Batch Data-Marts

Relations Graph

Ontology Metadata

Search Index

Events & Alarms

Real-time Dashboarding

Events & Alarms

All Raw Data backup is stored here

Real-time Data Ingestion

Batch Data Ingestion

Real-Time ETL & CEP

Batch ETL & Raw Area

Scheduler

Internal

External

Social

HDFS → CFS as an option

Time-Series Data

Titan & KairosDB store data in Cassandra

Push Events & Alarms (Email, SNMP etc.)

It reminds me …

Infrastructure issues are waiting for YOU!

• Move to Java 8

• Support more than 2 NameNodes (multiple standby NameNodes)

• Derive heap size or mapreduce.*.memory.mb automatically

• Work with SSD, RAM, HDD, CPU as resources for YARN

• Support Docker containers

Hadoop 3: Roadmap

MapReduce is not an ideal approach! But it works!