INTRODUCTION TO HADOOP Rennes – 2014-11-06 David Morin - @davAtBzh BreizhJug

Upload: david-morin

Post on 07-Jul-2015


DESCRIPTION

Introduction to Hadoop and its ecosystem at BreizhJug

TRANSCRIPT

Page 1: Hadoop breizhjug

INTRODUCTION TO HADOOP

Rennes – 2014-11-06 David Morin - @davAtBzh

BreizhJug

Page 2: Hadoop breizhjug

Me

Solutions Engineer at Crédit Mutuel Arkéa

David Morin - @davAtBzh

Page 3: Hadoop breizhjug

3

What is Hadoop ?

Page 4: Hadoop breizhjug

4

An elephant – This one ?

Page 5: Hadoop breizhjug

5

No, this one !

Page 6: Hadoop breizhjug

6

The father

Page 7: Hadoop breizhjug

7

Let's go !

Page 8: Hadoop breizhjug

8

Let's go !

Page 9: Hadoop breizhjug

9

Timeline

Page 10: Hadoop breizhjug

10

How did the story begin ?

=> Dealing with high volumes of data

Page 11: Hadoop breizhjug

11

Big Data – Big Server ?

Page 12: Hadoop breizhjug

12

Big Data – Big Server ?

Page 13: Hadoop breizhjug

13

Big Data – Big Problems ?

Page 14: Hadoop breizhjug

14

Big Data – Big Problems ?

Page 15: Hadoop breizhjug

15

Split is the key

Page 16: Hadoop breizhjug

16

How to find data ?

Page 17: Hadoop breizhjug

17

Define a master

Page 18: Hadoop breizhjug

18

Try again

Page 19: Hadoop breizhjug

19

Not so bad

Page 20: Hadoop breizhjug

20

Hadoop fundamentals

● Distributed FileSystem for high volumes of data
● Use of commodity servers (to limit costs)
● Scalable / fault tolerant

Page 21: Hadoop breizhjug

21

HDFS

HDFS

Page 22: Hadoop breizhjug

22

Hadoop Distributed FileSystem

Page 23: Hadoop breizhjug

23

Hadoop fundamentals

● Distributed FileSystem for high volumes of data
● Use of commodity servers (to limit costs)
● Scalable / fault tolerant ??

Page 24: Hadoop breizhjug

24

Hadoop Distributed FileSystem

Page 25: Hadoop breizhjug

25

MapReduce

HDFS MapReduce

Page 26: Hadoop breizhjug

26

MapReduce

Page 27: Hadoop breizhjug

27

MapReduce : word count

Map Reduce
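The word count on this slide is MapReduce's canonical example: the map phase emits a (word, 1) pair per token, the shuffle groups the pairs by key, and the reduce phase sums each group. A minimal single-machine sketch of that flow, with no Hadoop APIs (class and method names here are illustrative only):

```java
import java.util.*;

public class WordCountSim {

    // Map phase: emit one (word, 1) pair per token of every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle + reduce: group the pairs by key, then sum each group's values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello breizhjug");
        System.out.println(reduce(map(lines)));
        // prints {breizhjug=1, hadoop=1, hello=2}
    }
}
```

On a real cluster the map tasks run in parallel on the nodes holding the input blocks, and the shuffle ships each key's pairs to the reducer responsible for it.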

Page 28: Hadoop breizhjug

28

Data Locality Optimization

Page 29: Hadoop breizhjug

29

MapReduce in action

Page 30: Hadoop breizhjug

30

Hadoop v1 : drawbacks

– One NameNode: SPOF
– One JobTracker: SPOF and not scalable (limited number of nodes)
– MapReduce only: the platform should be opened to non-MR applications
– MapReduce v1: does not fit well with the iterative algorithms used in Machine Learning

Page 31: Hadoop breizhjug

31

Hadoop v2

Improvements :
– HDFS v2: standby NameNode (the NameNode is no longer a SPOF)
– YARN (Yet Another Resource Negotiator)
  ● The JobTracker becomes a Resource Manager plus Application Masters (one per application)
  ● Can be used by non-MapReduce applications
– MapReduce v2: runs on top of YARN

Page 32: Hadoop breizhjug

32

Hadoop v2

Page 33: Hadoop breizhjug

33

YARN

Page 34: Hadoop breizhjug

34

YARN

Page 35: Hadoop breizhjug

35

YARN

Page 36: Hadoop breizhjug

36

YARN

Page 37: Hadoop breizhjug

37

YARN

Page 38: Hadoop breizhjug

38

YARN

Page 39: Hadoop breizhjug

39

What about monitoring ?

● Command line: hadoop job, yarn
● Web UI to monitor cluster status
● Web UI to check the status of running jobs
● Access to the log files of node activity from the web UI

Page 40: Hadoop breizhjug

40

What about monitoring ?

Page 41: Hadoop breizhjug

41

What can we do with Hadoop ?

(Me) 2 projects at Crédit Mutuel Arkéa :
– LAB : Anti-money laundering
– Operational reporting for a B2B customer

Page 42: Hadoop breizhjug

42

LAB : Context

● Tracfin : the French anti-money-laundering unit, supervised by the Ministry for the Economy and Finance

Page 43: Hadoop breizhjug

43

LAB : Context

● Difficult to produce accurate alerts : the system is complex to maintain and to extend with new features

Page 44: Hadoop breizhjug

44

LAB : Context

● Batch Cobol (z/OS) : ran from 7 pm until 9 am the next day

Page 45: Hadoop breizhjug

45

LAB : Migration to Hadoop

● Pig : the Pig dataflow model fits this kind of process well (a lot of data manipulation)

Page 46: Hadoop breizhjug

46

LAB : Migration to Hadoop

● A lot of input data : +1 for Pig

Page 47: Hadoop breizhjug

47

LAB : Migration to Hadoop

● A lot of job tasks can be parallelized : +1 for Hadoop

Page 48: Hadoop breizhjug

48

LAB : Migration to Hadoop

● Time spent on data manipulation reduced by more than 50%

Page 49: Hadoop breizhjug

49

LAB : Migration to Hadoop

● The previous job was a batch : MapReduce is OK

Page 50: Hadoop breizhjug

50

Operational Reporting

Context :
– Provide a large variety of reports to a B2B partner

Why Hadoop :
– New project
– A huge number of different data sources as input : Pig, help me !
– Batch is OK

Page 51: Hadoop breizhjug

51

Page 52: Hadoop breizhjug

52

Pig – Why a new language ?

● With Pig, writing MR jobs becomes easy
● Dataflow model : data is the key !
● Language : Pig Latin
● No limit : User Defined Functions (UDFs)

http://pig.apache.org/docs/r0.13.0/
https://github.com/linkedin/datafu
https://github.com/twitter/elephant-bird
https://cwiki.apache.org/confluence/display/PIG/PiggyBank

Page 53: Hadoop breizhjug

53

● Pig Wordcount

-- Load a file from HDFS

lines = LOAD '/user/XXX/file.txt' AS (line:chararray);

-- Iterate on each line
-- We use TOKENIZE to split the line into words and FLATTEN to obtain one tuple per word

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group by word

grouped = GROUP words BY word;

-- Count the number of occurrences for each group (word)

wordcount = FOREACH grouped GENERATE group, COUNT(words);

-- Display the results on stdout

DUMP wordcount;

Pig “Hello world”

Page 54: Hadoop breizhjug

54

=> 130 lines of code !

import ...

public class WordCount2 {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    static enum CountersEnum { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive;
    private Set<String> patternsToSkip = new HashSet<String>();

    private Configuration conf;
    private BufferedReader fis;

    ...

Pig vs MapReduce

Page 55: Hadoop breizhjug

55

● SQL-like language : HQL
● Metastore : data abstraction and data discovery
● UDFs

Hive

Page 56: Hadoop breizhjug

56

● Hive Wordcount

-- Create a table with its structure (DDL)

CREATE TABLE docs (line STRING);

-- Load data

LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;

-- Create a table for the results
-- Select from the previous table, split each line into words, group by word
-- and count the records per group

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Hive “Hello world”

Page 57: Hadoop breizhjug

57

Zookeeper

Purpose : coordinate the interactions between the different actors, and serve the global configuration we push to it.

Page 58: Hadoop breizhjug

58

Zookeeper

● Distributed coordination service

Page 59: Hadoop breizhjug

59

● Dynamic configuration
● Distributed locking

Zookeeper
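A well-known use of this coordination service is ZooKeeper's lock recipe: each client creates an ephemeral sequential znode under the lock path, and the lock belongs to the client whose znode carries the lowest sequence number (the others watch their predecessor). A sketch of just that ordering rule, without any ZooKeeper client (all names are illustrative):

```java
import java.util.*;

public class ZkLockOrder {

    // Given the children of a lock znode (ephemeral sequential names such as
    // "lock-0000000042"), return them in the order the lock will be granted:
    // lowest sequence number first.
    static List<String> grantOrder(Collection<String> children) {
        List<String> ordered = new ArrayList<>(children);
        ordered.sort(Comparator.comparingLong(ZkLockOrder::sequenceOf));
        return ordered;
    }

    // The sequence suffix is the 10-digit counter ZooKeeper appends to the name.
    static long sequenceOf(String znode) {
        return Long.parseLong(znode.substring(znode.lastIndexOf('-') + 1));
    }

    public static void main(String[] args) {
        List<String> children =
            Arrays.asList("lock-0000000007", "lock-0000000002", "lock-0000000010");
        System.out.println(grantOrder(children));
        // prints [lock-0000000002, lock-0000000007, lock-0000000010]
    }
}
```

Because the znodes are ephemeral, a client that crashes while holding the lock releases it automatically when its session expires.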

Page 60: Hadoop breizhjug

60

● Messaging system with a specific design
● Topic (publish/subscribe) and point-to-point at the same time
● Suitable for high volumes of data

Kafka

https://kafka.apache.org/
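The "topic and point-to-point at the same time" property comes from consumer groups: every group receives every message (publish/subscribe between groups), while inside a group each message is delivered to only one member (point-to-point). A broker-free sketch of that dispatch rule, using round-robin as a stand-in for Kafka's partition assignment (all names are illustrative):

```java
import java.util.*;

public class ConsumerGroupSim {

    // Deliver each message of a topic once per group; inside a group,
    // spread the messages across members round-robin.
    static Map<String, List<String>> deliver(List<String> messages,
                                             Map<String, List<String>> groups) {
        Map<String, List<String>> inbox = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            List<String> members = group.getValue();
            for (int i = 0; i < messages.size(); i++) {
                String member = group.getKey() + "/" + members.get(i % members.size());
                inbox.computeIfAbsent(member, k -> new ArrayList<>()).add(messages.get(i));
            }
        }
        return inbox;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList("m1", "m2", "m3");
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("billing", Arrays.asList("c1"));     // one member: receives everything
        groups.put("audit", Arrays.asList("c1", "c2")); // two members: messages are split
        System.out.println(deliver(messages, groups));
        // prints {audit/c1=[m1, m3], audit/c2=[m2], billing/c1=[m1, m2, m3]}
    }
}
```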

Page 61: Hadoop breizhjug

61

Hadoop : Batch but not only..

Page 62: Hadoop breizhjug

62

Tez

● Interactive processing on top of Hive and Pig

Page 63: Hadoop breizhjug

63

HBase

● Online database (realtime querying)
● NoSQL : column-oriented database
● Based on Google BigTable
● Storage on HDFS
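Column-oriented here means each cell is addressed by (row key, column family:qualifier, timestamp): rows are sparse, a column only exists where a value was written, and each cell can hold several timestamped versions with the newest returned by default. A toy in-memory sketch of that addressing scheme, not the HBase API (all names are illustrative):

```java
import java.util.*;

public class ColumnStoreSketch {

    // rowKey -> (columnFamily:qualifier -> timestamp -> value).
    // The innermost TreeMap uses a reversed key order so the newest
    // version comes first, mirroring HBase's default read behaviour.
    private final Map<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
        new HashMap<>();

    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Latest version of one cell, or null when the cell was never written.
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) return null;
        return row.get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        ColumnStoreSketch table = new ColumnStoreSketch();
        table.put("user1", "info:city", 1L, "Rennes");
        table.put("user1", "info:city", 2L, "Brest");        // newer version wins
        System.out.println(table.get("user1", "info:city")); // prints Brest
        System.out.println(table.get("user2", "info:city")); // prints null
    }
}
```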

Page 64: Hadoop breizhjug

64

Storm

● Streaming mode
● Integrates well with Apache Kafka
● Allows data to be processed as it arrives

http://fr.slideshare.net/hugfrance/hugfr-6-oct2014ovhantiddos

http://fr.slideshare.net/miguno/apache-storm-09-basic-training-verisign

Page 65: Hadoop breizhjug

65

Cascading

● Application development platform on Hadoop
● APIs in Java : standard API, data processing, data integration, scheduler API

Page 66: Hadoop breizhjug

66

Scalding

● Scala API for Cascading

Page 67: Hadoop breizhjug

67

Phoenix

● Relational DB layer over HBase
● HBase access delivered as a JDBC client
● Perf : on the order of milliseconds for small queries, or seconds for tens of millions of rows

Page 68: Hadoop breizhjug

68

Spark

● Big data analytics, in-memory / on disk
● Complements Hadoop
● Faster and more flexible

https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark

http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
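One reason Spark suits the iterative Machine Learning algorithms that MapReduce v1 handles poorly (slide 30) is caching: the working set is loaded once and kept in memory across iterations instead of being reread from HDFS on every pass. A toy model of that difference, not Spark code (names and numbers are purely illustrative):

```java
import java.util.*;

public class CacheSketch {

    static int loads = 0;

    // Simulates reading the input dataset from distributed storage.
    static List<Integer> loadFromStorage() {
        loads++;
        return Arrays.asList(1, 2, 3, 4);
    }

    // MapReduce-v1 style: every iteration rereads the input from storage.
    static long iterateWithoutCache(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            for (int v : loadFromStorage()) total += v;
        }
        return total;
    }

    // Spark style: load once, cache in memory, reuse across iterations.
    static long iterateWithCache(int iterations) {
        List<Integer> cached = loadFromStorage();
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            for (int v : cached) total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        loads = 0;
        long withoutCache = iterateWithoutCache(10);
        int reads = loads;
        loads = 0;
        long withCache = iterateWithCache(10);
        System.out.println("totals: " + withoutCache + " / " + withCache
            + ", storage reads: " + reads + " vs " + loads);
        // same totals, but 10 storage reads without caching vs 1 with it
    }
}
```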

Page 69: Hadoop breizhjug

69

??