INTRODUCTION TO HADOOP Rennes – 2014-11-06 David Morin - @davAtBzh BreizhJug

Upload: david-morin

Post on 07-Jul-2015


DESCRIPTION

Introduction to Hadoop and its ecosystem at BreizhJug

TRANSCRIPT

Page 1: Hadoop breizhjug

INTRODUCTION TO HADOOP

Rennes – 2014-11-06 David Morin - @davAtBzh

BreizhJug

Page 2: Hadoop breizhjug

Me

Solutions Engineer at Crédit Mutuel Arkéa

David Morin - @davAtBzh

Page 3: Hadoop breizhjug

3

What is Hadoop ?

Page 4: Hadoop breizhjug

4

An elephant – This one ?

Page 5: Hadoop breizhjug

5

No, this one !

Page 6: Hadoop breizhjug

6

The father

Page 7: Hadoop breizhjug

7

Let's go !

Page 8: Hadoop breizhjug

8

Let's go !

Page 9: Hadoop breizhjug

9

Timeline

Page 10: Hadoop breizhjug

10

How did the story begin ?

=> Dealing with high volumes of data

Page 11: Hadoop breizhjug

11

Big Data – Big Server ?

Page 12: Hadoop breizhjug

12

Big Data – Big Server ?

Page 13: Hadoop breizhjug

13

Big Data – Big Problems ?

Page 14: Hadoop breizhjug

14

Big Data – Big Problems ?

Page 15: Hadoop breizhjug

15

Split is the key

Page 16: Hadoop breizhjug

16

How to find data ?

Page 17: Hadoop breizhjug

17

Define a master

Page 18: Hadoop breizhjug

18

Try again

Page 19: Hadoop breizhjug

19

Not so bad

Page 20: Hadoop breizhjug

20

Hadoop fundamentals

● Distributed FileSystem for high volumes of data
● Use of commodity servers (to limit costs)
● Scalable / fault tolerant

Page 21: Hadoop breizhjug

21

HDFS

HDFS

Page 22: Hadoop breizhjug

22

Hadoop Distributed FileSystem

Page 23: Hadoop breizhjug

23

Hadoop fundamentals

● Distributed FileSystem for high volumes of data
● Use of commodity servers (to limit costs)
● Scalable / fault tolerant ??

Page 24: Hadoop breizhjug

24

Hadoop Distributed FileSystem

Page 25: Hadoop breizhjug

25

MapReduce

HDFS MapReduce

Page 26: Hadoop breizhjug

26

MapReduce

Page 27: Hadoop breizhjug

27

MapReduce : word count

Map Reduce
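The word count on this slide is MapReduce's canonical example: the map phase emits a (word, 1) pair per token, the shuffle groups the pairs by key, and the reduce phase sums each group. A minimal single-machine sketch of that flow, with no Hadoop APIs (class and method names here are illustrative only):

```java
import java.util.*;

public class WordCountSim {

    // Map phase: emit one (word, 1) pair per token of every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle + reduce: group the pairs by key, then sum each group's values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello breizhjug");
        System.out.println(reduce(map(lines)));
        // prints {breizhjug=1, hadoop=1, hello=2}
    }
}
```

On a real cluster the map tasks run in parallel on the nodes holding the input blocks, and the shuffle ships each key's pairs to the reducer responsible for it.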

Page 28: Hadoop breizhjug

28

Data Locality Optimization

Page 29: Hadoop breizhjug

29

MapReduce in action

Page 30: Hadoop breizhjug

30

Hadoop v1 : drawbacks

– One NameNode: SPOF
– One JobTracker: SPOF and not scalable (limited number of nodes)
– MapReduce only: the platform should be opened to non-MR applications
– MapReduce v1: does not fit well with the iterative algorithms used in Machine Learning

Page 31: Hadoop breizhjug

31

Hadoop v2

Improvements :
– HDFS v2: standby NameNode (the NameNode is no longer a SPOF)
– YARN (Yet Another Resource Negotiator)
  ● The JobTracker becomes a Resource Manager plus Application Masters (one per application)
  ● Can be used by non-MapReduce applications
– MapReduce v2: runs on top of YARN

Page 32: Hadoop breizhjug

32

Hadoop v2

Page 33: Hadoop breizhjug

33

YARN

Page 34: Hadoop breizhjug

34

YARN

Page 35: Hadoop breizhjug

35

YARN

Page 36: Hadoop breizhjug

36

YARN

Page 37: Hadoop breizhjug

37

YARN

Page 38: Hadoop breizhjug

38

YARN

Page 39: Hadoop breizhjug

39

What about monitoring ?

● Command line: hadoop job, yarn
● Web UI to monitor cluster status
● Web UI to check the status of running jobs
● Access to the log files of node activity from the web UI

Page 40: Hadoop breizhjug

40

What about monitoring ?

Page 41: Hadoop breizhjug

41

What can we do with Hadoop ?

(Me) 2 projects at Crédit Mutuel Arkéa :
– LAB : Anti-money laundering
– Operational reporting for a B2B customer

Page 42: Hadoop breizhjug

42

LAB : Context

● Tracfin : the French anti-money-laundering unit, supervised by the Ministry for the Economy and Finance

Page 43: Hadoop breizhjug

43

LAB : Context

● Difficult to produce accurate alerts : the system is complex to maintain and to extend with new features

Page 44: Hadoop breizhjug

44

LAB : Context

● Batch Cobol (z/OS) : ran from 7 pm until 9 am the next day

Page 45: Hadoop breizhjug

45

LAB : Migration to Hadoop

● Pig : the Pig dataflow model fits this kind of process well (a lot of data manipulation)

Page 46: Hadoop breizhjug

46

LAB : Migration to Hadoop

● A lot of input data : +1 for Pig

Page 47: Hadoop breizhjug

47

LAB : Migration to Hadoop

● A lot of job tasks can be parallelized : +1 for Hadoop

Page 48: Hadoop breizhjug

48

LAB : Migration to Hadoop

● Time spent on data manipulation reduced by more than 50%

Page 49: Hadoop breizhjug

49

LAB : Migration to Hadoop

● The previous job was a batch : MapReduce is OK

Page 50: Hadoop breizhjug

50

Operational Reporting

Context :
– Provide a large variety of reports to a B2B partner

Why Hadoop :
– New project
– A huge number of different data sources as input : Pig, help me !
– Batch is OK

Page 51: Hadoop breizhjug

51

Page 52: Hadoop breizhjug

52

Pig – Why a new language ?

● With Pig, writing MR jobs becomes easy
● Dataflow model : data is the key !
● Language : Pig Latin
● No limit : User Defined Functions (UDFs)

http://pig.apache.org/docs/r0.13.0/
https://github.com/linkedin/datafu
https://github.com/twitter/elephant-bird
https://cwiki.apache.org/confluence/display/PIG/PiggyBank

Page 53: Hadoop breizhjug

53

● Pig Wordcount

-- Load a file from HDFS

lines = LOAD '/user/XXX/file.txt' AS (line:chararray);

-- Iterate on each line
-- We use TOKENIZE to split the line into words and FLATTEN to obtain one tuple per word

words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group by word

grouped = GROUP words BY word;

-- Count the number of occurrences for each group (word)

wordcount = FOREACH grouped GENERATE group, COUNT(words);

-- Display the results on stdout

DUMP wordcount;

Pig “Hello world”

Page 54: Hadoop breizhjug

54

=> 130 lines of code !

import ...

public class WordCount2 {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    static enum CountersEnum { INPUT_WORDS }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive;
    private Set<String> patternsToSkip = new HashSet<String>();

    private Configuration conf;
    private BufferedReader fis;

    ...

Pig vs MapReduce

Page 55: Hadoop breizhjug

55

● SQL-like language : HQL
● Metastore : data abstraction and data discovery
● UDFs

Hive

Page 56: Hadoop breizhjug

56

● Hive Wordcount

-- Create a table with its structure (DDL)

CREATE TABLE docs (line STRING);

-- Load data

LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;

-- Create a table for the results
-- Select from the previous table, split each line into words, group by word
-- and count the records per group

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Hive “Hello world”

Page 57: Hadoop breizhjug

57

Zookeeper

Purpose : coordinate the interactions between the different actors, and serve the global configuration we push to it.

Page 58: Hadoop breizhjug

58

Zookeeper

● Distributed coordination service

Page 59: Hadoop breizhjug

59

● Dynamic configuration
● Distributed locking

Zookeeper
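A well-known use of this coordination service is ZooKeeper's lock recipe: each client creates an ephemeral sequential znode under the lock path, and the lock belongs to the client whose znode carries the lowest sequence number (the others watch their predecessor). A sketch of just that ordering rule, without any ZooKeeper client (all names are illustrative):

```java
import java.util.*;

public class ZkLockOrder {

    // Given the children of a lock znode (ephemeral sequential names such as
    // "lock-0000000042"), return them in the order the lock will be granted:
    // lowest sequence number first.
    static List<String> grantOrder(Collection<String> children) {
        List<String> ordered = new ArrayList<>(children);
        ordered.sort(Comparator.comparingLong(ZkLockOrder::sequenceOf));
        return ordered;
    }

    // The sequence suffix is the 10-digit counter ZooKeeper appends to the name.
    static long sequenceOf(String znode) {
        return Long.parseLong(znode.substring(znode.lastIndexOf('-') + 1));
    }

    public static void main(String[] args) {
        List<String> children =
            Arrays.asList("lock-0000000007", "lock-0000000002", "lock-0000000010");
        System.out.println(grantOrder(children));
        // prints [lock-0000000002, lock-0000000007, lock-0000000010]
    }
}
```

Because the znodes are ephemeral, a client that crashes while holding the lock releases it automatically when its session expires.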

Page 60: Hadoop breizhjug

60

● Messaging system with a specific design
● Topic (publish/subscribe) and point-to-point at the same time
● Suitable for high volumes of data

Kafka

https://kafka.apache.org/
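The "topic and point-to-point at the same time" property comes from consumer groups: every group receives every message (publish/subscribe between groups), while inside a group each message is delivered to only one member (point-to-point). A broker-free sketch of that dispatch rule, using round-robin as a stand-in for Kafka's partition assignment (all names are illustrative):

```java
import java.util.*;

public class ConsumerGroupSim {

    // Deliver each message of a topic once per group; inside a group,
    // spread the messages across members round-robin.
    static Map<String, List<String>> deliver(List<String> messages,
                                             Map<String, List<String>> groups) {
        Map<String, List<String>> inbox = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            List<String> members = group.getValue();
            for (int i = 0; i < messages.size(); i++) {
                String member = group.getKey() + "/" + members.get(i % members.size());
                inbox.computeIfAbsent(member, k -> new ArrayList<>()).add(messages.get(i));
            }
        }
        return inbox;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList("m1", "m2", "m3");
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("billing", Arrays.asList("c1"));     // one member: receives everything
        groups.put("audit", Arrays.asList("c1", "c2")); // two members: messages are split
        System.out.println(deliver(messages, groups));
        // prints {audit/c1=[m1, m3], audit/c2=[m2], billing/c1=[m1, m2, m3]}
    }
}
```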

Page 61: Hadoop breizhjug

61

Hadoop : Batch but not only..

Page 62: Hadoop breizhjug

62

Tez

● Interactive processing on top of Hive and Pig

Page 63: Hadoop breizhjug

63

HBase

● Online database (realtime querying)
● NoSQL : column-oriented database
● Based on Google BigTable
● Storage on HDFS
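Column-oriented here means each cell is addressed by (row key, column family:qualifier, timestamp): rows are sparse, a column only exists where a value was written, and each cell can hold several timestamped versions with the newest returned by default. A toy in-memory sketch of that addressing scheme, not the HBase API (all names are illustrative):

```java
import java.util.*;

public class ColumnStoreSketch {

    // rowKey -> (columnFamily:qualifier -> timestamp -> value).
    // The innermost TreeMap uses a reversed key order so the newest
    // version comes first, mirroring HBase's default read behaviour.
    private final Map<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
        new HashMap<>();

    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Latest version of one cell, or null when the cell was never written.
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) return null;
        return row.get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        ColumnStoreSketch table = new ColumnStoreSketch();
        table.put("user1", "info:city", 1L, "Rennes");
        table.put("user1", "info:city", 2L, "Brest");        // newer version wins
        System.out.println(table.get("user1", "info:city")); // prints Brest
        System.out.println(table.get("user2", "info:city")); // prints null
    }
}
```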

Page 64: Hadoop breizhjug

64

Storm

● Streaming mode
● Integrates well with Apache Kafka
● Allows data to be processed as it arrives

http://fr.slideshare.net/hugfrance/hugfr-6-oct2014ovhantiddos

http://fr.slideshare.net/miguno/apache-storm-09-basic-training-verisign

Page 65: Hadoop breizhjug

65

Cascading

● Application development platform on Hadoop
● APIs in Java : standard API, data processing, data integration, scheduler API

Page 66: Hadoop breizhjug

66

Scalding

● Scala API for Cascading

Page 67: Hadoop breizhjug

67

Phoenix

● Relational DB layer over HBase
● HBase access delivered as a JDBC client
● Perf : on the order of milliseconds for small queries, or seconds for tens of millions of rows

Page 68: Hadoop breizhjug

68

Spark

● Big data analytics, in-memory / on disk
● Complements Hadoop
● Faster and more flexible

https://speakerdeck.com/nivdul/lightning-fast-machine-learning-with-spark

http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
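One reason Spark suits the iterative Machine Learning algorithms that MapReduce v1 handles poorly (slide 30) is caching: the working set is loaded once and kept in memory across iterations instead of being reread from HDFS on every pass. A toy model of that difference, not Spark code (names and numbers are purely illustrative):

```java
import java.util.*;

public class CacheSketch {

    static int loads = 0;

    // Simulates reading the input dataset from distributed storage.
    static List<Integer> loadFromStorage() {
        loads++;
        return Arrays.asList(1, 2, 3, 4);
    }

    // MapReduce-v1 style: every iteration rereads the input from storage.
    static long iterateWithoutCache(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            for (int v : loadFromStorage()) total += v;
        }
        return total;
    }

    // Spark style: load once, cache in memory, reuse across iterations.
    static long iterateWithCache(int iterations) {
        List<Integer> cached = loadFromStorage();
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            for (int v : cached) total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        loads = 0;
        long withoutCache = iterateWithoutCache(10);
        int reads = loads;
        loads = 0;
        long withCache = iterateWithCache(10);
        System.out.println("totals: " + withoutCache + " / " + withCache
            + ", storage reads: " + reads + " vs " + loads);
        // same totals, but 10 storage reads without caching vs 1 with it
    }
}
```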

Page 69: Hadoop breizhjug

69

??