Open XKE - Big Data, Big Mess, by Bertrand Dechoux

Big Data, Big Mess?
By Bertrand Dechoux

Hadoop Experience

• first contact in early 2010
• Hadoop consultant and trainer @ Xebia

Agenda

• Hadoop MapReduce 101
• Java API, Hadoop Streaming
• Hive, Pig and Cascading
• And the data?

Hadoop MapReduce 101

One problem, one solution

Goals:

• distributed computation

• high data volumes

Choices:

• commodity hardware

• local reads

Map and Reduce

[Diagram: input DATA blocks are read by parallel map tasks; the map outputs go to reduce tasks, which write the output DATA]
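
The map and reduce flow above can be sketched in plain Java, without Hadoop, as a single-process word count. This is only an illustration of the programming model: the class `MiniMapReduce` and its methods are hypothetical names, and the in-memory grouping step stands in for the distributed shuffle that Hadoop performs between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {

    // "map": turn one input record (a line) into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "reduce": combine all the values emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: map every record, group the pairs by key (the "shuffle"),
    // then reduce each group. Hadoop distributes these steps; here they
    // run sequentially in one process.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=3, data=1, hadoop=1, mess=1}
        System.out.println(run(Arrays.asList("big data big mess", "big hadoop")));
    }
}
```

Because `map` looks at one record at a time and `reduce` at one key at a time, both can be run in parallel on separate machines, which is the property the framework exploits.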

What you are given

• primitives
• in Java
• functional
• for distributed batch processing

Java API, Hadoop Streaming

The Java API

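import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token of every input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts received for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Deprecated since Hadoop 2, where Job.getInstance(conf, "wordcount") is preferred.
    Job job = new Job(conf, "wordcount");
    // Lets the cluster locate the jar that contains Map and Reduce.
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}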

Simple Industrialization

• dependencies -> Maven
• tests -> MRUnit + JUnit + Maven
• release -> Maven + Jenkins + Nexus

Classic use case

• log centralization
• how do the operations staff use the logs?

Beyond Java: Hadoop Streaming

• reads and writes through stdin/stdout
• integration of legacy code
• simple jobs only
• industrialization without problems
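
The stdin/stdout contract can be sketched as follows. A streaming mapper is any executable that reads records line by line on standard input and writes `key<TAB>value` lines on standard output, the tab being the default key/value separator in Hadoop Streaming. The point of streaming is to use languages other than Java; Java is kept here only so that all the examples share one language, and the class name `StreamingWordMapper` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;

public class StreamingWordMapper {

    // Reads input records line by line and emits one "word<TAB>1" line per token.
    static void mapLines(Reader in, Writer out) {
        PrintWriter writer = new PrintWriter(out);
        new BufferedReader(in).lines().forEach(line -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    writer.print(word + "\t1\n");
                }
            }
        });
        writer.flush();
    }

    public static void main(String[] args) {
        // When launched by the framework, stdin carries the input split
        // and stdout is collected as the map output.
        mapLines(new InputStreamReader(System.in), new OutputStreamWriter(System.out));
    }
}
```

The program knows nothing about Hadoop, which is what makes wrapping legacy tools easy, and also why anything beyond simple jobs becomes awkward.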

Hive, Pig and Cascading

Hive and Pig

Pig:

• PigLatin
• 'eats anything'
• DAG

Hive:

• HiveQL
• structured
• tree

Industrialization?

• dependencies -> Maven
• tests -> JUnit + Maven
• release -> Maven + Jenkins + Nexus

Laborious Industrialization

• 1 MapReduce job -> at least 10 seconds
• 1 query -> ???
• n queries -> too long
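
The bullets above can be read as a rough lower bound. As a sketch, assume each query compiles to k MapReduce jobs run one after another (k is the unknown '???' above) and that the 10-second floor applies to every job. The values n = 100 and k = 3 are purely illustrative, not measurements.

```latex
T_{\text{total}} \;\ge\; n \cdot k \cdot 10\,\text{s}
\qquad\text{e.g. } n = 100,\ k = 3 \;\Rightarrow\; T_{\text{total}} \ge 3000\,\text{s} = 50\,\text{min}
```

Under these assumptions even a modest set of queries takes far longer than a unit-test suite is expected to.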

Cascading

• similar principle to Hive and Pig
• a higher-level API in Java
• or Scala: Scalding
• or Clojure: Cascalog

• Hadoop is not the only platform

And the data?

Files

[Comparison of file types (text, SequenceFile, Avro) along two axes: interoperability and performance]

The filesystem: HDFS

• few files
• large files
• optimized for streaming reads

The database: HBase

• a BigTable clone
• essentially a Map with sorted keys
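
The "Map with sorted keys" intuition can be illustrated with a plain `java.util.TreeMap`. This is not the HBase client API: HBase's real model adds column families, versions and byte-array keys. The class `SortedMapSketch`, the row-key layout and the sample rows are all hypothetical.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

public class SortedMapSketch {

    // Toy table: row key -> value, kept sorted by key.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("user#010|2012-01-07", "logout");
        table.put("user#001|2012-01-09", "purchase");
        table.put("user#002|2012-01-06", "login");
        table.put("user#001|2012-01-05", "login");
        return table;
    }

    // Range scan: every row whose key starts with the given prefix.
    // Sorted keys make this a contiguous slice rather than a full scan.
    static SortedMap<String, String> scanPrefix(NavigableMap<String, String> table,
                                                String prefix) {
        return table.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        // prints {user#001|2012-01-05=login, user#001|2012-01-09=purchase}
        System.out.println(scanPrefix(sampleTable(), "user#001|"));
    }
}
```

The practical consequence is that the row-key design decides which queries are cheap: rows that should be read together need keys that sort next to each other.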

Data Management

• HCatalog
  • inspired by the Hive metastore
  • describes the datasets

• Avro
  • a file that contains its own description
  • performant
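
The idea of "a file that contains its own description" can be shown with a toy binary format whose header lists the field names before the records. This is not Avro's actual container format nor its API, only the principle that a reader needs no external schema. The class `SelfDescribingFile` and its layout are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SelfDescribingFile {

    // Writes the field names first (the "description"), then the records.
    // Assumes every record has exactly one value per field name.
    static byte[] write(List<String> fieldNames, List<List<String>> records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(fieldNames.size());
            for (String name : fieldNames) {
                out.writeUTF(name);
            }
            out.writeInt(records.size());
            for (List<String> record : records) {
                for (String value : record) {
                    out.writeUTF(value);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the data back using only the header found in the file itself.
    static List<Map<String, String>> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int fieldCount = in.readInt();
            List<String> fieldNames = new ArrayList<>();
            for (int i = 0; i < fieldCount; i++) {
                fieldNames.add(in.readUTF());
            }
            int recordCount = in.readInt();
            List<Map<String, String>> records = new ArrayList<>();
            for (int r = 0; r < recordCount; r++) {
                Map<String, String> record = new LinkedHashMap<>();
                for (String name : fieldNames) {
                    record.put(name, in.readUTF());
                }
                records.add(record);
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = write(Arrays.asList("word", "count"),
                Arrays.asList(Arrays.asList("big", "3"), Arrays.asList("data", "1")));
        // prints [{word=big, count=3}, {word=data, count=1}]
        System.out.println(read(file));
    }
}
```

Since the description travels with the data, a job written later, or in another language, can still interpret the file.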

Data Management

•management = coordination

•data steward / data custodian

Does all of this matter?

Any questions?

Thank you!
