Distributed batch processing with Hadoop Ferran Galí i Reniu @ferrangali 09/01/2014


DESCRIPTION

These are the slides I used to introduce Hadoop at a meetup of the Barcelona JUG (Java Users Group).

TRANSCRIPT

Page 1: Distributed batch processing with Hadoop

Distributed batch processing with Hadoop

Ferran Galí i Reniu (@ferrangali)

09/01/2014

Page 3: Distributed batch processing with Hadoop

Problem

● Too much data
  ○ 90% of all the data in the world has been generated in the last two years
  ○ Large Hadron Collider: 25 petabytes per year
  ○ Walmart: 1M transactions per hour

● Hard disks
  ○ Cheap!
  ○ Still slow access time
  ○ Writes are even slower

Page 4: Distributed batch processing with Hadoop

Solutions

● Multiple Hard Disks
  ○ Work in parallel
  ○ We can reduce access time!

● How to deal with hardware failure?
● What if we need to combine data?

Page 6: Distributed batch processing with Hadoop

Hadoop

● Doug Cutting & Mike Cafarella

Page 8: Distributed batch processing with Hadoop

Hadoop

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
December 2004

Page 9: Distributed batch processing with Hadoop

Hadoop

● Doug Cutting & Mike Cafarella

● Yahoo!

Page 10: Distributed batch processing with Hadoop

Hadoop

● HDFS
  ○ Storage

● MapReduce
  ○ Processing

● Ecosystem

Page 12: Distributed batch processing with Hadoop

HDFS

● Distributed storage
  ○ Managed across a network of commodity machines

● Blocks
  ○ About 128 MB
  ○ Large data sets

● Tolerance to node failure
  ○ Data replication

● Streaming data access
  ○ Read many times
  ○ Write once (batch)
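The block and replication bullets above can be made concrete with a little arithmetic. The sketch below is only an illustration: the 128 MB block size comes from the slide, while the replication factor of 3 is an assumption (a commonly used HDFS default), and the class and method names are hypothetical.

```java
// Sketch only: block size taken from the slide, replication factor assumed.
class HdfsBlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // "about 128 MB"
    static final int REPLICATION = 3;                  // assumed, not from the slides

    // Number of blocks needed for a file; the last block may be partial.
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    static long rawBytes(long fileBytes) {
        return fileBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB));     // 8
        System.out.println(blockCount(oneGiB + 1)); // 9
        System.out.println(rawBytes(oneGiB));       // 3221225472
    }
}
```

Large blocks keep the number of blocks per file small, which is one reason HDFS suits large data sets better than many small files.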

Page 13: Distributed batch processing with Hadoop

HDFS

● DataNodes (Workers)
  ○ Store blocks

● NameNode (Master)
  ○ Maintains metadata
  ○ Knows where the blocks are located
  ○ Makes DataNodes fault tolerant
  ○ Single point of failure
  ○ Secondary NameNode
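The division of labour above (DataNodes hold blocks, the NameNode holds only metadata) can be pictured with a toy model. This is a minimal sketch under stated assumptions, not the HDFS implementation: the class `ToyNameNode`, its methods, and the block and node names are all hypothetical.

```java
import java.util.*;

// Toy model, not the HDFS implementation: the "NameNode" only keeps
// metadata (which DataNodes hold which block); it stores no file data.
class ToyNameNode {
    // block id -> DataNodes currently holding a replica
    private final Map<String, Set<String>> blockLocations = new HashMap<>();

    void addReplica(String blockId, String dataNode) {
        blockLocations.computeIfAbsent(blockId, k -> new TreeSet<>()).add(dataNode);
    }

    Set<String> locate(String blockId) {
        return blockLocations.getOrDefault(blockId, Collections.emptySet());
    }

    // When a DataNode dies, forget its replicas and report the blocks that
    // lost a replica and are now below the wanted replication factor.
    List<String> handleFailure(String deadNode, int wantedReplicas) {
        List<String> underReplicated = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : blockLocations.entrySet()) {
            if (e.getValue().remove(deadNode) && e.getValue().size() < wantedReplicas) {
                underReplicated.add(e.getKey());
            }
        }
        Collections.sort(underReplicated);
        return underReplicated;
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.addReplica("blk_1", "dn1");
        nn.addReplica("blk_1", "dn2");
        nn.addReplica("blk_2", "dn2");
        nn.addReplica("blk_2", "dn3");
        System.out.println(nn.handleFailure("dn1", 2)); // [blk_1]
        System.out.println(nn.locate("blk_1"));         // [dn2]
    }
}
```

Because all of this metadata lives in one place, losing that one process makes every block unreachable, which is the single point of failure the slide mentions.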

Page 15: Distributed batch processing with Hadoop

HDFS

● Interfaces
  ○ Java
  ○ Command line interface

● Load
  hadoop fs -put file.csv /user/hadoop/file.csv

● Extract
  hadoop fs -get /user/hadoop/file.csv file.csv

Page 17: Distributed batch processing with Hadoop

MapReduce

● Distributed processing paradigm
  ○ Moving computation is cheaper than moving data

● Map
  ○ Map(k1,v1) -> list(k2,v2)

● Reduce
  ○ Reduce(k2,list(v2)) -> list(v3)

Page 18: Distributed batch processing with Hadoop

Word Counter

map(Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce(String word, List values)
    emit(word, sum(values));
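The pseudocode above translates almost line by line into plain Java. The following is a minimal in-memory sketch, assuming whitespace-separated words; it does not use the Hadoop API, and the `run` driver is a hypothetical stand-in for what the framework does between map and reduce.

```java
import java.util.*;

// Plain-Java sketch of the slide's pseudocode; it does not use the Hadoop API.
class WordCounter {
    // map(k1, v1) -> list(k2, v2): one (word, 1) pair per word in the line.
    static List<Map.Entry<String, Integer>> map(long key, String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce(k2, list(v2)) -> v3: the sum of all counts for one word.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Tiny driver standing in for the framework: map every line,
    // group the emitted values by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        long lineNo = 1;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(lineNo++, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        grouped.forEach((word, values) -> result.put(word, reduce(word, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Java is great", "Hadoop is also great")));
        // {Java=1, is=2, great=2, Hadoop=1, also=1}
    }
}
```

The map function never looks at its key (the line number), which mirrors the pseudocode: only the line's text matters for counting words.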

Page 19: Distributed batch processing with Hadoop

Word Counter - Map

Input:

Key   Value
1     Java is great
2     Hadoop is also great

map(1, "Java is great") emits:

Key      Value
Java     1
is       1
great    1

map(2, "Hadoop is also great") emits:

Key      Value
Hadoop   1
is       1
also     1
great    1

Page 27: Distributed batch processing with Hadoop

Word Count - Group & Sort

Map output:

Key      Value
Java     1
is       1
great    1
Hadoop   1
is       1
also     1
great    1

After group:

Key      Value
Java     [1]
is       [1, 1]
great    [1, 1]
Hadoop   [1]
also     [1]

After sort:

Key      Value
also     [1]
great    [1, 1]
Hadoop   [1]
is       [1, 1]
Java     [1]

map(k1,v1) -> list(k2,v2)
reduce(k2,list(v2)) -> list(v3)
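The group and sort steps above can be sketched with a sorted map. This is an illustration, not Hadoop's shuffle implementation: the class name is hypothetical, and the case-insensitive ordering is an assumption chosen only to reproduce the order shown on the slide.

```java
import java.util.*;

// Sketch of the shuffle step: group map output by key, keys kept sorted.
// Case-insensitive ordering is an assumption made to reproduce the slide's
// order; it is not a claim about Hadoop's default key comparison.
class GroupAndSort {
    static SortedMap<String, List<Integer>> groupAndSort(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (Map.Entry<String, Integer> kv : pairs) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("Java", 1), Map.entry("is", 1), Map.entry("great", 1),
            Map.entry("Hadoop", 1), Map.entry("is", 1), Map.entry("also", 1),
            Map.entry("great", 1));
        System.out.println(groupAndSort(mapOutput));
        // {also=[1], great=[1, 1], Hadoop=[1], is=[1, 1], Java=[1]}
    }
}
```

One side effect of the case-insensitive comparator is that "Java" and "java" would be merged into a single key. A byte-wise comparison would keep them apart and would sort capitalized words first.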

Page 30: Distributed batch processing with Hadoop

Word Count - Reduce

Input (grouped and sorted):

Key      Value
also     [1]
great    [1, 1]
Hadoop   [1]
is       [1, 1]
Java     [1]

Reduce calls, one per key:

reduce("also", [1])
reduce("great", [1, 1])
reduce("Hadoop", [1])
reduce("is", [1, 1])
reduce("Java", [1])

Output:

Key      Value
also     1
great    2
Hadoop   1
is       2
Java     1

Page 38: Distributed batch processing with Hadoop

Distributed?

● Map tasks
  ○ One map task runs for each input block that is read

● Reduce tasks
  ○ Partitioning when grouping

Page 39: Distributed batch processing with Hadoop

Word Count - Partition

num partitions = 1

Map output:

Key      Value
Java     1
is       1
great    1
Hadoop   1
is       1
also     1
great    1

After group:

Key      Value
Java     [1]
is       [1, 1]
great    [1, 1]
Hadoop   [1]
also     [1]

After sort:

Key      Value
also     [1]
great    [1, 1]
Hadoop   [1]
is       [1, 1]
Java     [1]

Page 40: Distributed batch processing with Hadoop

Word Count - Partition

num partitions = 2

Map output:

Key      Value
Java     1
is       1
great    1
Hadoop   1
is       1
also     1
great    1

First partition, after group:

Key      Value
great    [1, 1]
Hadoop   [1]
also     [1]

First partition, after sort:

Key      Value
also     [1]
great    [1, 1]
Hadoop   [1]

Second partition, after group:

Key      Value
Java     [1]
is       [1, 1]

Second partition, after sort:

Key      Value
is       [1, 1]
Java     [1]
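A common way to decide which partition a key belongs to is to hash it. The sketch below assumes hash partitioning on the key, similar in spirit to the formula usually cited for Hadoop's default partitioner; the class and method names are hypothetical, and the case-insensitive sort inside each partition is again an assumption matching the slide.

```java
import java.util.*;

// Sketch of hash partitioning: every key is routed to exactly one partition,
// so all values for that key meet in the same reduce task.
class HashPartitionSketch {
    // Masking the sign bit keeps the result non-negative even when
    // hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Splits map output into numPartitions groups, each grouped by key and sorted.
    static List<SortedMap<String, List<Integer>>> partition(
            List<Map.Entry<String, Integer>> pairs, int numPartitions) {
        List<SortedMap<String, List<Integer>>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeMap<>(String.CASE_INSENSITIVE_ORDER));
        }
        for (Map.Entry<String, Integer> kv : pairs) {
            parts.get(partitionFor(kv.getKey(), numPartitions))
                 .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                 .add(kv.getValue());
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("Java", 1), Map.entry("is", 1), Map.entry("great", 1),
            Map.entry("Hadoop", 1), Map.entry("is", 1), Map.entry("also", 1),
            Map.entry("great", 1));
        // Which word lands in which partition depends on the hash values,
        // so the split may differ from the one drawn on the slide.
        System.out.println(partition(mapOutput, 2));
    }
}
```

The important property is not which partition a word ends up in, but that both occurrences of "is" (or of "great") always end up in the same one.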

Page 41: Distributed batch processing with Hadoop

Distributed?

● Map tasks
  ○ One map task runs for each input block that is read

● Reduce tasks
  ○ Partitioning when grouping
  ○ One reduce task runs for each partition
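The "one task per block" idea can be imitated on a single machine by running one map task per block concurrently. This is only a sketch of the map side: threads from Java's common pool stand in for cluster nodes, each "block" is simply a list of lines, and all names are hypothetical. Real Hadoop schedules tasks close to where the data is stored.

```java
import java.util.*;
import java.util.concurrent.CompletableFuture;

// Sketch: one map task per input block, run concurrently on local threads
// that stand in for the machines of a cluster.
class ParallelMapSketch {
    // One map task: emits (word, 1) for every word in its block of lines.
    static List<Map.Entry<String, Integer>> mapTask(List<String> block) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : block) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Starts every map task, then collects the results in block order.
    static List<Map.Entry<String, Integer>> runMapPhase(List<List<String>> blocks) {
        List<CompletableFuture<List<Map.Entry<String, Integer>>>> tasks = new ArrayList<>();
        for (List<String> block : blocks) {
            tasks.add(CompletableFuture.supplyAsync(() -> mapTask(block)));
        }
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (CompletableFuture<List<Map.Entry<String, Integer>>> task : tasks) {
            all.addAll(task.join()); // wait for this task to finish
        }
        return all;
    }

    public static void main(String[] args) {
        List<List<String>> blocks = List.of(
            List.of("Java is great"), List.of("Hadoop is also great"));
        System.out.println(runMapPhase(blocks).size()); // 7
    }
}
```

Map tasks need no coordination with each other because each one only reads its own block; the reduce side gets the same independence from partitioning.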

Page 42: Distributed batch processing with Hadoop

MapReduce

● Job Tracker
  ○ Dispatches Map & Reduce Tasks

● Task Tracker
  ○ Executes Map & Reduce Tasks

Page 43: Distributed batch processing with Hadoop

MapReduce

Example 1:
● Map
● Reduce
● Group & Partition

$> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2

$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop

Page 44: Distributed batch processing with Hadoop

MapReduce

Example 2:
● Sorting
● n-Job workflow

$> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2

$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
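An n-job workflow feeds the output of one job into the next. The sketch below shows one plausible shape for a two-job chain (count words, then sort by count) in plain Java. The job contents are assumptions for illustration and are not taken from the jug-hadoop repository; the class and method names are hypothetical.

```java
import java.util.*;

// Sketch of a two-job workflow; the job contents are assumptions and are not
// taken from the jug-hadoop repository.
class TwoJobWorkflow {
    // Job 1: word -> number of occurrences.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Job 2: consumes job 1's output and orders it by count, highest first;
    // ties are broken by word so the result is deterministic.
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Integer> job1 = countWords(List.of("Java is great", "Hadoop is also great"));
        System.out.println(sortByCount(job1));
        // [great=2, is=2, Hadoop=1, Java=1, also=1]
    }
}
```

In a real MapReduce chain the second job would typically emit the count as its key, so that the framework's own sort step does the ordering.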

Page 45: Distributed batch processing with Hadoop

Big Data

Page 46: Distributed batch processing with Hadoop

Big Data

● Too much data
  ○ Not a problem any more

● It’s just a matter of which tools to use
● New opportunity for businesses

Page 47: Distributed batch processing with Hadoop

Big Data Platform

[Diagram: data platform in three stages (Consumption, Processing, Serving); labels in the figure: DB, logs, indexes, NoSQL]

Page 48: Distributed batch processing with Hadoop

Hadoop Ecosystem

Page 49: Distributed batch processing with Hadoop

Hive

● Data Warehouse
● SQL-Like analysis system

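-- "table" is the slide's placeholder table name, with one string column "line".
-- split() returns an array, so explode() turns it into one row per word.
SELECT word, COUNT(*)
FROM table LATERAL VIEW explode(split(line, ' ')) words AS word
GROUP BY word
ORDER BY word ASC;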

● Executes MapReduce underneath!

Page 50: Distributed batch processing with Hadoop

HBase

● Based on BigTable
● Column-oriented database
● Random realtime read/write access
● Easy to bulk load from Hadoop

Page 51: Distributed batch processing with Hadoop

Hadoop Ecosystem

● ZooKeeper:
  ○ Centralized coordination system

● Pig:
  ○ Data-flow language to analyze large data sets

● Kafka:
  ○ Distributed messaging system

● Sqoop:
  ○ Transfers data between RDBMS and HDFS

● ...

Page 52: Distributed batch processing with Hadoop

Hadoop - Who’s using it?

Page 53: Distributed batch processing with Hadoop

Trovit

● What is it:
  ○ Vertical search engine.
  ○ Real estate, cars, jobs, products, vacations.

● Challenges:
  ○ Millions of documents to index
  ○ Traffic generates a huge amount of log files

Page 54: Distributed batch processing with Hadoop

Trovit

● Legacy:
  ○ Used MySQL to support document indexing
  ○ Didn’t scale!

● Batch processing:
  ○ Hadoop with a pipeline workflow
  ○ Problem solved!

● Real time processing:
  ○ Storm to improve freshness

● More challenges:
  ○ Content analysis
  ○ Traffic analysis

Page 55: Distributed batch processing with Hadoop

Questions?

Distributed batch processing with Hadoop

@ferrangali