Distributed batch processing with Hadoop


DESCRIPTION

These are the slides I used to introduce Hadoop at a Barcelona JUG (Java Users Group) meetup.

TRANSCRIPT

Distributed batch processing with Hadoop

Ferran Galí i Reniu (@ferrangali)

09/01/2014

Problem

● Too much data
  ○ 90% of all the data in the world has been generated in the last two years
  ○ Large Hadron Collider: 25 petabytes per year
  ○ Walmart: 1M transactions per hour

● Hard disks
  ○ Cheap!
  ○ Still slow access times
  ○ Writes even slower

Solutions

● Multiple hard disks
  ○ Work in parallel
  ○ We can reduce access time!
● How to deal with hardware failure?
● What if we need to combine data?

Hadoop

● Doug Cutting & Mike Cafarella

Hadoop

“MapReduce: Simplified Data Processing on Large Clusters”, by Jeffrey Dean and Sanjay Ghemawat

December 2004

Hadoop

● Doug Cutting & Mike Cafarella

● Yahoo!

Hadoop

● HDFS
  ○ Storage
● MapReduce
  ○ Processing

● Ecosystem

HDFS

● Distributed storage
  ○ Managed across a network of commodity machines
● Blocks
  ○ About 128 MB each
  ○ Suited to large data sets
● Tolerance to node failure
  ○ Data replication
● Streaming data access
  ○ Write once, read many times (batch)

HDFS

● DataNodes (workers)
  ○ Store blocks
● NameNode (master)
  ○ Maintains metadata
  ○ Knows where the blocks are located
  ○ Re-replicates blocks when a DataNode fails
  ○ Single point of failure
  ○ Secondary NameNode

HDFS

● Interfaces
  ○ Java API (see the sketch below)
  ○ Command line interface
● Load
  hadoop fs -put file.csv /user/hadoop/file.csv
● Extract
  hadoop fs -get /user/hadoop/file.csv file.csv
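As a rough illustration of the Java interface, here is a minimal sketch using Hadoop's FileSystem API. The paths mirror the commands above; the class name is just a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Load: equivalent of "hadoop fs -put file.csv /user/hadoop/file.csv"
        fs.copyFromLocalFile(new Path("file.csv"), new Path("/user/hadoop/file.csv"));

        // Extract: equivalent of "hadoop fs -get /user/hadoop/file.csv file.csv"
        fs.copyToLocalFile(new Path("/user/hadoop/file.csv"), new Path("file.csv"));

        fs.close();
    }
}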

MapReduce

● Distributed processing paradigm
  ○ Moving computation is cheaper than moving data
● Map
  ○ map(k1, v1) -> list(k2, v2)
● Reduce
  ○ reduce(k2, list(v2)) -> list(v3)

Word Counter

map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

Input:

Key  Value
1    Java is great
2    Hadoop is also great

Word Counter - Map

map(1, “Java is great”) emits:

Key    Value
Java   1
is     1
great  1

map(2, “Hadoop is also great”) emits:

Key     Value
Hadoop  1
is      1
also    1
great   1

Word Count - Group & Sort

map(k1,v1) -> list(k2,v2)    reduce(k2,list(v2)) -> list(v3)

Map output:

Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

After grouping by key:

Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]

After sorting by key:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Word Count - Reduce

reduce is called once per key with its grouped list of values:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

reduce(“also”, [1])      emits (also, 1)
reduce(“great”, [1, 1])  emits (great, 2)
reduce(“Hadoop”, [1])    emits (Hadoop, 1)
reduce(“is”, [1, 1])     emits (is, 2)
reduce(“Java”, [1])      emits (Java, 1)

Final output:

Key     Value
also    1
great   2
Hadoop  1
is      2
Java    1

Distributed?

● Map tasks
  ○ One map task runs per input block read
● Reduce tasks
  ○ Partitioning when grouping

Word Count - Partition

num partitions = 1: every key goes to the same reduce task.

Map output:

Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

Single partition, grouped and sorted:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Word Count - Partition

num partitions = 2: the keys are split across two reduce tasks.

Map output:

Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

Partition 1, grouped and sorted:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]

Partition 2, grouped and sorted:

Key     Value
is      [1, 1]
Java    [1]

Distributed?

● Map tasks
  ○ One map task runs per input block read
● Reduce tasks
  ○ Partitioning when grouping (see the partitioner sketch below)
  ○ Each partition executes a reduce task
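How keys end up in partitions: Hadoop's default HashPartitioner essentially hashes the key modulo the number of reduce tasks. A minimal sketch of an equivalent partitioner for the word count example (the class name is made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default behaviour: hash the key, make it non-negative,
// and take it modulo the number of reduce tasks (partitions).
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}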

MapReduce

● JobTracker
  ○ Dispatches map & reduce tasks
● TaskTracker
  ○ Executes map & reduce tasks

MapReduce

Example 1:
● Map
● Reduce
● Group & Partition
(a minimal Java sketch of such a job follows after the commands below)

$> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2

$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
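For reference, a minimal sketch of a word count job in the Hadoop Java API. It is illustrative only, not necessarily the code in the jug-hadoop repository; class names are made up, and the third argument sets the number of reduce tasks, matching the command above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce(k2, list(v2)) -> list(v3): sum the counts for each word
    public static class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args: <input path> <output path> <number of reduce tasks>
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(Integer.parseInt(args[2]));
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}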

MapReduce

Example 2:
● Sorting
● n-Job workflow
(a sketch of chaining two jobs follows after the commands below)

$> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2

$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
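An n-job workflow is commonly just one job's output feeding the next job's input. A hedged sketch of that pattern: the first job reuses the mapper and reducer from the previous sketch, and the second uses Hadoop's built-in InverseMapper to sort words by count. Again, this is illustrative, not the repository's actual example2:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SortedWordCount {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path counts = new Path(args[1] + "-tmp");  // intermediate directory
        Path output = new Path(args[1]);

        // Job 1: word count (mapper/reducer from the previous sketch),
        // writing (word, count) pairs as a sequence file.
        Job countJob = Job.getInstance(new Configuration(), "word count");
        countJob.setJarByClass(SortedWordCount.class);
        countJob.setMapperClass(WordCount.WordMapper.class);
        countJob.setReducerClass(WordCount.WordReducer.class);
        countJob.setOutputKeyClass(Text.class);
        countJob.setOutputValueClass(IntWritable.class);
        countJob.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(countJob, input);
        FileOutputFormat.setOutputPath(countJob, counts);
        if (!countJob.waitForCompletion(true)) System.exit(1);

        // Job 2: swap (word, count) into (count, word); the shuffle then
        // sorts by count because the count is now the key.
        Job sortJob = Job.getInstance(new Configuration(), "sort by count");
        sortJob.setJarByClass(SortedWordCount.class);
        sortJob.setMapperClass(InverseMapper.class);
        sortJob.setNumReduceTasks(1);  // single reducer -> one globally sorted file
        sortJob.setInputFormatClass(SequenceFileInputFormat.class);
        sortJob.setOutputKeyClass(IntWritable.class);
        sortJob.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(sortJob, counts);
        FileOutputFormat.setOutputPath(sortJob, output);
        System.exit(sortJob.waitForCompletion(true) ? 0 : 1);
    }
}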

Big Data

Big Data

● Too much data
  ○ Not a problem any more
● It’s just a matter of choosing the right tools
● A new opportunity for businesses

Big Data Platform

[Diagram: a Big Data platform split into Consumption, Processing and Serving stages, with data stores such as databases, logs, indexes and NoSQL systems around them]

Hadoop Ecosystem

Hive

● Data warehouse
● SQL-like analysis system

-- word count in HiveQL ('docs' is a placeholder table with one string column, 'line')
SELECT word, COUNT(*)
FROM docs
LATERAL VIEW explode(SPLIT(line, ' ')) words AS word
GROUP BY word
ORDER BY word ASC;

● Executes MapReduce underneath!

HBase

● Based on BigTable
● Column-oriented database
● Random, real-time read/write access (see the client sketch below)
● Easy to bulk load from Hadoop
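To make the random read/write point concrete, a minimal sketch with the HBase 1.x Java client; the table name and column family below are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // "wordcounts" and the "counts" column family are made-up names
             Table table = connection.getTable(TableName.valueOf("wordcounts"))) {

            // Random write: store one cell keyed by the word
            Put put = new Put(Bytes.toBytes("hadoop"));
            put.addColumn(Bytes.toBytes("counts"), Bytes.toBytes("total"), Bytes.toBytes("1"));
            table.put(put);

            // Random read: fetch the row back by key
            Result result = table.get(new Get(Bytes.toBytes("hadoop")));
            byte[] value = result.getValue(Bytes.toBytes("counts"), Bytes.toBytes("total"));
            System.out.println(Bytes.toString(value));
        }
    }
}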

Hadoop Ecosystem

● ZooKeeper
  ○ Centralized coordination system
● Pig
  ○ Data-flow language to analyze large data sets
● Kafka
  ○ Distributed messaging system
● Sqoop
  ○ Transfers data between RDBMSs and HDFS

● ...

Hadoop - Who’s using it?

Trovit

● What is it?
  ○ A vertical search engine
  ○ Real estate, cars, jobs, products, vacations
● Challenges
  ○ Millions of documents to index
  ○ Traffic generates a huge amount of log files

Trovit

● Legacy
  ○ Used MySQL to support document indexing
  ○ Didn’t scale!
● Batch processing
  ○ Hadoop with a pipeline workflow
  ○ Problem solved!
● Real-time processing
  ○ Storm to improve freshness
● More challenges
  ○ Content analysis
  ○ Traffic analysis

Questions?

Distributed batch processing with Hadoop

@ferrangali
