Distributed batch processing with Hadoop
DESCRIPTION
These are the slides I used to introduce Hadoop at a meetup of the Barcelona JUG (Java Users Group).
TRANSCRIPT
Distributed batch processing with Hadoop
Ferran Galí i Reniu
@ferrangali
09/01/2014
Ferran Galí i Reniu
● UPC - FIB
● Trovit
Problem
● Too much data
  ○ 90% of all the data in the world has been generated in the last two years
  ○ Large Hadron Collider: 25 petabytes per year
  ○ Walmart: 1M transactions per hour
● Hard disks
  ○ Cheap!
  ○ Still slow access time
  ○ Writes are even slower
Solutions
● Multiple hard disks
  ○ Work in parallel
  ○ We can reduce access time!
● How to deal with hardware failure?
● What if we need to combine data?
Hadoop
● Doug Cutting & Mike Cafarella
Hadoop
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
October 2003
Hadoop
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
December 2004
Hadoop
● Doug Cutting & Mike Cafarella
● Yahoo!
Hadoop
● HDFS
  ○ Storage
● MapReduce
  ○ Processing
● Ecosystem
HDFS
● Distributed storage
  ○ Managed across a network of commodity machines
● Blocks
  ○ About 128 MB
  ○ Large data sets
● Tolerance to node failure
  ○ Data replication
● Streaming data access
  ○ Read many times
  ○ Write once (batch)
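To make the block and replication bullets concrete, here is a minimal arithmetic sketch. It assumes the 128 MB block size quoted on the slide and a replication factor of 3; the class and method names are hypothetical and are not part of any Hadoop API.

```java
public class BlockMath {
    // 128 MB block size, as quoted on the slide (assumption: binary megabytes).
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // Number of blocks needed to hold a file of the given size (ceiling division).
    static long blocksFor(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Total block replicas stored across the cluster for a given replication factor.
    static long replicasFor(long fileSizeBytes, int replication) {
        return blocksFor(fileSizeBytes) * replication;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blocksFor(oneGiB));      // 8
        System.out.println(replicasFor(oneGiB, 3)); // 24
    }
}
```

Under these assumptions a 1 GiB file is stored as 8 blocks, and losing one node still leaves two copies of every block it held.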
HDFS
● DataNodes (Workers)
  ○ Store blocks
● NameNode (Master)
  ○ Maintains metadata
  ○ Knows where the blocks are located
  ○ Makes DataNodes fault tolerant
  ○ Single point of failure
  ○ Secondary NameNode
HDFS
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
HDFS
● Interfaces
  ○ Java
  ○ Command line interface
● Load
hadoop fs -put file.csv /user/hadoop/file.csv
● Extract
hadoop fs -get /user/hadoop/file.csv file.csv
MapReduce
● Distributed processing paradigm
  ○ Moving computation is cheaper than moving data
● Map
  ○ Map(k1,v1) -> list(k2,v2)
● Reduce
  ○ Reduce(k2,list(v2)) -> list(v3)
Word Counter
map (Long key, String value)
  for each(String word in value)
    emit(word, 1);
reduce (String word, List values)
  emit(word, sum(values));
Java is great
Hadoop is also great
Word Counter - Map
Key  Value
1    Java is great
2    Hadoop is also great
map(1, “Java is great”)
Key    Value
Java   1
is     1
great  1
map(2, “Hadoop is also great”)
Key     Value
Hadoop  1
is      1
also    1
great   1
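The map step walked through above can be sketched in plain Java. This is a self-contained illustration and not the Hadoop Mapper API; the class name and the use of a list to stand in for emit() are assumptions made for the example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountMap {
    // Mirrors the slide's map(Long key, String value): the key (line number)
    // is ignored and one (word, 1) pair is produced per whitespace-separated word.
    static List<Map.Entry<String, Integer>> map(long key, String value) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(new SimpleEntry<>(word, 1));
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.addAll(map(1, "Java is great"));
        out.addAll(map(2, "Hadoop is also great"));
        // Prints the seven (word, 1) pairs in emission order.
        for (Map.Entry<String, Integer> pair : out) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```

Note that the map function never sums anything: duplicate words such as "is" and "great" are emitted once per occurrence and only combined later.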
Word Count - Group & Sort
map(k1,v1) -> list(k2,v2) reduce(k2,list(v2)) -> list(v3)
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1
group
Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]
sort
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
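A rough in-memory sketch of the group and sort steps follows. It is an illustration only: the class name is hypothetical, and the case-insensitive ordering is chosen to reproduce the order shown on the slide. Hadoop's Text keys are compared by their raw bytes, so on a real cluster the capitalized words would be expected to sort before the lowercase ones.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupAndSort {
    // The seven (word, 1) pairs produced by the map step in the walkthrough.
    static List<Map.Entry<String, Integer>> sampleMapOutput() {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : Arrays.asList("Java", "is", "great", "Hadoop", "is", "also", "great")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Collects the values of equal keys into one list and keeps the keys sorted.
    // Ordering: case-insensitive first, natural order as tie-breaker, so that
    // "Java" and "java" would remain distinct keys.
    static TreeMap<String, List<Integer>> groupAndSort(List<Map.Entry<String, Integer>> pairs) {
        Comparator<String> order =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());
        TreeMap<String, List<Integer>> grouped = new TreeMap<>(order);
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // {also=[1], great=[1, 1], Hadoop=[1], is=[1, 1], Java=[1]}
        System.out.println(groupAndSort(sampleMapOutput()));
    }
}
```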
Word Count - Reduce
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
reduce(“also”, [1])
reduce(“great”, [1, 1])
reduce(“Hadoop”, [1])
reduce(“is”, [1, 1])
reduce(“Java”, [1])
Key     Value
also    1
great   2
Hadoop  1
is      2
Java    1
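The reduce step can be sketched the same way. This is a plain-Java illustration of the slide's pseudocode and not the Hadoop Reducer API; the class name and the reduceAll helper are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountReduce {
    // Mirrors the slide's reduce(String word, List values): sums the grouped counts.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Applies reduce to every (key, values) group, preserving the input key order.
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Grouped and sorted input, as shown in the walkthrough.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        grouped.put("also", Arrays.asList(1));
        grouped.put("great", Arrays.asList(1, 1));
        grouped.put("Hadoop", Arrays.asList(1));
        grouped.put("is", Arrays.asList(1, 1));
        grouped.put("Java", Arrays.asList(1));
        System.out.println(reduceAll(grouped)); // {also=1, great=2, Hadoop=1, is=2, Java=1}
    }
}
```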
Distributed?
● Map tasks
  ○ Each block read is processed by its own map task
● Reduce tasks
  ○ Partitioning when grouping
Word Count - Partition
num partitions = 1
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1
group
Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]
sort
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]
Word Count - Partition
num partitions = 2
Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1
group
Key     Value
great   [1, 1]
Hadoop  [1]
also    [1]
sort
Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
group
Key     Value
Java    [1]
is      [1, 1]
sort
Key     Value
is      [1, 1]
Java    [1]
Distributed?
● Map tasks
  ○ Each block read is processed by its own map task
● Reduce tasks
  ○ Partitioning when grouping
  ○ Each partition executes a reduce task
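One common way to assign keys to partitions is to hash the key and take the result modulo the number of partitions, which is the idea behind hash-based partitioning in Hadoop. The sketch below is an assumption-laden illustration: it uses String.hashCode() and a hypothetical class name, so the resulting split will not necessarily match the one drawn on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HashPartition {
    // Non-negative hash modulo the number of partitions.
    // The same key always lands in the same partition, so all its values
    // end up in the same reduce task.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Distributes the keys into numPartitions buckets.
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, numPartitions)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Java", "is", "great", "Hadoop", "is", "also", "great");
        List<List<String>> partitions = partition(keys, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i));
        }
    }
}
```

With a single partition every key maps to partition 0, which corresponds to the one-reducer case shown earlier.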
MapReduce
● Job Tracker
  ○ Dispatches Map & Reduce Tasks
● Task Tracker
  ○ Executes Map & Reduce Tasks
MapReduce
Example 1:
● Map
● Reduce
● Group & Partition
$> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*
http://github.com/ferrangali/jug-hadoop
MapReduce
Example 2:
● Sorting
● n-Job workflow
$> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2
$> hadoop fs -text /user/hadoop/output/part-r-*
http://github.com/ferrangali/jug-hadoop
Big Data
Big Data
● Too much data
  ○ Not a problem any more
● It’s just a matter of which tools to use
● New opportunity for businesses
Big Data Platform
[Diagram: Consumption -> Processing -> Serving, with DB, logs, indexes, DB and NoSQL as data stores]
Hadoop Ecosystem
Hive
● Data Warehouse
● SQL-like analysis system
SELECT word, COUNT(*)
FROM (SELECT EXPLODE(SPLIT(line, ' ')) AS word FROM table) words
GROUP BY word
ORDER BY word ASC;
● Executes MapReduce underneath!
HBase
● Based on BigTable
● Column-oriented database
● Random realtime read/write access
● Easy to bulk load from Hadoop
Hadoop Ecosystem
● ZooKeeper:
  ○ Centralized coordination system
● Pig:
  ○ Data-flow language to analyze large data sets
● Kafka:
  ○ Distributed messaging system
● Sqoop:
  ○ Transfers data between RDBMS and HDFS
● ...
Hadoop - Who’s using it?
Trovit
● What is it:
  ○ Vertical search engine.
  ○ Real estate, cars, jobs, products, vacations.
● Challenges:
  ○ Millions of documents to index
  ○ Traffic generates a huge amount of log files
Trovit
● Legacy:
  ○ Used MySQL to support document indexing
  ○ Didn’t scale!
● Batch processing:
  ○ Hadoop with a pipeline workflow
  ○ Problem solved!
● Real time processing:
  ○ Storm to improve freshness
● More challenges:
  ○ Content analysis
  ○ Traffic analysis
Questions?
Distributed batch processing with Hadoop
@ferrangali