Distributed batch processing with Hadoop


DESCRIPTION

These are the slides I used to introduce Hadoop at a Barcelona JUG (Java Users Group) meetup.

TRANSCRIPT

Distributed batch processing with Hadoop

Ferran Galí i Reniu (@ferrangali)

09/01/2014

Problem

● Too much data
  ○ 90% of all the data in the world has been generated in the last two years
  ○ Large Hadron Collider: 25 petabytes per year
  ○ Walmart: 1M transactions per hour

● Hard disks
  ○ Cheap!
  ○ Still slow access times
  ○ Writes even slower

Solutions

● Multiple hard disks
  ○ Work in parallel
  ○ We can reduce access time!
● How to deal with hardware failure?
● What if we need to combine data?

Hadoop

● Doug Cutting & Mike Cafarella

Hadoop

“MapReduce: Simplified Data Processing on Large Clusters”, by Jeffrey Dean and Sanjay Ghemawat

December 2004

Hadoop

● Doug Cutting & Mike Cafarella

● Yahoo!

Hadoop

● HDFS
  ○ Storage
● MapReduce
  ○ Processing

● Ecosystem

HDFS

● Distributed storage
  ○ Managed across a network of commodity machines
● Blocks
  ○ About 128 MB each
  ○ Suited to large data sets
● Tolerance to node failure
  ○ Data replication
● Streaming data access
  ○ Write once, read many times (batch)

HDFS

● DataNodes (workers)
  ○ Store blocks
● NameNode (master)
  ○ Maintains metadata
  ○ Knows where the blocks are located
  ○ Re-replicates blocks when a DataNode fails
  ○ Single point of failure
  ○ Secondary NameNode

HDFS

● Interfaces
  ○ Java API (see the sketch below)
  ○ Command line interface
● Load
  hadoop fs -put file.csv /user/hadoop/file.csv
● Extract
  hadoop fs -get /user/hadoop/file.csv file.csv
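As a rough illustration of the Java interface, here is a minimal sketch using Hadoop's FileSystem API. The paths mirror the commands above; the class name is just a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Load: equivalent of "hadoop fs -put file.csv /user/hadoop/file.csv"
        fs.copyFromLocalFile(new Path("file.csv"), new Path("/user/hadoop/file.csv"));

        // Extract: equivalent of "hadoop fs -get /user/hadoop/file.csv file.csv"
        fs.copyToLocalFile(new Path("/user/hadoop/file.csv"), new Path("file.csv"));

        fs.close();
    }
}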

MapReduce

● Distributed processing paradigm
  ○ Moving computation is cheaper than moving data
● Map
  ○ map(k1, v1) -> list(k2, v2)
● Reduce
  ○ reduce(k2, list(v2)) -> list(v3)

Word Counter

map (Long key, String value)
    for each (String word in value)
        emit(word, 1);

reduce (String word, List values)
    emit(word, sum(values));

Input:

Key  Value
1    Java is great
2    Hadoop is also great

Word Counter - Map

map(1, “Java is great”) emits:

Key    Value
Java   1
is     1
great  1

map(2, “Hadoop is also great”) emits:

Key     Value
Hadoop  1
is      1
also    1
great   1

Word Count - Group & Sort

map(k1,v1) -> list(k2,v2)    reduce(k2,list(v2)) -> list(v3)

Map output:

Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

After grouping by key:

Key     Value
Java    [1]
is      [1, 1]
great   [1, 1]
Hadoop  [1]
also    [1]

After sorting by key:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Word Count - Reduce

reduce is called once per key with its grouped list of values:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

reduce(“also”, [1])      emits (also, 1)
reduce(“great”, [1, 1])  emits (great, 2)
reduce(“Hadoop”, [1])    emits (Hadoop, 1)
reduce(“is”, [1, 1])     emits (is, 2)
reduce(“Java”, [1])      emits (Java, 1)

Final output:

Key     Value
also    1
great   2
Hadoop  1
is      2
Java    1

Distributed?

● Map tasks
  ○ One map task runs per input block read
● Reduce tasks
  ○ Partitioning when grouping

Word Count - Partition

num partitions = 1: every key goes to the same reduce task.

Map output:

Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

Single partition, grouped and sorted:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]
is      [1, 1]
Java    [1]

Word Count - Partition

num partitions = 2: the keys are split across two reduce tasks.

Map output:

Key     Value
Java    1
is      1
great   1
Hadoop  1
is      1
also    1
great   1

Partition 1, grouped and sorted:

Key     Value
also    [1]
great   [1, 1]
Hadoop  [1]

Partition 2, grouped and sorted:

Key     Value
is      [1, 1]
Java    [1]

Distributed?

● Map tasks
  ○ One map task runs per input block read
● Reduce tasks
  ○ Partitioning when grouping (see the partitioner sketch below)
  ○ Each partition executes a reduce task
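How keys end up in partitions: Hadoop's default HashPartitioner essentially hashes the key modulo the number of reduce tasks. A minimal sketch of an equivalent partitioner for the word count example (the class name is made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default behaviour: hash the key, make it non-negative,
// and take it modulo the number of reduce tasks (partitions).
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}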

MapReduce

● JobTracker
  ○ Dispatches map & reduce tasks
● TaskTracker
  ○ Executes map & reduce tasks

MapReduce

Example 1:
● Map
● Reduce
● Group & Partition
(a minimal Java sketch of such a job follows after the commands below)

$> hadoop jar jug-hadoop.jar example1 /user/hadoop/input.txt /user/hadoop/output 2

$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
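For reference, a minimal sketch of a word count job in the Hadoop Java API. It is illustrative only, not necessarily the code in the jug-hadoop repository; class names are made up, and the third argument sets the number of reduce tasks, matching the command above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line
    public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce(k2, list(v2)) -> list(v3): sum the counts for each word
    public static class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args: <input path> <output path> <number of reduce tasks>
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(Integer.parseInt(args[2]));
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}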

MapReduce

Example 2:
● Sorting
● n-Job workflow
(a sketch of chaining two jobs follows after the commands below)

$> hadoop jar jug-hadoop.jar example2 /user/hadoop/input.txt /user/hadoop/output 2

$> hadoop fs -text /user/hadoop/output/part-r-*

http://github.com/ferrangali/jug-hadoop
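An n-job workflow is commonly just one job's output feeding the next job's input. A hedged sketch of that pattern: the first job reuses the mapper and reducer from the previous sketch, and the second uses Hadoop's built-in InverseMapper to sort words by count. Again, this is illustrative, not the repository's actual example2:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SortedWordCount {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path counts = new Path(args[1] + "-tmp");  // intermediate directory
        Path output = new Path(args[1]);

        // Job 1: word count (mapper/reducer from the previous sketch),
        // writing (word, count) pairs as a sequence file.
        Job countJob = Job.getInstance(new Configuration(), "word count");
        countJob.setJarByClass(SortedWordCount.class);
        countJob.setMapperClass(WordCount.WordMapper.class);
        countJob.setReducerClass(WordCount.WordReducer.class);
        countJob.setOutputKeyClass(Text.class);
        countJob.setOutputValueClass(IntWritable.class);
        countJob.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(countJob, input);
        FileOutputFormat.setOutputPath(countJob, counts);
        if (!countJob.waitForCompletion(true)) System.exit(1);

        // Job 2: swap (word, count) into (count, word); the shuffle then
        // sorts by count because the count is now the key.
        Job sortJob = Job.getInstance(new Configuration(), "sort by count");
        sortJob.setJarByClass(SortedWordCount.class);
        sortJob.setMapperClass(InverseMapper.class);
        sortJob.setNumReduceTasks(1);  // single reducer -> one globally sorted file
        sortJob.setInputFormatClass(SequenceFileInputFormat.class);
        sortJob.setOutputKeyClass(IntWritable.class);
        sortJob.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(sortJob, counts);
        FileOutputFormat.setOutputPath(sortJob, output);
        System.exit(sortJob.waitForCompletion(true) ? 0 : 1);
    }
}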

Big Data

Big Data

● Too much data
  ○ Not a problem any more
● It’s just a matter of choosing the right tools
● A new opportunity for businesses

Big Data Platform

[Diagram: a Big Data platform split into Consumption, Processing and Serving stages, with data stores such as databases, logs, indexes and NoSQL systems around them]

Hadoop Ecosystem

Hive

● Data warehouse
● SQL-like analysis system

-- word count in HiveQL ('docs' is a placeholder table with one string column, 'line')
SELECT word, COUNT(*)
FROM docs
LATERAL VIEW explode(SPLIT(line, ' ')) words AS word
GROUP BY word
ORDER BY word ASC;

● Executes MapReduce underneath!

HBase

● Based on BigTable
● Column-oriented database
● Random, real-time read/write access (see the client sketch below)
● Easy to bulk load from Hadoop
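To make the random read/write point concrete, a minimal sketch with the HBase 1.x Java client; the table name and column family below are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // "wordcounts" and the "counts" column family are made-up names
             Table table = connection.getTable(TableName.valueOf("wordcounts"))) {

            // Random write: store one cell keyed by the word
            Put put = new Put(Bytes.toBytes("hadoop"));
            put.addColumn(Bytes.toBytes("counts"), Bytes.toBytes("total"), Bytes.toBytes("1"));
            table.put(put);

            // Random read: fetch the row back by key
            Result result = table.get(new Get(Bytes.toBytes("hadoop")));
            byte[] value = result.getValue(Bytes.toBytes("counts"), Bytes.toBytes("total"));
            System.out.println(Bytes.toString(value));
        }
    }
}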

Hadoop Ecosystem

● ZooKeeper
  ○ Centralized coordination system
● Pig
  ○ Data-flow language to analyze large data sets
● Kafka
  ○ Distributed messaging system
● Sqoop
  ○ Transfers data between RDBMSs and HDFS

● ...

Hadoop - Who’s using it?

Trovit

● What is it?
  ○ A vertical search engine
  ○ Real estate, cars, jobs, products, vacations
● Challenges
  ○ Millions of documents to index
  ○ Traffic generates a huge amount of log files

Trovit

● Legacy
  ○ Used MySQL to support document indexing
  ○ Didn’t scale!
● Batch processing
  ○ Hadoop with a pipeline workflow
  ○ Problem solved!
● Real-time processing
  ○ Storm to improve freshness
● More challenges
  ○ Content analysis
  ○ Traffic analysis

Questions?

Distributed batch processing with Hadoop

@ferrangali
