Hadoop and Distributed Computing

Federico Cargnelutti / BSkyB

Hadoop & Distributed Computing

Distributed computing uses software to divide pieces of a program among several computers.

One project in particular has proven that the concept works extremely well.

SETI@Home: Search for Extra-Terrestrial Intelligence

•  Prove the viability of the distributed grid computing concept (succeeded)

•  Detect intelligent life outside Earth (failed)

What problem are we trying to solve?

Distributed Computing

Counts of all the distinct words

•  in a file?
•  in a directory?
•  on the Web?

We need to process 100TB datasets

•  On 1 node: scanning @ 50MB/s = 23 days

•  On a 1000-node cluster: scanning @ 50MB/s per node = 33 min
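The two figures above can be checked with a quick calculation. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a dataset that splits evenly across nodes with no coordination overhead; the function name `scan_seconds` is illustrative only.

```python
# Back-of-the-envelope scan times for a 100TB dataset at 50MB/s per node.
TB = 10**12  # assumption: decimal terabyte, in bytes
MB = 10**6   # assumption: decimal megabyte, in bytes


def scan_seconds(dataset_bytes, rate_bytes_per_s, nodes=1):
    """Seconds to scan a dataset split evenly across `nodes` readers."""
    return dataset_bytes / (rate_bytes_per_s * nodes)


single = scan_seconds(100 * TB, 50 * MB)
cluster = scan_seconds(100 * TB, 50 * MB, nodes=1000)

print(f"1 node:     {single / 86400:.1f} days")    # about 23 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")   # about 33 minutes
```

Real clusters fall short of this ideal linear speedup, since scheduling, stragglers and network transfer all add time.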

We need a framework for distribution

We need a new paradigm

Hadoop is an open-source Java framework for running applications on large clusters of commodity hardware.

Scalable: Hadoop can reliably store and process petabytes of data.

Economical: Hadoop distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.

Efficient: Hadoop can process the distributed data in parallel on the nodes where the data is located.

Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Hadoop Components

Hadoop Distributed File System (HDFS)
•  Java, Shell, C and HTTP APIs

Hadoop MapReduce
•  Java and Streaming APIs

Hadoop on Demand
•  Tools to manage dynamic setup and teardown of Hadoop nodes

HBase: Table storage on top of HDFS, modeled after Google's Bigtable

Pig: Language for dataflow programming

Hive: SQL interface to structured data stored in HDFS

Other Tools

•  Mappers and Reducers are allocated
•  Code is shipped to nodes
•  Mappers and Reducers are run on the same machines as DataNodes
•  Two major daemons: JobTracker and TaskTracker

Hadoop MapReduce

JobTracker

•  Long-lived master daemon which distributes tasks
•  Maintains a history of job execution statistics

TaskTrackers

•  Long-lived client daemon which executes Map and Reduce tasks

Hadoop MapReduce

•  Set up a multi-node Hadoop cluster using the Hadoop Distributed File System (HDFS)
•  Create a hierarchical HDFS with directories and files
•  Use the Hadoop API to store a large text file
•  Create a MapReduce application

Hadoop MapReduce

•  Mapper takes an input key/value pair
•  Does something to its input
•  Emits intermediate key/value pairs
•  One call per input record
•  Fully data-parallel
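For the word-count problem, a mapper can be sketched in a few lines. This is plain Python for illustration and not the Hadoop Java API; the function name `map_words`, the byte-offset key and the sample line are all assumptions.

```python
import re


def map_words(key, line):
    """Mapper: called once per input record, emits one (word, 1) pair per word."""
    # `key` is unused here; for text input it would typically be the line's offset.
    for word in re.findall(r"[A-Za-z]+", line):
        yield (word, 1)


# Hypothetical input record: (byte offset, line of text).
pairs = list(map_words(0, "in culpa qui in officia"))
print(pairs)
# [('in', 1), ('culpa', 1), ('qui', 1), ('in', 1), ('officia', 1)]
```

Because each call depends only on its own record, any number of mappers can process different parts of the input at the same time.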

Map

[Diagram: Map tasks emitting intermediate pairs (in, 1), (in, 1), (sunt, 1), (in, 1), (elit, 1), (sed, 1), (eiusmod, 1)]

•  Input is the list of all intermediate values for a given key
•  Reducer aggregates the list of intermediate values
•  Returns a final key/value pair for output
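The grouping and reduce steps can be sketched the same way. This is plain Python rather than the Hadoop API; `shuffle` and `reduce_counts` are hypothetical names, and the intermediate pairs are the ones shown as Map output on the earlier slide.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reducer: aggregates all values for one key into a final (key, total) pair."""
    return (key, sum(values))


intermediate = [("in", 1), ("in", 1), ("sunt", 1), ("in", 1),
                ("elit", 1), ("sed", 1), ("eiusmod", 1)]

results = [reduce_counts(k, v) for k, v in sorted(shuffle(intermediate).items())]
print(results)
# [('eiusmod', 1), ('elit', 1), ('in', 3), ('sed', 1), ('sunt', 1)]
```

The three separate (in, 1) pairs collapse into a single (in, 3).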

Reduce

[Diagram: Reduce tasks emitting final pairs (irure, 1), (in, 3), (ea, 1), (enim, 1), (eu, 1), (Duis, 1), (dolore, 2)]

Adobe - Use for data storage and processing - 30 nodes

Facebook - Use for reporting and analytics - 320 nodes

FOX - Use for log analysis and data mining - 140 nodes

Last.fm - Use for chart calculation and log analysis - 27 nodes

New York Times - Use for large scale image conversion - 100 nodes

Yahoo! - Use for Ad systems and Web search - 10,000 nodes

Who is using it?

•  Video and image processing
•  Log analysis
•  Spam/bot analysis
•  Behavioral analytics (CRM)
•  Sequential pattern analysis (e.g. understanding long-term customer buying behavior for cross-selling and target marketing)

Use Cases

Commodity servers

•  1 RU
•  2 x 4 core CPU
•  4-8GB of RAM using ECC memory
•  4 x 1TB SATA drives
•  1-5TB external storage

Typically arranged in a 2-level architecture

•  30-40 nodes per rack

Recommended Hardware
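As a rough guide to what this hardware provides, the sketch below estimates storage per rack from the slide's figures (4 x 1TB drives per node, 30 to 40 nodes per rack). The replication factor of 3 is an assumption (it is the HDFS default, which a cluster may override), and the estimate ignores space reserved for the operating system and intermediate data.

```python
DRIVES_PER_NODE = 4  # from the slide: 4 x 1TB SATA drives
TB_PER_DRIVE = 1
REPLICATION = 3      # assumption: HDFS default replication factor


def rack_capacity_tb(nodes_per_rack):
    """Return (raw TB, TB available after replication) for one rack."""
    raw = nodes_per_rack * DRIVES_PER_NODE * TB_PER_DRIVE
    return raw, raw / REPLICATION


for nodes in (30, 40):
    raw, usable = rack_capacity_tb(nodes)
    print(f"{nodes} nodes: {raw} TB raw, about {usable:.0f} TB after replication")
```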

•  No version and dependency management.

•  Configuration: more than 150 parameters.

•  No security against accidents. User identification was added after Last.fm deleted a filesystem by accident.

•  HDFS is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file.

•  Steep learning curve. According to Facebook, using Hadoop was not easy for end users, especially for the ones who were not familiar with MapReduce.

Challenges

Images: http://www.flickr.com/photos/labguest/3509303134 http://www.flickr.com/photos/tantrum_dan/3546852841

Questions?
