CS525: Big Data Analytics. MapReduce Computing Paradigm & Apache Hadoop Open Source. Fall 2013. Elke A. Rundensteiner.


Page 1:

CS525: Big Data Analytics

MapReduce Computing Paradigm & Apache Hadoop Open Source

Fall 2013

Elke A. Rundensteiner

Page 2:

Large-Scale Data Analytics

Database systems vs. the Hadoop/MapReduce computing paradigm:

• Databases offer:
– Performance (indexing, tuning, data organization techniques)
– Advanced features: full query support, clever optimizers, views and security, data consistency, ...
– Focus on read + write workloads, concurrency, correctness, convenience, high-level access

• Hadoop offers:
– Scalability (petabytes of data, thousands of machines)
– Flexibility in accepting all data formats (no schema)
– Commodity, inexpensive hardware
– Efficient fault tolerance support

Many enterprises turn to the Hadoop computing paradigm for big data applications.

Page 3:

What is Hadoop?

• Hadoop is a software framework for distributed processing of large datasets across large clusters of commodity-hardware computers:
– Large datasets: terabytes or petabytes of data
– Large clusters: hundreds or thousands of nodes

• Open-source implementation of Google's MapReduce
• Simple programming model: MapReduce
• Simple data model: flexible enough for any data

Page 4:

Hadoop Framework

• Two main layers:
– Distributed file system (HDFS)
– Execution engine (MapReduce)

Hadoop is designed as a master-slave, shared-nothing architecture.

Page 5:

Key Ideas of Hadoop

• Automatic parallelization & distribution
– Hidden from the end-user

• Fault tolerance and automatic recovery
– Failed nodes/tasks recover automatically

• Simple programming abstraction
– Users provide two functions, "map" and "reduce"

Page 6:

Who Uses Hadoop?

• Google: Invented the MapReduce computing paradigm
• Yahoo: Developed Hadoop, the open-source implementation of MapReduce
• Integrators: IBM, Microsoft, Oracle, Greenplum
• Adopters: Facebook, Amazon, AOL, Netflix, LinkedIn
• Many others ...

Page 7:

Hadoop Distributed File System (HDFS)

• Centralized namenode
– Maintains metadata about files

• Many datanodes (1000s)
– Store the actual data
– Files are divided into blocks
– Each block is replicated N times (default N = 3)

[Figure: file F divided into five blocks of 64 MB each]
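
As a concrete illustration of the block and replica layout, the sketch below (not from the slides) uses Hadoop's Java FileSystem API to print each block of a file and the datanodes holding its replicas. The path /data/fileF.txt is a hypothetical stand-in for file F.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/fileF.txt");    // hypothetical file F
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block; each lists the datanodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " replicas on " + java.util.Arrays.toString(b.getHosts()));
        }
    }
}
```

With default settings, a 320 MB file would print five 64 MB blocks, each listing three hosts.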

Page 8:

HDFS File System Properties

• Large Space: An HDFS instance may consist of thousands of server machines for storage

• Replication: Each data block is replicated (see the sketch after this list)
• Failure: Failure is the norm rather than the exception

• Fault Tolerance: Automated detection of faults and recovery
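
To tie replication to the API: a minimal sketch (an assumption, not shown in the slides; the path is hypothetical) that raises the replica count of a single file. The namenode then re-replicates that file's blocks in the background.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Request 5 replicas for this (hypothetical) file instead of the default 3.
        boolean accepted = fs.setReplication(new Path("/data/important.log"), (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}
```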

Page 9:

MapReduce Execution Engine (Example: Color Count)

[Figure: Color Count data flow]
• Input blocks on HDFS feed the map tasks
• Map produces (k, v) pairs, e.g., (color, 1)
• Shuffle & sorting groups the pairs by key k
• Reduce consumes (k, [v]) pairs, e.g., (color, [1,1,1,1,1,1, ...])
• Reduce produces (k', v') pairs, e.g., (color, 100)

Users only provide the “Map” and “Reduce” functions
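
The slides show the data flow but no code; a minimal Java sketch of the two user-provided functions for Color Count could look as follows, assuming each input record (one line) holds a single color name. Class and field names here are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ColorCount {

    // Map: consumes (byte offset, line), produces (color, 1).
    public static class ColorMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumption: each line is exactly one color name.
            ctx.write(new Text(line.toString().trim()), ONE);
        }
    }

    // Reduce: consumes (color, [1,1,1,...]) after shuffle/sort, produces (color, total).
    public static class ColorReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text color, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(color, new IntWritable(sum));
        }
    }
}
```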

Page 10:

MapReduce Engine

• Job Tracker is the master node (runs with the namenode)
– Receives the user's job
– Decides how many tasks will run (number of mappers)
– Decides where to run each mapper (locality)

• Example: this file has 5 blocks, so run 5 map tasks
• Run the task reading block "1" on Node 1 or Node 3

[Figure: replicas of file F's blocks placed across Node 1, Node 2, and Node 3]

Page 11:

MapReduce Engine

• Task Tracker is the slave node (runs on each datanode)
– Receives the task from the Job Tracker
– Runs the task to completion (either a map or a reduce task)
– Communicates with the Job Tracker to report its progress

Example: one MapReduce job consisting of 4 map tasks and 3 reduce tasks (see the driver sketch below).
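
A sketch of the driver that would submit such a job, reusing the illustrative ColorCount classes from the earlier sketch: the developer sets the number of reduce tasks explicitly (3 here, matching the example), while the number of map tasks is derived from the input splits and is not set by hand. This follows the Hadoop 1.x (JobTracker/TaskTracker) API current in Fall 2013.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "color count");
        job.setJarByClass(ColorCountDriver.class);
        job.setMapperClass(ColorCount.ColorMapper.class);
        job.setReducerClass(ColorCount.ColorReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Chosen by the developer: 3 reduce tasks, as in the example above.
        job.setNumReduceTasks(3);
        // NOT set here: the number of map tasks, which the Job Tracker derives
        // from the input splits (one per HDFS block by default).

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```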

Page 12:

About Key-Value Pairs

• The developer provides the Mapper and Reducer functions
• The developer decides what is the key and what is the value
• The developer must follow the key-value pair interface

• Mappers:
– Consume <key, value> pairs
– Produce <key, value> pairs

• Shuffling and Sorting:
– Groups all similar keys from all mappers
– Sorts them and passes them to a certain reducer
– In the form of <key, <list of values>>

• Reducers:
– Consume <key, <list of values>>
– Produce <key, value>
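
In Hadoop's Java API, this key-value pair interface shows up as the generic type parameters of the Mapper and Reducer base classes. The stub classes below are illustrative only:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
// consumes (KEYIN, VALUEIN) pairs, produces (KEYOUT, VALUEOUT) pairs.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
// consumes (KEYIN, Iterable<VALUEIN>) after shuffle/sort,
// produces (KEYOUT, VALUEOUT) pairs.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }
```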

Page 13:

MapReduce Phases

Page 14:

Another Example: Word Count

• Job: Count occurrences of each word in a data set

[Figure: word count data flow through map tasks and reduce tasks]
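
A minimal sketch of the word count mapper (illustrative, not taken from the slides). Unlike Color Count, one input line can emit many (word, 1) pairs; the reducer is the same summing reducer as in the Color Count sketch.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for every token.
        StringTokenizer tok = new StringTokenizer(line.toString());
        while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            ctx.write(word, ONE);
        }
    }
}
```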

Page 15:

Summary: Hadoop vs. Typical DB

Computing Model
– Distributed DBs: notion of transactions; a transaction is the unit of work; ACID properties, concurrency control
– Hadoop: notion of jobs; a job is the unit of work; no concurrency control

Data Model
– Distributed DBs: structured data with a known schema; read/write mode
– Hadoop: any data format; read-only mode

Cost Model
– Distributed DBs: expensive servers
– Hadoop: cheap commodity machines

Fault Tolerance
– Distributed DBs: failures are rare; recovery mechanisms
– Hadoop: failures are common over thousands of machines; simple fault tolerance

Key Characteristics
– Distributed DBs: efficiency, powerful optimizations
– Hadoop: scalability, flexibility, fault tolerance