CS525: Big Data Analytics. MapReduce Computing Paradigm & Apache Hadoop Open Source. Fall 2013. Elke A. Rundensteiner.


Page 1:

CS525: Big Data Analytics

MapReduce Computing Paradigm & Apache Hadoop Open Source

Fall 2013

Elke A. Rundensteiner

Page 2:

Large-Scale Data Analytics

Database systems vs. the Hadoop/MapReduce computing paradigm:

• Databases offer:
– Performance (indexing, tuning, data organization techniques)
– Advanced features: full query support, clever optimizers, views and security, data consistency, ...
– Focus on read + write workloads, concurrency, correctness, convenience, high-level access

• Hadoop offers:
– Scalability (petabytes of data, thousands of machines)
– Flexibility in accepting all data formats (no schema)
– Commodity, inexpensive hardware
– Efficient fault tolerance support

Many enterprises turn to the Hadoop computing paradigm for big data applications.

Page 3:

What is Hadoop?

• Hadoop is a software framework for distributed processing of large datasets across large clusters of commodity-hardware computers:
– Large datasets: terabytes or petabytes of data
– Large clusters: hundreds or thousands of nodes

• Open-source implementation of Google's MapReduce
• Simple programming model: MapReduce
• Simple data model: flexible enough for any data

Page 4:

Hadoop Framework

• Two main layers:
– Distributed file system (HDFS)
– Execution engine (MapReduce)

Hadoop is designed as a master-slave, shared-nothing architecture.

Page 5:

Key Ideas of Hadoop

• Automatic parallelization & distribution
– Hidden from the end-user

• Fault tolerance and automatic recovery
– Failed nodes/tasks recover automatically

• Simple programming abstraction
– Users provide two functions, "map" and "reduce"

Page 6:

Who Uses Hadoop?

• Google: Invented the MapReduce computing paradigm
• Yahoo: Developed Hadoop, the open-source implementation of MapReduce
• Integrators: IBM, Microsoft, Oracle, Greenplum
• Adopters: Facebook, Amazon, AOL, Netflix, LinkedIn
• Many others ...

Page 7:

Hadoop Distributed File System (HDFS)

• Centralized namenode
– Maintains metadata about files

• Many datanodes (1000s)
– Store the actual data
– Files are divided into blocks
– Each block is replicated N times (default N = 3)

[Figure: file F divided into five blocks of 64 MB each]
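
As a concrete illustration of the block and replica layout, the sketch below (not from the slides) uses Hadoop's Java FileSystem API to print each block of a file and the datanodes holding its replicas. The path /data/fileF.txt is a hypothetical stand-in for file F.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/fileF.txt");    // hypothetical file F
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block; each lists the datanodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " replicas on " + java.util.Arrays.toString(b.getHosts()));
        }
    }
}
```

With default settings, a 320 MB file would print five 64 MB blocks, each listing three hosts.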

Page 8:

HDFS File System Properties

• Large Space: An HDFS instance may consist of thousands of server machines for storage

• Replication: Each data block is replicated (see the sketch after this list)
• Failure: Failure is the norm rather than the exception

• Fault Tolerance: Automated detection of faults and recovery
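
To tie replication to the API: a minimal sketch (an assumption, not shown in the slides; the path is hypothetical) that raises the replica count of a single file. The namenode then re-replicates that file's blocks in the background.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Request 5 replicas for this (hypothetical) file instead of the default 3.
        boolean accepted = fs.setReplication(new Path("/data/important.log"), (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}
```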

Page 9:

MapReduce Execution Engine (Example: Color Count)

[Figure: Color Count data flow]
• Input blocks on HDFS feed the map tasks
• Map produces (k, v) pairs, e.g., (color, 1)
• Shuffle & sorting groups the pairs by key k
• Reduce consumes (k, [v]) pairs, e.g., (color, [1,1,1,1,1,1, ...])
• Reduce produces (k', v') pairs, e.g., (color, 100)

Users only provide the “Map” and “Reduce” functions
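
The slides show the data flow but no code; a minimal Java sketch of the two user-provided functions for Color Count could look as follows, assuming each input record (one line) holds a single color name. Class and field names here are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ColorCount {

    // Map: consumes (byte offset, line), produces (color, 1).
    public static class ColorMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumption: each line is exactly one color name.
            ctx.write(new Text(line.toString().trim()), ONE);
        }
    }

    // Reduce: consumes (color, [1,1,1,...]) after shuffle/sort, produces (color, total).
    public static class ColorReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text color, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            ctx.write(color, new IntWritable(sum));
        }
    }
}
```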

Page 10:

MapReduce Engine

• Job Tracker is the master node (runs with the namenode)
– Receives the user's job
– Decides how many tasks will run (number of mappers)
– Decides where to run each mapper (locality)

• Example: this file has 5 blocks, so run 5 map tasks
• Run the task reading block "1" on Node 1 or Node 3

[Figure: replicas of file F's blocks placed across Node 1, Node 2, and Node 3]

Page 11:

MapReduce Engine

• Task Tracker is the slave node (runs on each datanode)
– Receives the task from the Job Tracker
– Runs the task to completion (either a map or a reduce task)
– Communicates with the Job Tracker to report its progress

Example: one MapReduce job consisting of 4 map tasks and 3 reduce tasks (see the driver sketch below).
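
A sketch of the driver that would submit such a job, reusing the illustrative ColorCount classes from the earlier sketch: the developer sets the number of reduce tasks explicitly (3 here, matching the example), while the number of map tasks is derived from the input splits and is not set by hand. This follows the Hadoop 1.x (JobTracker/TaskTracker) API current in Fall 2013.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "color count");
        job.setJarByClass(ColorCountDriver.class);
        job.setMapperClass(ColorCount.ColorMapper.class);
        job.setReducerClass(ColorCount.ColorReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Chosen by the developer: 3 reduce tasks, as in the example above.
        job.setNumReduceTasks(3);
        // NOT set here: the number of map tasks, which the Job Tracker derives
        // from the input splits (one per HDFS block by default).

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```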

Page 12:

About Key-Value Pairs

• The developer provides the Mapper and Reducer functions
• The developer decides what is the key and what is the value
• The developer must follow the key-value pair interface

• Mappers:
– Consume <key, value> pairs
– Produce <key, value> pairs

• Shuffling and Sorting:
– Groups all similar keys from all mappers
– Sorts them and passes them to a certain reducer
– In the form of <key, <list of values>>

• Reducers:
– Consume <key, <list of values>>
– Produce <key, value>
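
In Hadoop's Java API, this key-value pair interface shows up as the generic type parameters of the Mapper and Reducer base classes. The stub classes below are illustrative only:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
// consumes (KEYIN, VALUEIN) pairs, produces (KEYOUT, VALUEOUT) pairs.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
// consumes (KEYIN, Iterable<VALUEIN>) after shuffle/sort,
// produces (KEYOUT, VALUEOUT) pairs.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }
```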

Page 13:

MapReduce Phases

Page 14:

Another Example: Word Count

• Job: Count occurrences of each word in a data set

[Figure: word count data flow through map tasks and reduce tasks]
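
A minimal sketch of the word count mapper (illustrative, not taken from the slides). Unlike Color Count, one input line can emit many (word, 1) pairs; the reducer is the same summing reducer as in the Color Count sketch.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for every token.
        StringTokenizer tok = new StringTokenizer(line.toString());
        while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            ctx.write(word, ONE);
        }
    }
}
```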

Page 15:

Summary: Hadoop vs. Typical DB

Computing Model
– Distributed DBs: notion of transactions; a transaction is the unit of work; ACID properties, concurrency control
– Hadoop: notion of jobs; a job is the unit of work; no concurrency control

Data Model
– Distributed DBs: structured data with a known schema; read/write mode
– Hadoop: any data format; read-only mode

Cost Model
– Distributed DBs: expensive servers
– Hadoop: cheap commodity machines

Fault Tolerance
– Distributed DBs: failures are rare; recovery mechanisms
– Hadoop: failures are common over thousands of machines; simple fault tolerance

Key Characteristics
– Distributed DBs: efficiency, powerful optimizations
– Hadoop: scalability, flexibility, fault tolerance