Cloud Computing: Hadoop


DESCRIPTION

Data Processing in the Cloud with Hadoop, from the Data Services World conference.

TRANSCRIPT

Page 1: Cloud Computing: Hadoop

Data Processing in the Cloud

Parand Tony Darugar

http://parand.com/say/

[email protected]

Page 2: Cloud Computing: Hadoop

What is Hadoop

Flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.

Page 3: Cloud Computing: Hadoop

Why?

A common infrastructure pattern extracted from building distributed systems:
- Scale
- Incremental growth
- Cost
- Flexibility

Page 4: Cloud Computing: Hadoop

Built-in Resilience to Failure

When dealing with large numbers of commodity servers, failure is a fact of life.

Assume failure, build protections and recovery into your architecture:
- Data-level redundancy
- Job/task-level monitoring and automated restart and re-allocation

Page 5: Cloud Computing: Hadoop

Current State of Hadoop Project

- Top-level Apache Foundation project
- In production use at Yahoo, Facebook, Amazon, IBM, Fox, NY Times, Powerset, …
- Large, active user base, mailing lists, user groups
- Very active development, strong development team

Page 6: Cloud Computing: Hadoop

Widely Adopted

A valuable and reusable skill set:
- Taught at major universities
- Easier to hire for
- Easier to train on
- Portable across projects, groups

Page 7: Cloud Computing: Hadoop

Plethora of Related Projects

- Pig
- Hive
- HBase
- Cascading
- Hadoop on EC2
- JAQL, X-Trace, Happy, Mahout

Page 8: Cloud Computing: Hadoop


What is Hadoop

The Linux of distributed processing.

Page 9: Cloud Computing: Hadoop

How Does Hadoop Work?

Page 10: Cloud Computing: Hadoop

Hadoop File System

A distributed file system for large data:
- Your data in triplicate
- Built-in redundancy, resiliency to large-scale failures
- Intelligent distribution, striping across racks
- Accommodates very large data sizes
- On commodity hardware
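To make the file-system model concrete, here is a minimal sketch of loading a file into HDFS with Hadoop's Java FileSystem API and checking its replication factor; the file and directory names are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPut {
        public static void main(String[] args) throws Exception {
            // Picks up cluster settings (fs.defaultFS, dfs.replication) from
            // the configuration files on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS; the file system splits it into blocks
            // and stores each block in triplicate (the default replication).
            Path src = new Path("/tmp/access.log");        // hypothetical local file
            Path dst = new Path("/data/logs/access.log");  // hypothetical HDFS path
            fs.copyFromLocalFile(src, dst);

            // Confirm the replication factor the file was stored with.
            FileStatus status = fs.getFileStatus(dst);
            System.out.println("replication = " + status.getReplication());
        }
    }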

Page 11: Cloud Computing: Hadoop

Programming Model: Map/Reduce

Very simple programming model:
- Map(anything) -> key, value
- Sort, partition on key
- Reduce(key, value) -> key, value

No parallel processing / message passing semantics

Programmable in Java or any other language (streaming)
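As an illustration of the model (not part of the original deck), the canonical word count in the Java MapReduce API: Map emits a (word, 1) pair per word, the framework sorts and partitions on the word, and Reduce sums the counts.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map(anything) -> key, value: one (word, 1) pair per word in the line.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(key, values) -> key, value: sum the 1s emitted for each word.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }

Note there is no message passing here: the mapper and reducer are plain sequential code, and the framework supplies all the parallelism.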

Page 12: Cloud Computing: Hadoop

Processing Model

Create or allocate a cluster.

Put data onto the file system:
- Data is split into blocks, stored in triplicate across your cluster

Run your job:
- Your Map code is copied to the allocated nodes, preferring nodes that contain copies of your data
  • Move computation to data, not data to computation

Page 13: Cloud Computing: Hadoop

Processing Model (continued)

- Monitor workers, automatically restarting failed or slow tasks
- Gather output of Map; sort and partition on key
- Run Reduce tasks
- Monitor workers, automatically restarting failed or slow tasks

Results of your job are now available on the Hadoop file system. The driver sketch below ties these steps together.
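A minimal driver for the word-count classes sketched earlier, wiring them into a job against hypothetical HDFS paths; submission, monitoring, and restarting of failed or slow tasks are handled by the framework.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Hypothetical paths; the output directory must not already exist.
            FileInputFormat.addInputPath(job, new Path("/data/logs"));
            FileOutputFormat.setOutputPath(job, new Path("/data/wordcounts"));

            // Submit and block until done; results land on the Hadoop file system.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }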

Page 14: Cloud Computing: Hadoop

Hadoop on the Grid

- Managed Hadoop clusters
- Shared resources: improved utilization
- Standard data sets, storage
- Shared, standardized operations management
- Hosted internally or externally (e.g., on EC2)

Page 15: Cloud Computing: Hadoop

Usage Patterns

Page 16: Cloud Computing: Hadoop

ETL

- Put large data sources (e.g., log files) onto the Hadoop File System
- Perform aggregations, transformations, normalizations on the data
- Load into RDBMS / data mart
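As one hypothetical example of the transformation step, a map-only job that parses raw log lines into clean tab-separated records; the input layout (timestamp, user, action) is invented for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Normalization pass: parse each raw log line, drop malformed records,
    // emit a tab-separated row ready for loading into an RDBMS / data mart.
    public class LogNormalizeMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text out = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical input: "timestamp user action", whitespace-separated.
            String[] fields = line.toString().trim().split("\\s+");
            if (fields.length < 3) return;  // skip malformed lines
            out.set(fields[0] + "\t" + fields[1].toLowerCase() + "\t" + fields[2]);
            context.write(out, NullWritable.get());
        }
    }

Configuring the job with job.setNumReduceTasks(0) makes it map-only, so the normalized rows are written straight to the file system.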

Page 17: Cloud Computing: Hadoop

Reporting and Analytics

- Run canned and ad-hoc queries over large data
- Run analytics and data mining operations on large data
- Produce reports for end-user consumption or loading into a data mart

Page 18: Cloud Computing: Hadoop

Data Processing Pipelines

- Multi-step pipelines for data processing
- Coordination, scheduling, data collection and publishing of feeds
- SLA-carrying, regularly scheduled jobs
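One way to express such a pipeline in plain Hadoop is the JobControl API, which submits each job once its dependencies finish. A sketch under the assumption that the two jobs are fully configured elsewhere; the names are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class NightlyPipeline {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Hypothetical two-step pipeline: "aggregate" consumes the output
            // of "extract" (mappers, reducers, and paths configured elsewhere).
            ControlledJob extract =
                new ControlledJob(Job.getInstance(conf, "extract"), null);
            ControlledJob aggregate =
                new ControlledJob(Job.getInstance(conf, "aggregate"), null);
            aggregate.addDependingJob(extract);  // aggregate waits for extract

            JobControl pipeline = new JobControl("nightly-feed");
            pipeline.addJob(extract);
            pipeline.addJob(aggregate);

            // JobControl is a Runnable that submits jobs as dependencies finish.
            new Thread(pipeline).start();
            while (!pipeline.allFinished()) Thread.sleep(1000);
            pipeline.stop();
        }
    }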

Page 19: Cloud Computing: Hadoop

Machine Learning & Graph Algorithms

- Traverse large graphs and data sets, building models and classifiers
- Implement machine learning algorithms over massive data sets

Page 20: Cloud Computing: Hadoop

General Back-End Processing

- Implement significant portions of back-end, batch-oriented processing on the grid
- General computation framework
- Simplify back-end architecture

Page 21: Cloud Computing: Hadoop

What Next?

Download Hadoop:
- http://hadoop.apache.org/

Try it on your laptop

Try Pig:
- http://hadoop.apache.org/pig/

Deploy to multiple boxes

Try it on EC2