TRANSCRIPT
The Power of Hadoop in Cloud Computing
Joey Echeverria, Solutions Architect
[email protected], @fwiffo
Yahoo! Business Intelligence Before Adopting Hadoop
Copyright 2011, Cloudera, Inc. All Rights Reserved.
[Slide diagram] Instrumentation → Collection → Storage Only Grid (20TB/day, mostly append) → ETL Compute Grid → RDBMS (200GB/day) → BI Reports + Interactive Apps
Moving Data to Compute Doesn't Scale
Couldn't Explore Original Raw Data
BI Problems Before Hadoop
Shrinking ETL Window: 25 hours to process a day's worth of data
No Scalable ETL Reprocessing: recovery from data errors; active archive
Conformation Loss: a new browser agent
No Queries on Raw Data: new product
No Consolidated Repository: cross-product queries
Only SQL: photo/image transcoding, satellite map processing
Yahoo! Business Intelligence After Adopting Hadoop
[Slide diagram] Instrumentation → Collection → Hadoop: Storage + Compute Grid (mostly append) → RDBMS → BI Reports + Interactive Apps
On the Hadoop grid: Complex Data Processing, Data Exploration & Advanced Analytics, ETL and Aggregations
So What is Apache Hadoop?
A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
Core Hadoop has two main components:
Hadoop Distributed File System: self-healing high-bandwidth clustered storage
MapReduce: fault-tolerant distributed processing
Key business values:
Flexible: store any data, run any analysis ("mine first, govern later")
Affordable: cost per TB at a fraction of traditional options
Scalable: start at 1TB/3 nodes, then grow to petabytes/thousands of nodes
Open Source: no lock-in, rich ecosystem, large developer community
Broadly Adopted: a large and active ecosystem, proven to run at scale
One of the key benefits of Hadoop is the ability to dump any type of data into Hadoop; the input record readers then abstract it as if it were structured (i.e., schema on read vs. schema on write).
Open Source Software allows for innovation by partners and customers. It also enables third-party inspection of source code which provides assurances on security and product quality.
1 HDD = 75 MB/sec; 1,000 HDDs = 75 GB/sec. The fileserver head-node bottleneck is eliminated.
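The arithmetic above can be sketched directly. The 75 MB/sec per-disk figure comes from the slide; the 75 TB dataset size is an illustrative assumption:

```python
# Back-of-the-envelope scan-time calculation from the slide's numbers.
# 75 MB/sec per commodity HDD is from the slide; the dataset size is assumed.
DISK_MB_PER_SEC = 75
DATASET_MB = 75 * 1024 * 1024  # 75 TB, expressed in MB (illustrative)

def scan_seconds(n_disks: int) -> float:
    """Time to scan the dataset if reads are spread evenly across n_disks."""
    aggregate_mb_per_sec = n_disks * DISK_MB_PER_SEC
    return DATASET_MB / aggregate_mb_per_sec

print(scan_seconds(1) / 3600)     # one disk: ~291 hours
print(scan_seconds(1000) / 3600)  # a thousand disks: ~0.29 hours (~17.5 minutes)
```

The speedup is linear in the number of disks, which is the point of the axiom that performance shall scale linearly.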
The system is self-healing in the sense that it automatically routes around failure: if a node fails, its workload and data are transparently shifted somewhere else.
The system is intelligent in the sense that the MapReduce scheduler optimizes for processing to happen on the same node that stores the associated data (or on a node co-located on the same leaf Ethernet switch). It also speculatively executes redundant tasks if certain nodes are detected to be slow.
Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Moves to Data
4. Simple Core, Modular and Extensible

Block Size = 64MB, Replication Factor = 3
HDFS: Hadoop Distributed File System
Cost/GB is a few ¢/month vs. $/month. Infinite throughput.
Pool commodity servers in a single hierarchical namespace.
Designed for large files that are written once and read many times.
The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes.
A typical Hadoop node has eight cores with 24GB RAM and four 1TB SATA disks.
The default data block size is 64MB, though most folks now set it to 128MB or even higher.
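As a sketch of how those defaults translate into cluster space, assuming only the 64MB block size and 3x replication mentioned above (the file size is illustrative):

```python
import math

# Sketch of HDFS space accounting using the defaults from the slide:
# 64 MB blocks, replication factor 3. The file size below is illustrative.
def hdfs_footprint(file_mb: float, block_mb: int = 64, replication: int = 3):
    blocks = math.ceil(file_mb / block_mb)   # last block may be partial
    raw_mb = file_mb * replication           # each block is stored 3 times
    return blocks, raw_mb

blocks, raw = hdfs_footprint(1000)  # a 1 GB file
print(blocks)  # 16 blocks of up to 64 MB each
print(raw)     # 3000 MB of raw cluster storage
```

The 3x raw-to-logical ratio is what the commodity-hardware cost argument has to absorb; it still comes out far cheaper per usable TB than traditional storage.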
MapReduce: Distributed Processing
Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is to an RDBMS, which executes the queries, and SQL, which is the language for the queries.
MapReduce can run on top of HDFS or a selection of other storage systems.
Intelligent scheduling algorithms for locality, sharing, and resource optimization.
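A minimal single-process sketch of the MapReduce programming model (the classic word count); the real platform distributes the map, shuffle/sort, and reduce phases across nodes:

```python
from itertools import groupby
from operator import itemgetter

# Toy word count in the MapReduce model, run in one process.
def map_phase(record: str):
    # Emit an intermediate (key, value) pair per word.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Combine all values seen for one key.
    return (key, sum(values))

def run(records):
    # Shuffle/sort: group intermediate pairs by key, as the framework would.
    pairs = sorted(kv for r in records for kv in map_phase(r))
    return dict(
        reduce_phase(k, [v for _, v in group])
        for k, group in groupby(pairs, key=itemgetter(0))
    )

print(run(["the quick brown fox", "the lazy dog"]))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The platform's job is everything `run()` glosses over: partitioning records across nodes, scheduling map tasks near their data, and re-running tasks on failure.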
Agility
Schema-on-Write (RDBMS):
- Schema must be created before data is loaded.
- An explicit load operation has to take place which transforms the data to the database's internal structure.
- New columns must be added explicitly before data for such columns can be loaded into the database.
- Benefits: read is fast; standards/governance.

Schema-on-Read (Hadoop):
- Data is simply copied to the file store; no special transformation is needed.
- A Serializer/Deserializer (SerDe) is applied during read time to extract the required columns.
- New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
- Benefits: load is fast; evolving schemas/agility.
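A toy illustration of schema-on-read, assuming an invented log format and field names; the raw lines stay untouched on disk, and a SerDe-style function extracts columns only at read time:

```python
import re

# Raw records are stored as-is; the format and field names are invented
# purely for illustration.
RAW_LOGS = [
    "2011-06-01 10:01:02 GET /index.html 200",
    "2011-06-01 10:01:05 POST /login 302",
]

LINE_RE = re.compile(r"(\S+) (\S+) (\S+) (\S+) (\S+)")

def deserialize(line):
    # The "schema" lives here, in read-time parsing, not in the storage layer.
    date, time, method, path, status = LINE_RE.match(line).groups()
    return {"date": date, "time": time, "method": method,
            "path": path, "status": int(status)}

# Updating the schema is just updating deserialize(); raw data already on
# disk is reinterpreted retroactively on the next read.
rows = [deserialize(l) for l in RAW_LOGS]
print([r["path"] for r in rows])  # ['/index.html', '/login']
```

This is the mechanism behind "new data can start flowing anytime": nothing about yesterday's files has to change when the parser learns a new column.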
Scalability
Start with a few servers and 10s of TBs. Grow to 1000s of servers and 10s of PBs.
AUTO SCALE
Active Archive: Keep Data Accessible
Low ROB
Return on Byte (ROB) = the value to be extracted from a byte divided by the cost of storing that byte.
If the ROB is < 1, the data gets buried in the tape wasteland; thus we need more economical active storage.
High ROB
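The ROB heuristic can be sketched as a one-liner; the dollar figures below are invented for illustration:

```python
# Sketch of the Return-on-Byte heuristic described above.
def return_on_byte(value_per_tb: float, storage_cost_per_tb: float) -> float:
    """ROB = value extracted from the data / cost of storing it."""
    return value_per_tb / storage_cost_per_tb

# Data worth $50/TB on $400/TB traditional storage: ROB < 1, off to tape.
print(return_on_byte(50, 400))  # 0.125
# The same data on $40/TB commodity Hadoop storage stays queryable.
print(return_on_byte(50, 40))   # 1.25
```

Cheaper storage doesn't change the data's value; it moves the ROB above 1 so the data can stay in an active archive instead of on tape.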
Use The Right Tool For The Right Job
Hadoop:
Use when:
- Structured or Not (Agility)
- Scalable Storage/Compute
- Complex Data Processing

Relational Databases:
Use when:
- Interactive OLAP Analytics (