TRANSCRIPT
The Power of Hadoop in Cloud Computing
Joey Echeverria, Solutions Architect
[email protected], @fwiffo
Yahoo! Business Intelligence Before Adopting Hadoop
Copyright 2011, Cloudera, Inc. All Rights Reserved.
[Slide diagram] Instrumentation → Collection → Storage Only Grid (20TB/day, mostly append) → ETL Compute Grid → RDBMS (200GB/day) → BI Reports + Interactive Apps
Moving Data to Compute Doesn't Scale
Couldn't Explore Original Raw Data
BI Problems Before Hadoop
Shrinking ETL Window: 25 hours to process a day's worth of data
No Scalable ETL Reprocessing: recovery from data errors; active archive
Conformation Loss: a new browser agent
No Queries on Raw Data: new product
No Consolidated Repository: cross-product queries
Only SQL: photo/image transcoding, satellite map processing
Yahoo! Business Intelligence After Adopting Hadoop
[Slide diagram] Instrumentation → Collection → Hadoop: Storage + Compute Grid (mostly append) → RDBMS → BI Reports + Interactive Apps
On the Hadoop grid: Complex Data Processing, Data Exploration & Advanced Analytics, ETL and Aggregations
So What is Apache Hadoop?
A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
Core Hadoop has two main components:
Hadoop Distributed File System: self-healing high-bandwidth clustered storage
MapReduce: fault-tolerant distributed processing
Key business values:
Flexible: store any data, run any analysis ("mine first, govern later")
Affordable: cost per TB at a fraction of traditional options
Scalable: start at 1TB/3 nodes, then grow to petabytes/thousands of nodes
Open Source: no lock-in, rich ecosystem, large developer community
Broadly Adopted: a large and active ecosystem, proven to run at scale
One of the key benefits of Hadoop is the ability to dump any type of data into Hadoop; the input record readers then abstract it as if it were structured (i.e., schema on read vs. schema on write).
Open Source Software allows for innovation by partners and customers. It also enables third-party inspection of source code which provides assurances on security and product quality.
1 HDD = 75 MB/sec; 1,000 HDDs = 75 GB/sec. The fileserver head-node bottleneck is eliminated.
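The arithmetic above can be sketched directly. The 75 MB/sec per-disk figure comes from the slide; the 75 TB dataset size is an illustrative assumption:

```python
# Back-of-the-envelope scan-time calculation from the slide's numbers.
# 75 MB/sec per commodity HDD is from the slide; the dataset size is assumed.
DISK_MB_PER_SEC = 75
DATASET_MB = 75 * 1024 * 1024  # 75 TB, expressed in MB (illustrative)

def scan_seconds(n_disks: int) -> float:
    """Time to scan the dataset if reads are spread evenly across n_disks."""
    aggregate_mb_per_sec = n_disks * DISK_MB_PER_SEC
    return DATASET_MB / aggregate_mb_per_sec

print(scan_seconds(1) / 3600)     # one disk: ~291 hours
print(scan_seconds(1000) / 3600)  # a thousand disks: ~0.29 hours (~17.5 minutes)
```

The speedup is linear in the number of disks, which is the point of the axiom that performance shall scale linearly.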
The system is self-healing in the sense that it automatically routes around failure: if a node fails, its workload and data are transparently shifted somewhere else.
The system is intelligent in the sense that the MapReduce scheduler optimizes for processing to happen on the same node that stores the associated data (or on a node co-located on the same leaf Ethernet switch). It also speculatively executes redundant tasks if certain nodes are detected to be slow.
Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Moves to Data
4. Simple Core, Modular and Extensible

Block Size = 64MB, Replication Factor = 3
HDFS: Hadoop Distributed File System
Cost/GB is a few ¢/month vs. $/month. Infinite throughput.
Pool commodity servers in a single hierarchical namespace.
Designed for large files that are written once and read many times.
The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes.
A typical Hadoop node has eight cores with 24GB RAM and four 1TB SATA disks.
The default data block size is 64MB, though most folks now set it to 128MB or even higher.
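As a sketch of how those defaults translate into cluster space, assuming only the 64MB block size and 3x replication mentioned above (the file size is illustrative):

```python
import math

# Sketch of HDFS space accounting using the defaults from the slide:
# 64 MB blocks, replication factor 3. The file size below is illustrative.
def hdfs_footprint(file_mb: float, block_mb: int = 64, replication: int = 3):
    blocks = math.ceil(file_mb / block_mb)   # last block may be partial
    raw_mb = file_mb * replication           # each block is stored 3 times
    return blocks, raw_mb

blocks, raw = hdfs_footprint(1000)  # a 1 GB file
print(blocks)  # 16 blocks of up to 64 MB each
print(raw)     # 3000 MB of raw cluster storage
```

The 3x raw-to-logical ratio is what the commodity-hardware cost argument has to absorb; it still comes out far cheaper per usable TB than traditional storage.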
MapReduce: Distributed Processing
Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is to an RDBMS, which executes the queries, and SQL, which is the language for the queries.
MapReduce can run on top of HDFS or a selection of other storage systems.
Intelligent scheduling algorithms for locality, sharing, and resource optimization.
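A minimal single-process sketch of the MapReduce programming model (the classic word count); the real platform distributes the map, shuffle/sort, and reduce phases across nodes:

```python
from itertools import groupby
from operator import itemgetter

# Toy word count in the MapReduce model, run in one process.
def map_phase(record: str):
    # Emit an intermediate (key, value) pair per word.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Combine all values seen for one key.
    return (key, sum(values))

def run(records):
    # Shuffle/sort: group intermediate pairs by key, as the framework would.
    pairs = sorted(kv for r in records for kv in map_phase(r))
    return dict(
        reduce_phase(k, [v for _, v in group])
        for k, group in groupby(pairs, key=itemgetter(0))
    )

print(run(["the quick brown fox", "the lazy dog"]))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The platform's job is everything `run()` glosses over: partitioning records across nodes, scheduling map tasks near their data, and re-running tasks on failure.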
Agility
Schema-on-Write (RDBMS):
- Schema must be created before data is loaded.
- An explicit load operation has to take place which transforms the data to the database's internal structure.
- New columns must be added explicitly before data for such columns can be loaded into the database.
- Benefits: read is fast; standards/governance.

Schema-on-Read (Hadoop):
- Data is simply copied to the file store; no special transformation is needed.
- A Serializer/Deserializer (SerDe) is applied during read time to extract the required columns.
- New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.
- Benefits: load is fast; evolving schemas/agility.
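A toy illustration of schema-on-read, assuming an invented log format and field names; the raw lines stay untouched on disk, and a SerDe-style function extracts columns only at read time:

```python
import re

# Raw records are stored as-is; the format and field names are invented
# purely for illustration.
RAW_LOGS = [
    "2011-06-01 10:01:02 GET /index.html 200",
    "2011-06-01 10:01:05 POST /login 302",
]

LINE_RE = re.compile(r"(\S+) (\S+) (\S+) (\S+) (\S+)")

def deserialize(line):
    # The "schema" lives here, in read-time parsing, not in the storage layer.
    date, time, method, path, status = LINE_RE.match(line).groups()
    return {"date": date, "time": time, "method": method,
            "path": path, "status": int(status)}

# Updating the schema is just updating deserialize(); raw data already on
# disk is reinterpreted retroactively on the next read.
rows = [deserialize(l) for l in RAW_LOGS]
print([r["path"] for r in rows])  # ['/index.html', '/login']
```

This is the mechanism behind "new data can start flowing anytime": nothing about yesterday's files has to change when the parser learns a new column.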
Scalability
Start with a few servers and 10s of TBs. Grow to 1000s of servers and 10s of PBs.
AUTO SCALE
Active Archive: Keep Data Accessible
Low ROB
Return on Byte (ROB) = the value to be extracted from a byte divided by the cost of storing that byte.
If the ROB is < 1, the data gets buried in the tape wasteland; thus we need more economical active storage.
High ROB
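The ROB heuristic can be sketched as a one-liner; the dollar figures below are invented for illustration:

```python
# Sketch of the Return-on-Byte heuristic described above.
def return_on_byte(value_per_tb: float, storage_cost_per_tb: float) -> float:
    """ROB = value extracted from the data / cost of storing it."""
    return value_per_tb / storage_cost_per_tb

# Data worth $50/TB on $400/TB traditional storage: ROB < 1, off to tape.
print(return_on_byte(50, 400))  # 0.125
# The same data on $40/TB commodity Hadoop storage stays queryable.
print(return_on_byte(50, 40))   # 1.25
```

Cheaper storage doesn't change the data's value; it moves the ROB above 1 so the data can stay in an active archive instead of on tape.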
Use The Right Tool For The Right Job
Hadoop:
Use when:
- Structured or Not (Agility)
- Scalable Storage/Compute
- Complex Data Processing

Relational Databases:
Use when:
- Interactive OLAP Analytics (