is hadoop for you
Post on 26-Jan-2015
1
Is Hadoop For You?
Gwen Shapira, Solutions Architect
2
About Me
• Solution Architect @ Cloudera
• Making our customers successful
• Formerly:
  • Database consultant @ Pythian
  • Specializing in Exadata, RAC, replication
  • Oracle ACED, Oak Table Member
• @gwenshap <- Hadoop tips in 140 characters
3
Agenda
Answer the question: Who needs Hadoop?
4
In more detail…
Getting Started
What you need to succeed
When to Hadoop
Basic Hadoop Architecture
What's so special about Hadoop
[Bar chart: % of session spent on each topic above, axis from 0% to 45%]
5
What’s so special about Hadoop?
Technically Speaking
6
Databases in 1999
1. Buy a really big machine
2. Install an expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another really big machine as a backup
7
Problems:
• Reliability
• Scalability
• Storage throughput
• Complex Upgrades
• Relational only
8
Exadata: State of the Art - 2007
1. Storage and compute in one rack
2. Cluster with Infiniband interconnect
3. Balanced architecture
4. Offloading
5. Parallelism
6. Compression
9
Hadoop
• Distributed File System
• Programming Framework
• Many projects on top
• Open Source (This means free)
10
Designed For:
• Reliability
• Parallel Processing
• Scalability
• Flexibility
11
Reminders:
• Disk does a seek for each I/O operation
• Seeks are expensive (~10 ms)
• Big I/Os mean better throughput
• Network is fast inside rack
• Slower between racks
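The seek-cost reminder can be turned into a quick back-of-the-envelope calculation. The sketch below uses the ~10 ms seek time from the slide; the 100 MB/s sequential transfer rate is an assumption chosen for illustration, not a figure from the deck.

```python
SEEK_SECONDS = 0.010        # ~10 ms per seek, from the slide
TRANSFER_MB_PER_S = 100.0   # assumed sequential transfer rate (not from the slides)


def effective_throughput(io_size_mb: float) -> float:
    """MB/s achieved when every I/O of io_size_mb pays for one seek."""
    transfer_seconds = io_size_mb / TRANSFER_MB_PER_S
    return io_size_mb / (SEEK_SECONDS + transfer_seconds)


if __name__ == "__main__":
    # Compare a small database-style I/O, a 1 MB I/O, and a 64 MB I/O.
    for size_mb in (0.008, 1, 64):
        print(f"{size_mb} MB per I/O: {effective_throughput(size_mb):.1f} MB/s")
```

Under these assumptions a small I/O spends almost all of its time seeking, while a 64 MB I/O runs close to the disk's sequential rate.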
12
The File System
• Files are split into 64M blocks
• 64M!!!
• Distributed
• Replicated
• Write-Once
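To get a feel for what 64M blocks and replication mean for a single file, here is a minimal sketch. The block size comes from the slide; the replication factor of 3 is an assumption (it matches the three DataNodes shown in the write-path slide), and `block_layout` is a hypothetical helper, not an HDFS API.

```python
import math

BLOCK_SIZE_MB = 64   # block size from the slide
REPLICATION = 3      # assumed replication factor


def block_layout(file_size_mb: float) -> dict:
    """Return how many blocks and replicas a file of the given size produces."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "replicas": blocks * REPLICATION,
        # The last block is only as large as the data it holds, so raw
        # storage is simply the file size times the replication factor.
        "raw_storage_mb": file_size_mb * REPLICATION,
    }


if __name__ == "__main__":
    print(block_layout(1000))
```

A 1000 MB file, for example, becomes 16 blocks, stored as 48 replicas spread across the cluster.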
13
HDFS Architecture
[Diagram: one NameNode and four DataNodes]
NameNode holds the metadata: paths, filenames, file sizes, block locations, …
14
HDFS Architecture
[Diagram: one NameNode and four DataNodes]
DataNodes hold the data: blocks, checksums
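15
HDFS Write Path
[Diagram: Client, NameNode, and DN 1 to DN 4 across Rack 1 and Rack 2]
1. Client to NameNode: create("/tmp/myfile")
2. NameNode to Client: write to [DN4,DN3,DN2]
3. Client writes to DN4 and passes on [DN3,DN2]
4. DN4 forwards to DN3 and passes on [DN2]
5. DN3 forwards to DN2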
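The write path can be mimicked with a toy in-memory model. This is only a sketch of the pipeline idea: the classes, the block id format, and the placement policy (pick the last three nodes so the result matches [DN4,DN3,DN2]) are all hypothetical, and real HDFS placement is rack-aware and considerably more involved.

```python
from typing import Dict, List


class DataNode:
    """Toy DataNode: stores blocks in memory and forwards them down a pipeline."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}

    def write(self, block_id: str, data: bytes, pipeline: List["DataNode"]) -> List[str]:
        # Store the local copy, then hand the block to the next node together
        # with the remaining pipeline: [DN3,DN2], then [DN2], then nothing.
        self.blocks[block_id] = data
        acks = [self.name]
        if pipeline:
            acks += pipeline[0].write(block_id, data, pipeline[1:])
        return acks


class NameNode:
    """Toy NameNode: keeps only metadata (path -> DataNode names)."""

    def __init__(self, datanodes: List[DataNode], replication: int = 3) -> None:
        self.datanodes = datanodes
        self.replication = replication
        self.files: Dict[str, List[str]] = {}

    def create(self, path: str) -> List[DataNode]:
        # Hypothetical placement policy: the last `replication` nodes in
        # reverse order, chosen only so the output matches the slide.
        targets = self.datanodes[::-1][: self.replication]
        self.files[path] = [dn.name for dn in targets]
        return targets


def client_write(namenode: NameNode, path: str, data: bytes) -> List[str]:
    """Ask the NameNode for targets, then write to the first one only."""
    targets = namenode.create(path)
    return targets[0].write(path + "#blk0", data, targets[1:])


if __name__ == "__main__":
    nodes = [DataNode(f"DN{i}") for i in range(1, 5)]
    print(client_write(NameNode(nodes), "/tmp/myfile", b"hello"))
```

Note that the client sends the data once; the DataNodes replicate it among themselves, and the NameNode never touches the data itself.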
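16
HDFS Read Path
[Diagram: Client, NameNode, and DN 1 to DN 4 across Rack 1 and Rack 2]
1. Client to NameNode: open("/tmp/myfile", "r")
2. NameNode to Client: read from [DN4,DN3,DN2]
3. Client reads data from a DataNode in that list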
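Because the NameNode returns several replica locations, a reader can fall back to another copy when one is unavailable or damaged. The sketch below illustrates that idea with a toy replica table; `read_block` is a hypothetical helper, and using `zlib.crc32` over the whole block is a simplification of how HDFS actually checksums data.

```python
import zlib
from typing import Dict, List, Optional, Tuple

# Toy replica: (block bytes, checksum recorded when the block was written).
Replica = Tuple[bytes, int]


def read_block(locations: List[str],
               datanodes: Dict[str, Optional[Replica]]) -> Tuple[str, bytes]:
    """Try each replica in the order given; skip dead nodes and corrupt copies."""
    for name in locations:
        replica = datanodes.get(name)
        if replica is None:          # node is down or does not hold the block
            continue
        data, checksum = replica
        if zlib.crc32(data) == checksum:
            return name, data        # first healthy replica wins
    raise IOError("no healthy replica found")


if __name__ == "__main__":
    good = b"some data"
    crc = zlib.crc32(good)
    cluster = {"DN4": None, "DN3": (b"corrupt", crc), "DN2": (good, crc)}
    print(read_block(["DN4", "DN3", "DN2"], cluster))
```

Here DN4 is down and DN3 holds a corrupt copy, so the read is served from DN2.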
17
Map-Reduce
• Java Framework
• Works on Key-Value pairs
• Map:
  • Operate on every element
  • Filter or transform
  • Code runs where the data is stored
• Shuffle:
  • Redistribution of data
• Reduce:
  • Aggregate or Join
18
MapReduce Architecture
[Diagram: JobTracker, NameNode, and DN 1 to DN 4 (running TT 1 to TT 4) across Rack 1 and Rack 2]
JobTracker:
• Gateway for users
• Assigns tasks to TaskTrackers
• Tracks job status
19
MapReduce Architecture
[Diagram: JobTracker, NameNode, and DN 1 to DN 4 (running TT 1 to TT 4) across Rack 1 and Rack 2]
• TaskTrackers execute Map and Reduce tasks assigned by the JobTracker (JT)
20
Word Count Example
21
MapReduce Architecture
[Diagram: JobTracker, NameNode, and DN 1 to DN 4 (running TT 1 to TT 4) across Rack 1 and Rack 2]
wordcount(<files>)
M1 M2 M3 M4 R1
[cat, 1] [dog, 1] [the, 1] [sat, 1]
22
MapReduce Architecture
[Diagram: JobTracker, NameNode, and DN 1 to DN 4 (running TT 1 to TT 4) across Rack 1 and Rack 2]
wordcount(<files>)
M5 M6 M7 M8 R1
[a, 5] [cat, 2] [dog, 1] [the, 4] [mat, 1]
23
Compare to Oracle PX
• Mappers -> Producers
• Reducers -> Consumers
• Shuffle -> Re-distribution
24
In Short
Benefits
• Reliable
• Scalable
• Infinite Flexibility
• Cheap
Challenges
• New skills
• Infinitely Flexible
• Feature-completeness
• Best practices and examples
25
Use Cases
When to Hadoop?
26
When to Hadoop?
When Relational Databases Don’t Add Benefits
27
Non-relational Data
• XML
• Logs
• Geo spatial data
• Video
28
Adding to the Data Warehouse
• ETL
• History
• Some reports
• Rocket Data Science
29
What you Need to Succeed
30
A Problem
31
Right Toolset
32
Toolset
33
Toolset for DBAs
• Hive – Turn SQL to Map-Reduce
• Streaming – Map-Reduce in any language
• Pig – Write and execute execution plans
• Oozie – Coordinate workflows
• Impala – real-time SQL
• HBase – key-value real-time data store
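As an illustration of "Map-Reduce in any language", Hadoop Streaming runs ordinary programs as the mapper and reducer (passed with its `-mapper` and `-reducer` options): each reads lines on standard input and writes tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. The sketch below expresses that contract as two Python functions over lines of text so it can be tried locally. The function names are hypothetical, and in a real job each would live in its own script looping over `sys.stdin`.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per word; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Locally, sorted() stands in for the framework's shuffle and sort step.
    for out in reducer(sorted(mapper(["the cat", "The mat"]))):
        print(out)
```

The same pair of programs could be written in any language that can read and write text streams, which is the point of the Streaming bullet.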
34
Data Model
• Partitions
• Batch processing
• Star Schema
• Materialized Views
• Sort and Compress
• De-normalize
• Tune the data
• Nested data structures
35
Right Hardware
• If possible – POC with your workload
• Sizing by storage
• You probably need to over-provision
• Machine reliability
• Big Data Appliance is a good start
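"Sizing by storage" can be sketched as simple arithmetic. Every number below is an assumption made for illustration (3x replication, 25% of disk reserved for temporary and intermediate data, 24 TB of usable disk per node); none of them come from the slides, so substitute figures from your own POC.

```python
import math


def raw_storage_tb(data_tb: float, replication: int = 3,
                   temp_fraction: float = 0.25) -> float:
    """Raw disk needed: replicated data plus headroom for temporary files."""
    return data_tb * replication / (1.0 - temp_fraction)


def nodes_needed(data_tb: float, replication: int = 3,
                 temp_fraction: float = 0.25,
                 disk_per_node_tb: float = 24.0) -> int:
    """Number of nodes required to hold the raw storage (all inputs assumed)."""
    raw_tb = raw_storage_tb(data_tb, replication, temp_fraction)
    return math.ceil(raw_tb / disk_per_node_tb)


if __name__ == "__main__":
    print(raw_storage_tb(100), nodes_needed(100))
```

With these assumed inputs, 100 TB of data needs about 400 TB of raw disk, which is one reason over-provisioning is usually necessary.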
36
Non-technical Advice
• Your team will have to learn a lot
• Be ready for a challenge
37
Getting Started
38
Why get started?
• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by DWH team
• Big Data == Business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don’t need to learn Java
39
VM, Cloud, or Cluster
40
Books
41
More Books
42
Beginner Projects
• Install a 5-node Hadoop cluster in AWS
• Load data:
  • Complete works of Shakespeare
  • MovieLens database
• Find the 10 most common words in Shakespeare
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use-case: XML ingestion, ETL process, DWH history
43
Need Help?
• I can help:
  • @gwenshap
  • gshapira@cloudera.com
• Hadoop Community:
  • http://community.cloudera.com
  • user@hadoop.apache.org
  • Google group: CDH Users
44