intro to the hadoop stack @ april 2011 javamug
DESCRIPTION
Covers high-level concepts of different pieces of the Hadoop project: HDFS, MapReduce, HBase, Hive, Pig & ZooKeeper
TRANSCRIPT
About Me
Meetup organizer for DFWBigData.org
> Hadoop, Cassandra, and all other things BigData and NoSQL
> Join up!
Sr. Consultant @
> Rapidly growing national IT consulting firm focused on career development while operating within a local-office project model
@engfer
What is Hadoop?
0 “framework for running [distributed] applications on large cluster built of commodity hardware” –from Hadoop Wiki
0 Originally created by Doug Cutting
> Named the project after his son’s toy elephant
0 The name “Hadoop” has now evolved to cover a family of products, but at its core, it’s essentially just the MapReduce programming paradigm + a distributed file system
Marty McFly?
History
>_< Growing Pains +
Jeffrey Dean: lots of data + tape backup + expensive servers + high network bandwidth + expensive databases + non-linear scalability + etc. (http://bit.ly/ec31VL + http://bit.ly/gq84Ot)
History
>_< Growing Pains + +
Solutions
History
>_< Growing Pains + +
Solutions
White Papers: Google File System • 2003
MapReduce • 2004
BigTable • 2006
History
Hadoop Core
c. 2005
Hadoop Distributed File System (HDFS)
0 OSS implementation of Google File System (bit.ly/ihXkof)
0 Master/slave architecture
0 Designed to run on commodity hardware
0 Hardware failures assumed in design
0 Fault-tolerant via replication
0 Semi-POSIX compliance; relaxed for performance
0 Unix-like permissions; ties into host’s users & groups
Hadoop Distributed File System (HDFS)
0 Written in Java
0 Optimized for larger files
0 Focus on streaming data (high-throughput > low-latency)
0 Rack-aware
0 Only *nix for production env.
0 Web consoles for stats
HDFS Client API’s
0 “Shell-like” commands (hadoop dfs [cmd])
> cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
0 Native Java API
0 API for other languages (http://bit.ly/fLgCJC)
> C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml
Other HDFS Admin Tools
0 hadoop dfsadmin [opts]
> Basic admin utilities for the DFS cluster
> Change file-level replication factors, set quotas, upgrade, safemode, reporting, etc
0 hadoop fsck [opts]
> Runs distributed file system checking and fixing utility
0 hadoop balancer
> Utility that rebalances block storage across the nodes
HDFS Node Types
NameNode (Master)
0 Single node responsible for:
> Filesystem metadata operations on cluster
> Replication and locations of file blocks
0 SPOF =(
CheckpointNode or BackupNode (backups)
0 Nodes responsible for:
> NameNode backup mechanisms
DataNode (Slaves)
0 Nodes responsible for:
> Storage of file blocks
> Serving actual file data to clients
HDFS Architecture
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
(namespace backups)
FS/namespace/meta ops
serving data -->
(heartbeats, balancing, replication, etc)
nodes write to local disk
HDFS Architecture
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
data Xfer
(block locations, FS ops, etc) <No file data!!>
Putting files on HDFS
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
return block size and nodes for each block
client buffers blocks to local disk… {64MB}
(based on “replication factor”) (3 by default)
Putting files on HDFS
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
While buffering to local disk, the client Xfers block directly
to assigned data nodes
{node1, node2, node3}
(based on “replication factor”)
Putting files on HDFS
NameNode BackupNode
DataNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
{node1, node3, node5}
DataNode DataNode DataNode DataNode
While buffering to local disk, the client Xfers block directly
to assigned data nodes
Putting files on HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
{node1, node4, node5}
DataNode DataNode DataNode DataNode DataNode
While buffering to local disk, the client Xfers block directly
to assigned data nodes
Putting files on HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
{node2, node3, node4}
DataNode DataNode DataNode DataNode DataNode
While buffering to local disk, the client Xfers block directly
to assigned data nodes
DataNode DataNode DataNode DataNode DataNode
Putting files on HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
{node2, node4, node5}
While buffering to local disk, the client Xfers block directly
to assigned data nodes
DataNode DataNode DataNode DataNode DataNode
Putting files on HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
Ad nauseam…
DataNode DataNode DataNode DataNode DataNode
Getting files from HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
return locations of blocks for file
Stream blocks from data nodes
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
NameNode detects DataNode loss
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
Blocks are auto-replicated on remaining nodes to satisfy replication factor
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
Blocks are auto-replicated on remaining nodes to satisfy replication factor
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
Blocks are auto-replicated on remaining nodes to satisfy replication factor
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
NameNode loss = FAIL (requires manual intervention)
**automatic failover is in the works
not an EPIC fail because you have the backup node to replay
any FS operations
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
NameNode detects new DataNode is added to cluster
DataNode
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
Blocks are re-balanced and re-distributed
DataNode
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
Blocks are re-balanced and re-distributed
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
Blocks are re-balanced and re-distributed
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
Once replication factor is satisfied, extra replicas are removed
HDFS Demonstration
Other HDFS Utils
0 HDFS Raid (http://bit.ly/fqnzs5)
> Uses distributed RAID instead of replication (useful at Petabyte scale)
0 Flume/Scribe/Chukwa
> Log collection and aggregation frameworks that support streaming log data to HDFS
> Flume = Cloudera (http://bit.ly/gX8LeO)
> Scribe = Facebook (http://bit.ly/dIh3If)
from flume wiki
MapReduce
0 Distributed programming paradigm and framework that is the OSS implementation of Google’s MapReduce (http://bit.ly/gXZbsk)
0 Modeled using the ideas behind functional programming map() and reduce() operations
> Distributed on as many nodes as you would like
0 2-phase process:
> map( ) = sub-divide & conquer
> reduce( ) = combine & reduce cardinality
MapReduce ABC’s
0 Essentially, it’s…
1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output from all sub-problems
0 Ex: Searching
1. Take a large problem and divide it into sub-problems
# Different groups of rows in DB; different parts of files; 1 user from a list of users; etc.
2. Perform the same function on all sub-problems
# Search for a key in the given partition of data for the sub-problem; count words; etc.
3. Combine the output from all sub-problems
# Combine the results into a result-set and return to the client
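The three steps above can be sketched in plain Java — no Hadoop classes, just standard collections — with the "divide" as a list of input partitions, the "map" emitting (word, 1) pairs, and the "combine" grouping by key and summing. This is only a single-process illustration of the paradigm, not Hadoop's actual API:

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of divide / map / combine using plain Java collections.
public class ToyMapReduce {

    public static Map<String, Integer> wordCount(List<String> partitions) {
        return partitions.stream()
                // map phase: each partition emits its words (conceptually (word, 1) pairs)
                .flatMap(part -> Arrays.stream(part.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                // shuffle/sort + reduce phase: group by key and sum the counts
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("foo bar foo", "baz foo bar");
        System.out.println(wordCount(parts)); // {bar=2, baz=1, foo=3}
    }
}
```

In real Hadoop the partitions live on different DataNodes and each map() runs where its data is, but the key/value flow is the same.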
M/R Facts
0 M/R is excellent for problems where the “sub-problems” are not interdependent
> For example, the output of one “mapper” should not depend on the output or communication with another “mapper”
0 The reduce phase does not begin execution until all mappers have finished
0 Failed map and reduce tasks get auto-restarted
0 Rack/HDFS-aware
MapReduce Visualized
Mapper
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Input
<keyi, valuei>
<keyi, valuei>
<keyi, valuei>
<keyi, valuei>
<keyA, valuea> <keyB, valueb> <keyC, valuec> …
<keyA, valuea> <keyB, valueb> <keyC, valuec> …
<keyA, valuea> <keyB, valueb> <keyC, valuec> …
<keyA, valuea> <keyB, valueb> <keyC, valuec> …
<keyA, list(valuea,valueb, valuec,…)>
<keyB, list(valuea,valueb, valuec,…)>
<keyC, list(valuea,valueb, valuec,…)>
Sort and
group by
key
Output
Input
Example: Word Count
Mapper
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Input
<?, file1_part1>
<?, file2_part2>
<?, file1_part2>
<?, file2_part1>
<“foo”, 3> <“bar”, 14> <“baz”, 6> …
<“foo”, 21> <“bar”, 78> <“baz”, 12> …
<“foo”, 11> <“bar”, 22> <“baz”, 31> …
<“foo”, 1> <“bar”, 41> <“baz”, 10> …
<“foo”, (3, 21, 11, 1)>
<“bar”, (14, 78, 22, 41)>
<“baz”, (6, 12, 31, 10)>
Sort and
group by
key
bar,155 baz,59 foo,36 …
Lots of Big Files
count()
count()
count()
count()
sum()
sum()
sum()
Hadoop’s MapReduce
0 MapReduce tasks are submitted as a “job”
> Jobs can be assigned to a specified “queue” of jobs
# By default, jobs are submitted to the “default” queue
> Job submission is controlled by ACL’s for each queue
0 Rack-aware and HDFS-aware
> The JobTracker communicates with the HDFS NameNode and schedules map/reduce operations using input data locality on HDFS DataNodes
M/R Nodes
JobTracker (Master)
0 Single node responsible for:
> Coordinating all M/R tasks & events
> Managing job queues and scheduling
> Maintains and controls TaskTrackers
> Moves/restarts map/reduce tasks if needed
0 SPOF =(
> Uses “checkpointing” to combat this
TaskTracker (Slaves)
0 Worker nodes responsible for:
> Executing individual map and reduce tasks as assigned by JobTracker (in separate JVM)
Conceptual Overview
JobTracker
TaskTracker TaskTracker TaskTracker TaskTracker
Temporary data stored on HDFS
JobTracker controls and heartbeats TaskTracker nodes
TaskTrackers store temp data on HDFS
Job Submission
JobTracker
TaskTracker TaskTracker TaskTracker TaskTracker
Temporary data stored on HDFS
submit jobs to JobTracker M/R Client
M/R Client
M/R Client
Mapper Mapper Mapper Mapper
jobs get queued
map()’s are assigned to TaskTrackers (HDFS DataNode locality aware)
mappers store results on HDFS
mappers spawned in separate JVM and execute
Job Submission
JobTracker
TaskTracker
Reducer
TaskTracker
Reducer
TaskTracker
Reducer
TaskTracker
Reducer
Temporary data stored on HDFS
submit jobs to JobTracker M/R Client
M/R Client
M/R Client jobs get queued
reduce phase begins
tmp data read from HDFS
MapReduce Tips
0 Keys and values can be any type of object
> Can specify custom data splitters, partitioners, combiners, InputFormat’s, and OutputFormat’s
0 Use ToolRunner.run(Tool) to run your Java jobs…
> Will use GenericOptionsParser and DistributedCache so that the -files, -libjars, & -archives options are available to distribute your mappers, reducers, and any other dependencies
> Without this, your mappers, reducers, and other utilities will not be propagated and added to the classpath of the other nodes (ClassNotFoundException)
MapReduce Demonstration
Other M/R Utils
0 $HADOOP_HOME/contrib/*
> PriorityScheduler & FairScheduler
> HOD (Hadoop On Demand)
# Uses TORQUE resource manager to dynamically allocate, use, and destroy MapReduce clusters on an as-needed basis
# Great for development and testing
> Hadoop Streaming (next slide...)
0 Amazon’s Elastic MapReduce (EMR)
> Essentially production HOD for EC2 data/clusters
Hadoop Streaming
0 Allows you to write MapReduce jobs in languages other than Java by running any command line process
> Input data is partitioned and given to the standard input (STDIN) of the command line mappers and reducers specified
> Output (STDOUT) from the command line mappers and reducers gets combined into the M/R pipeline
0 Can specify custom partitioners and combiners
0 Can specify files & archives to propagate to all nodes and unpack on local file system (-archives & -file)
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar
-D mapred.job.name="Foo bar"
-archives 'hdfs://hadoop1/foo/bar/cachedir.jar'
-input "/foo/bar/input.txt"
-mapper splitz.py
-reducer /bin/wc
-output "/foo/baz/out"
-file ~/scripts/splitz.py
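The streaming contract is just "read records from STDIN, write key<TAB>value lines to STDOUT". A mapper can therefore be any executable; as a sketch, here is a hypothetical stand-in (in Java, for consistency with the rest of the talk) for a word-splitting mapper like the splitz.py above:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Minimal streaming mapper: reads lines from STDIN and emits one
// word<TAB>1 pair per word to STDOUT. Illustrative only -- any language
// that can read stdin and write stdout works the same way.
public class SplitMapper {

    // Pure function so the per-line behavior is easy to see (and test).
    public static String mapLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) out.append(word).append('\t').append(1).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.print(mapLine(line));
        }
    }
}
```

Hadoop streaming then sorts these lines by key and feeds each key group to the reducer's STDIN.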
Pig
0 Framework and language (Pig Latin) for creating and submitting Hadoop MapReduce jobs
0 Common data operations (not supported by POJO-M/R) like join, group, filter, sort, select, etc. are provided
0 Don’t need to know Java
0 Removes boilerplate aspect from M/R
> 200 lines in Java → 15 lines in Pig!
0 Relational qualities (reads and feels SQL-ish)
Pig
0 Fact from Wiki: 40% of Yahoo’s M/R jobs are in Pig
0 Interactive shell (grunt) exists
0 User Defined Functions (UDF)
> Allows you to specify Java code where the logic may be too complex for Pig Latin
> UDF’s can be part of most every operation in Pig Latin
> Great for loading and storing custom formats as well as transforming data
Pig Relational Operations
COGROUP
CROSS
DISTINCT
FILTER
FOREACH
GROUP
JOIN
LIMIT
LOAD
MAPREDUCE
ORDER BY
SAMPLE
SPLIT
STORE
STREAM
UNION
most of these are pretty self-explanatory
Example Pig Script
01: REGISTER ./tutorial.jar;
02: raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
03: clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
04: clean2 = FOREACH clean1 GENERATE user, time,
org.apache.pig.tutorial.ToLower(query) as query;
05: houred = FOREACH clean2 GENERATE user,
org.apache.pig.tutorial.ExtractHour(time) as hour, query;
06: ngramed1 = FOREACH houred GENERATE user, hour,
flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
07: ngramed2 = DISTINCT ngramed1;
08: hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
09: hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0),
COUNT($1) AS count;
10: hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 as ngram,
$1 as hour, $2 as count;
11: hour00 = FILTER hour_frequency2 BY hour eq '00';
12: hour12 = FILTER hour_frequency3 BY hour eq '12';
13: same = JOIN hour00 BY $0, hour12 BY $0;
14: same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram as
ngram, $2 as count00, $5 as count12;
15: STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();
Taken from Pig tutorial on Pig wiki: The Temporal Query Phrase Popularity script processes a search query log file from the Excite search engine and compares the frequency of occurrence of search phrases across two time periods separated by twelve hours.
UDF’s
Now... imagine this equivalent in Java...
ZooKeeper
0 Centralized coordination service for use by distributed applications
> Configuration, naming, synchronization (locks), ownership (master election), etc.
0 Important system guarantees:
> Sequential consistency (great for locking)
> Atomicity – all or nothing at all
> Data consistency – all clients view same system state regardless of the server it connects to
ZooKeeper Service
Server Server Server Server Server
Leader!
Client Client Client Client Client Client Client Client
ZooKeeper
0 Hierarchical namespace of “znodes” (like directories)
0 Operations:
> create a node at a location in the tree
> delete a node
> exists - tests if a node exists at a location
> get data from a node
> set data on a node
> get children from a node
> sync - waits for data to be propagated
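The operations above act on a hierarchical namespace of znodes, each of which can hold a small blob of data. As a toy in-memory model (this is NOT the real ZooKeeper client API, just a map from slash-separated paths to byte arrays to make the tree shape concrete):

```java
import java.util.*;

// Toy model of ZooKeeper's znode namespace: paths like /locks/lock-0001,
// each znode carrying data, with create/exists/get/set/delete/getChildren.
public class ToyZnodeTree {
    private final NavigableMap<String, byte[]> nodes = new TreeMap<>();

    public ToyZnodeTree() { nodes.put("/", new byte[0]); }  // root znode

    public void create(String path, byte[] data) { nodes.put(path, data); }

    public boolean exists(String path) { return nodes.containsKey(path); }

    public byte[] getData(String path) { return nodes.get(path); }

    public void setData(String path, byte[] data) { nodes.put(path, data); }

    public void delete(String path) { nodes.remove(path); }

    // children = direct descendants only (exactly one extra path segment)
    public List<String> getChildren(String path) {
        String prefix = path.endsWith("/") ? path : path + "/";
        List<String> kids = new ArrayList<>();
        for (String p : nodes.keySet()) {
            if (p.startsWith(prefix) && !p.equals(path)
                    && p.indexOf('/', prefix.length()) < 0) {
                kids.add(p.substring(prefix.length()));
            }
        }
        return kids;
    }

    public static void main(String[] args) {
        ToyZnodeTree zk = new ToyZnodeTree();
        zk.create("/locks", new byte[0]);
        zk.create("/locks/lock-0001", "owner-a".getBytes());
        System.out.println(zk.getChildren("/locks")); // [lock-0001]
    }
}
```

The lock example hints at how master election works: contenders create child znodes under a well-known parent, and the lowest-numbered child wins.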
HBase
0 Sparse, non-relational, column-oriented distributed database built on top of Hadoop Core (HDFS + MapReduce)
0 Modeled after Google’s BigTable (http://bit.ly/fQ1NMA)
0 NoSQL
0 HBase also has:
> Strong consistency model
> In-memory operation
> LZO compression (optional)
> Live migrations
> MapReduce support for querying
Not Only SQL... ...not “SQL is terrible”
What HBase Is…
0 Good at fast/streaming writes
0 Fault tolerant
0 Good at linear horizontal scalability
0 Very efficient at managing billions of rows and millions of columns
0 Good at keeping row history
0 Good at auto-balancing
0 A complement to a SQL DB/warehouse
0 Great with non-normalized data
What HBase Is NOT…
0 Made for table joins
0 Made for splitting into normalized tables (see previous)
0 A complete replacement for a SQL relational database
0 A complete replacement for a SQL data warehouse
0 Great for storing small amounts of data
0 Great for storing gobs of large binary data
0 The best way to do OLTP
0 The best way to do live ad-hoc querying of any column
0 A replacement for a proper caching mechanism
0 ACID compliant (http://bit.ly/hhFXCS)
HBase Facts
0 Written in Java
0 Uses ZooKeeper to store metadata and -ROOT- region
0 Column-oriented store = flexible schema
> Can alter the schema simply by adding the column name and data on insert (“put”)
> No schema migrations!
0 Every column has a timestamp associated with it
> Same column with most recent timestamp wins
0 Can export metrics for use with Ganglia, or as JMX
0 hbase hbck
> Check for errors and fix them (like HDFS fsck)
HBase Client API’s
0 jRuby interactive shell (hbase shell)
> DDL/DML commands
> Admin commands
> Cluster commands
0 Java API (http://bit.ly/ij0MgF)
0 REST API
> Provided using Stargate
0 API for other languages (http://bit.ly/fLgCJC)
Column-Oriented?
0 Traditional RDBMSs use row-oriented storage, which stores entire rows sequentially on disk
0 Whereas column-oriented storage stores only the columns present for each row (or column-families) sequentially on disk
Row-oriented: [Row 1 – Cols 1-3] [Row 2 – Cols 1-3] [Row 3 – Cols 1-3]
Column-oriented: [Row 1 – Col 1] [Row 2 – Col 1] [Row 3 – Col 1] [Row 1 – Col 2] [Row 2 – Col 2] [Row 3 – Col 2] [Row 1 – Col 3] [Row 3 – Col 3]
Where’s Row 2 – Col 2? Not needed because columns are stored sequentially, so rows have flexible schema!
Think of HBase Tables As…
0 More like JSON
> And less like spreadsheets
{
"1" : {
"A" : { v: "x", ts: 4282 },
"B" : { v: "z", ts: 4282 }
},
"aaaaa" : {
"A" : { v: "y", ts: 4282 }
},
"xyz" : {
"address" : {
"line1" : { v: "hello", ts: 4282 },
"line2" : { v: "there", ts: 4282 },
"line2" : { v: "there", ts: 1234 }
},
"fooo" : { v: "wow!", ts: 4282 }
},
"zzzzz" : {
"A" : { v: "woot", ts: 4282 },
"B" : { v: "1337", ts: 4282 }
}
}
Modified from http://bit.ly/hbGWIG
column families allow grouping of columns (faster retrieval)
recent TS = default col value old TS
row id
columns
value & timestamp (TS)
flexible schema
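The JSON-like structure above is essentially nested maps: row id → column family:column → timestamp → value, where a plain read returns the value with the most recent timestamp. A toy model (not real HBase, and no column-family grouping on disk — just the data shape):

```java
import java.util.*;

// Toy model of an HBase table: rowId -> "family:qualifier" -> (ts -> value),
// sorted so the newest timestamp comes first and wins on a default read.
public class ToyHTable {
    private final Map<String, Map<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();  // rows sorted by row key, like HBase

    public void put(String row, String col, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(col, c -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);  // old versions are kept, keyed by timestamp
    }

    // default read: the most recent version of the cell wins
    public String get(String row, String col) {
        Map<String, NavigableMap<Long, String>> cols = rows.get(row);
        if (cols == null || !cols.containsKey(col)) return null;  // sparse: absent is fine
        return cols.get(col).firstEntry().getValue();
    }

    public static void main(String[] args) {
        ToyHTable t = new ToyHTable();
        t.put("xyz", "address:line2", 1234L, "there");
        t.put("xyz", "address:line2", 4282L, "THERE");
        System.out.println(t.get("xyz", "address:line2")); // THERE (newest ts)
    }
}
```

Note how a missing cell simply has no entry — nothing is stored for it, which is what "sparse" means in the BigTable model.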
HBase Overview
The Master server keeps track of the metadata for RegionServers and their containing Regions and stores it in ZooKeeper
Data is sent using the client
The HBase client communicates with the ZooKeeper cluster only to get Region information; no data is sent through the Master
The actual row “data” (bytes) is sent directly to and from the RegionServers
Therefore, neither the Master server nor the ZooKeeper cluster serves as a data bottleneck
Pretty diagrams from Lars George http://goo.gl/wRLJP & http://goo.gl/6ehnV
HBase Overview
HDFS breaks files into 64MB chunks and replicates the chunks N times (3 by default) to store on “actual” disk (giving HBase its fault tolerance)
All HBase data (HLog and HFiles) are stored on HDFS
Pretty diagrams from Lars George http://goo.gl/wRLJP
Understanding HBase
Tables are split into contiguous ranges of rows (split size is configurable) called Regions
HRegions Table
Regions are assigned to particular RegionServers by the Master server. The Master only contains region-location metadata and contains no “real” row data.
Pretty diagrams from Lars George http://goo.gl/wRLJP & http://goo.gl/6ehnV
Writing to HBase
1) HBase client gets the assigned RegionServers (and Regions) from the Master server for the particular keys (rows) in question and sends commands/data
2) Transaction is written to write-ahead-log on HDFS (disk) first
3) Same data is written to in-memory store for the assigned region (row group)
4) In-memory store is periodically flushed to HDFS (disk) when size reaches threshold
HDFS
HDFS
Pretty diagrams from Lars George http://goo.gl/wRLJP & http://goo.gl/6ehnV
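The log-first write path above can be sketched as a toy in plain Java: append to a write-ahead log, apply to an in-memory store, flush when a threshold is hit. Purely illustrative — real HBase appends the WAL and flushes HFiles to HDFS, and the threshold is bytes, not entry count:

```java
import java.util.*;

// Toy sketch of the HBase write path: WAL append first (so a crash can be
// replayed), then the in-memory store, then a periodic flush to an
// immutable "file" once the memstore reaches a size threshold.
public class ToyWritePath {
    private final List<String> wal = new ArrayList<>();                 // step 2: "disk" log
    private final SortedMap<String, String> memstore = new TreeMap<>(); // step 3: memory
    private final List<SortedMap<String, String>> flushedFiles = new ArrayList<>();
    private final int flushThreshold;  // entry count here; bytes in real HBase

    public ToyWritePath(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);   // log first...
        memstore.put(rowKey, value);     // ...then apply in memory
        if (memstore.size() >= flushThreshold) flush();  // step 4: periodic flush
    }

    private void flush() {
        flushedFiles.add(new TreeMap<>(memstore));  // immutable snapshot
        memstore.clear();
    }

    public int flushedFileCount() { return flushedFiles.size(); }
    public int memstoreSize() { return memstore.size(); }
    public int walSize() { return wal.size(); }
}
```

Because every edit hits the WAL before the memstore, losing a RegionServer loses no acknowledged writes — the log is replayed on reassignment.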
HBase Scalability
Additional RegionServers can be added to the live system. The master server will then rebalance the cluster to migrate Regions onto the new RegionServers
Moreover, additional HDFS DataNodes can be added to give more disk space to the HDFS cluster
Pretty diagrams from Lars George http://goo.gl/wRLJP & http://goo.gl/6ehnV
HBase Demonstration
Hive
0 Data warehouse infrastructure on top of Hadoop Core
> Stores data on HDFS
> Allows you to add custom MapReduce plugins
0 HiveQL
> SQL-like language pretty close to ANSI SQL
# Supports joins
> JDBC driver exists
0 Has interactive shell (like MySQL & PostgreSQL) to run interactive queries
Hive
0 When running a HiveQL query/script, in the background Hive creates and runs a series of MapReduce jobs to produce the results
> BigData means it can take a long time to run queries
0 Therefore, it’s good for offline BigETL, but not a good replacement for an OLTP/OLAP data warehouse (like Oracle)
0 Learn more from wiki: http://bit.ly/epauio
> SHOW TABLES;
> CREATE TABLE rating (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
> DESCRIBE rating;
Other useful utilities around Hadoop
0 Sqoop (http://bit.ly/eRfVEJ)
> Load SQL data from a table into HDFS or Hive
> Generates Java classes to interact with the loaded data
0 Oozie (http://bit.ly/eNLi3B)
> Orchestrates complex workflows around multiple MapReduce jobs
0 Mahout (http://bit.ly/hCXRjL)
> Algorithm library for collaborative filtering, clustering, classifiers, and machine learning
0 Cascading (http://bit.ly/gyZNiI)
> Data query abstraction layer similar to Pig
> Java API that sits on top of MapReduce framework
> Since it’s a Java API you can use it with any program that uses a JVM language: Groovy, Scala, Clojure, jRuby, jython, etc.
What about support?
0 Community, wikis, forums, IRC
0 Cloudera provides enterprise support
> Offerings:
# Cloudera Enterprise
# Support, professional services, training, management apps
> Cloudera Distribution of Hadoop (CDH)
# Tested and hardened version of Hadoop products plus some other goodies (oozie, flume, hue, sqoop, whirr)
~ Separate codebase, but patches are made to and from the Apache versions
# Packages: debian, redhat, EC2, VM
if you want to try Hadoop, CDH is probably the way to go.
I recommend this instead of downloading each project individually.
Who uses this stuff?
and many more
Where the heck can I use this stuff?
0 The hardest part is finding the right use-cases to apply Hadoop (and any NoSQL system)
> SQL databases are great for data that fits on one machine
> Lots of tooling support for SQL; not as much for Hadoop (yet)
0 A few questions to think about:
> How much data are you processing?
> Are you throwing away valuable data due to space?
> Are you processing data where steps aren’t interdependent?
0 Log storage, log processing, utility data, research data, biological data, medical records, events, mail, tweets, market data, financial data
The Law Of the Instrument
“It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” -Abraham Maslow
?’s