intro to the hadoop stack @ april 2011 javamug
DESCRIPTION
Covers high-level concepts of different pieces of the Hadoop project: HDFS, MapReduce, HBase, Hive, Pig & ZooKeeper
TRANSCRIPT
About Me
Meetup organizer for DFWBigData.org
> Hadoop, Cassandra, and all other things BigData and NoSQL
> Join up!
Sr. Consultant @
> Rapidly growing national IT consulting firm focused on career development while operating within a local-office project model
@engfer
What is Hadoop?
0 “framework for running [distributed] applications on large cluster built of commodity hardware” –from Hadoop Wiki
0 Originally created by Doug Cutting
> Named the project after his son’s toy elephant
0 The name “Hadoop” has now evolved to cover a family of products, but at its core, it’s essentially just the MapReduce programming paradigm + a distributed file system
Marty McFly?
History
>_< Growing Pains +
Jeffrey Dean: lots of data + tape backup + expensive servers + high network bandwidth + expensive databases + non-linear scalability + etc. (http://bit.ly/ec31VL + http://bit.ly/gq84Ot)
History
>_< Growing Pains + +
Solutions
History
>_< Growing Pains + +
Solutions
White Papers: Google File System • 2003
MapReduce • 2004
BigTable • 2006
History
Hadoop Core
c. 2005
Hadoop Distributed File System (HDFS)
0 OSS implementation of Google File System (bit.ly/ihXkof)
0 Master/slave architecture
0 Designed to run on commodity hardware
0 Hardware failures assumed in design
0 Fault-tolerant via replication
0 Semi-POSIX compliance; relaxed for performance
0 Unix-like permissions; ties into host’s users & groups
Hadoop Distributed File System (HDFS)
0 Written in Java
0 Optimized for larger files
0 Focus on streaming data (high-throughput > low-latency)
0 Rack-aware
0 Only *nix for production env.
0 Web consoles for stats
HDFS Client API’s
0 “Shell-like” commands (hadoop dfs [cmd])
> cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp, du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal, mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
0 Native Java API
0 API for other languages (http://bit.ly/fLgCJC)
> C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml
Other HDFS Admin Tools
0 hadoop dfsadmin [opts]
> Basic admin utilities for the DFS cluster
> Change file-level replication factors, set quotas, upgrade, safemode, reporting, etc
0 hadoop fsck [opts]
> Runs distributed file system checking and fixing utility
0 hadoop balancer
> Utility that rebalances block storage across the nodes
HDFS Node Types
NameNode (Master)
0 Single node responsible for:
> Filesystem metadata operations on cluster
> Replication and locations of file blocks
0 SPOF =(
CheckpointNode or BackupNode (backups)
0 Nodes responsible for:
> NameNode backup mechanisms
DataNode (Slaves)
0 Nodes responsible for:
> Storage of file blocks
> Serving actual file data to clients
HDFS Architecture
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
(namespace backups)
FS/namespace/meta ops
serving data -->
(heartbeats, balancing, replication, etc)
nodes write to local disk
HDFS Architecture
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
data Xfer
(block locations, FS ops, etc) <No file data!!>
Putting files on HDFS
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
return block size and nodes for each block
client buffers blocks to local disk… {64MB}
(based on “replication factor”) (3 by default)
Putting files on HDFS
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
While buffering to local disk, the client Xfers block directly
to assigned data nodes
{node1, node2, node3}
(based on “replication factor”)
Putting files on HDFS
NameNode BackupNode
DataNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
{node1, node3, node5}
DataNode DataNode DataNode DataNode
While buffering to local disk, the client Xfers block directly
to assigned data nodes
Putting files on HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
{node1, node4, node5}
DataNode DataNode DataNode DataNode DataNode
While buffering to local disk, the client Xfers block directly
to assigned data nodes
Putting files on HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
{node2, node3, node4}
DataNode DataNode DataNode DataNode DataNode
While buffering to local disk, the client Xfers block directly
to assigned data nodes
DataNode DataNode DataNode DataNode DataNode
Putting files on HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
{node2, node4, node5}
While buffering to local disk, the client Xfers block directly
to assigned data nodes
DataNode DataNode DataNode DataNode DataNode
Putting files on HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
Ad nauseam…
DataNode DataNode DataNode DataNode DataNode
Getting files from HDFS
NameNode BackupNode
HDFS Client
Giant File:
11001010100101010010101001100101010010101001010100110010101001010100101010011001010100101010010101001100101010010101001010100101101...
return locations of blocks for file
Stream blocks from data nodes
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
NameNode detects DataNode loss
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
Blocks are auto-replicated on remaining nodes to satisfy replication factor
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
Blocks are auto-replicated on remaining nodes to satisfy replication factor
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
Blocks are auto-replicated on remaining nodes to satisfy replication factor
Fault Tolerance?
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
NameNode loss = FAIL (requires manual intervention)
**automatic failover is in the works
not an EPIC fail because you have the backup node to replay
any FS operations
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
NameNode detects new DataNode is added to cluster
DataNode
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode
Blocks are re-balanced and re-distributed
DataNode
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
Blocks are re-balanced and re-distributed
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
Blocks are re-balanced and re-distributed
Live horizontal scaling and rebalancing
NameNode
DataNode DataNode DataNode
BackupNode
DataNode DataNode
Once replication factor is satisfied, extra replicas are removed
HDFS Demonstration
Other HDFS Utils
0 HDFS Raid (http://bit.ly/fqnzs5)
> Uses distributed RAID instead of replication (useful at Petabyte scale)
0 Flume/Scribe/Chukwa
> Log collection and aggregation frameworks that support streaming log data to HDFS
> Flume = Cloudera (http://bit.ly/gX8LeO)
> Scribe = Facebook (http://bit.ly/dIh3If)
from flume wiki
MapReduce
0 Distributed programming paradigm and framework that is the OSS implementation of Google’s MapReduce (http://bit.ly/gXZbsk)
0 Modeled using the ideas behind functional programming map() and reduce() operations
> Distributed on as many nodes as you would like
0 2-phase process:
> map( ) = sub-divide & conquer
> reduce( ) = combine & reduce cardinality
MapReduce ABC’s
0 Essentially, it’s…
1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output from all sub-problems
0 Ex: Searching
1. Take a large problem and divide it into sub-problems
# Different groups of rows in DB; different parts of files; 1 user from a list of users; etc.
2. Perform the same function on all sub-problems
# Search for a key in the given partition of data for the sub-problem; count words; etc.
3. Combine the output from all sub-problems
# Combine the results into a result-set and return to the client
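The three steps above can be sketched in plain Java — no Hadoop classes, just standard collections — with the "divide" as a list of input partitions, the "map" emitting (word, 1) pairs, and the "combine" grouping by key and summing. This is only a single-process illustration of the paradigm, not Hadoop's actual API:

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of divide / map / combine using plain Java collections.
public class ToyMapReduce {

    public static Map<String, Integer> wordCount(List<String> partitions) {
        return partitions.stream()
                // map phase: each partition emits its words (conceptually (word, 1) pairs)
                .flatMap(part -> Arrays.stream(part.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                // shuffle/sort + reduce phase: group by key and sum the counts
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("foo bar foo", "baz foo bar");
        System.out.println(wordCount(parts)); // {bar=2, baz=1, foo=3}
    }
}
```

In real Hadoop the partitions live on different DataNodes and each map() runs where its data is, but the key/value flow is the same.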
M/R Facts
0 M/R is excellent for problems where the “sub-problems” are not interdependent
> For example, the output of one “mapper” should not depend on the output or communication with another “mapper”
0 The reduce phase does not begin execution until all mappers have finished
0 Failed map and reduce tasks get auto-restarted
0 Rack/HDFS-aware
MapReduce Visualized
Mapper
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Input
<keyi, valuei>
<keyi, valuei>
<keyi, valuei>
<keyi, valuei>
<keyA, valuea> <keyB, valueb> <keyC, valuec> …
<keyA, valuea> <keyB, valueb> <keyC, valuec> …
<keyA, valuea> <keyB, valueb> <keyC, valuec> …
<keyA, valuea> <keyB, valueb> <keyC, valuec> …
<keyA, list(valuea,valueb, valuec,…)>
<keyB, list(valuea,valueb, valuec,…)>
<keyC, list(valuea,valueb, valuec,…)>
Sort and
group by
key
Output
Input
Example: Word Count
Mapper
Mapper
Mapper
Mapper
Reducer
Reducer
Reducer
Input
<?, file1_part1>
<?, file2_part2>
<?, file1_part2>
<?, file2_part1>
<“foo”, 3> <“bar”, 14> <“baz”, 6> …
<“foo”, 21> <“bar”, 78> <“baz”, 12> …
<“foo”, 11> <“bar”, 22> <“baz”, 31> …
<“foo”, 1> <“bar”, 41> <“baz”, 10> …
<“foo”, (3, 21, 11, 1)>
<“bar”, (14, 78, 22, 41)>
<“baz”, (6, 12, 31, 10)>
Sort and
group by
key
bar,155 baz,59 foo,36 …
Lots of Big Files
count()
count()
count()
count()
sum()
sum()
sum()
Hadoop’s MapReduce
0 MapReduce tasks are submitted as a “job”
> Jobs can be assigned to a specified “queue” of jobs
# By default, jobs are submitted to the “default” queue
> Job submission is controlled by ACL’s for each queue
0 Rack-aware and HDFS-aware
> The JobTracker communicates with the HDFS NameNode and schedules map/reduce operations using input data locality on HDFS DataNodes
M/R Nodes
JobTracker (Master)
0 Single node responsible for:
> Coordinating all M/R tasks & events
> Managing job queues and scheduling
> Maintains and controls TaskTrackers
> Moves/restarts map/reduce tasks if needed
0 SPOF =(
> Uses “checkpointing” to combat this
TaskTracker (Slaves)
0 Worker nodes responsible for:
> Executing individual map and reduce tasks as assigned by JobTracker (in separate JVM)
Conceptual Overview
JobTracker
TaskTracker TaskTracker TaskTracker TaskTracker
Temporary data stored on HDFS
JobTracker controls and heartbeats TaskTracker nodes
TaskTrackers store temp data on HDFS
Job Submission
JobTracker
TaskTracker TaskTracker TaskTracker TaskTracker
Temporary data stored on HDFS
submit jobs to JobTracker M/R Client
M/R Client
M/R Client
Mapper Mapper Mapper Mapper
jobs get queued
map()’s are assigned to TaskTrackers (HDFS DataNode locality aware)
mappers store results on HDFS
mappers spawned in separate JVM and execute
Job Submission
JobTracker
TaskTracker
Reducer
TaskTracker
Reducer
TaskTracker
Reducer
TaskTracker
Reducer
Temporary data stored on HDFS
submit jobs to JobTracker M/R Client
M/R Client
M/R Client jobs get queued
reduce phase begins
tmp data read from HDFS
MapReduce Tips
0 Keys and values can be any type of object
> Can specify custom data splitters, partitioners, combiners, InputFormat’s, and OutputFormat’s
0 Use ToolRunner.run(Tool) to run your Java jobs…
> Will use GenericOptionsParser and DistributedCache so that the -files, -libjars, & -archives options are available to distribute your mappers, reducers, and any other dependencies
> Without this, your mappers, reducers, and other utilities will not be propagated and added to the classpath of the other nodes (ClassNotFoundException)
MapReduce Demonstration
Other M/R Utils
0 $HADOOP_HOME/contrib/*
> PriorityScheduler & FairScheduler
> HOD (Hadoop On Demand)
# Uses TORQUE resource manager to dynamically allocate, use, and destroy MapReduce clusters on an as-needed basis
# Great for development and testing
> Hadoop Streaming (next slide...)
0 Amazon’s Elastic MapReduce (EMR)
> Essentially production HOD for EC2 data/clusters
Hadoop Streaming
0 Allows you to write MapReduce jobs in languages other than Java by running any command line process
> Input data is partitioned and given to the standard input (STDIN) of the command line mappers and reducers specified
> Output (STDOUT) from the command line mappers and reducers gets combined into the M/R pipeline
0 Can specify custom partitioners and combiners
0 Can specify files & archives to propagate to all nodes and unpack on local file system (-archives & -file)
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar
-D mapred.job.name="Foo bar"
-archives 'hdfs://hadoop1/foo/bar/cachedir.jar'
-input "/foo/bar/input.txt"
-mapper splitz.py
-reducer /bin/wc
-output "/foo/baz/out"
-file ~/scripts/splitz.py
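The streaming contract is just "read records from STDIN, write key<TAB>value lines to STDOUT". A mapper can therefore be any executable; as a sketch, here is a hypothetical stand-in (in Java, for consistency with the rest of the talk) for a word-splitting mapper like the splitz.py above:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Minimal streaming mapper: reads lines from STDIN and emits one
// word<TAB>1 pair per word to STDOUT. Illustrative only -- any language
// that can read stdin and write stdout works the same way.
public class SplitMapper {

    // Pure function so the per-line behavior is easy to see (and test).
    public static String mapLine(String line) {
        StringBuilder out = new StringBuilder();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) out.append(word).append('\t').append(1).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.print(mapLine(line));
        }
    }
}
```

Hadoop streaming then sorts these lines by key and feeds each key group to the reducer's STDIN.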
Pig
0 Framework and language (Pig Latin) for creating and submitting Hadoop MapReduce jobs
0 Common data operations (not supported by POJO-M/R) like join, group, filter, sort, select, etc. are provided
0 Don’t need to know Java
0 Removes boilerplate aspect from M/R
> 200 lines in Java → 15 lines in Pig!
0 Relational qualities (reads and feels SQL-ish)
Pig
0 Fact from Wiki: 40% of Yahoo’s M/R jobs are in Pig
0 Interactive shell (grunt) exists
0 User Defined Functions (UDF)
> Allows you to specify Java code where the logic may be too complex for Pig Latin
> UDF’s can be part of most every operation in Pig Latin
> Great for loading and storing custom formats as well as transforming data
Pig Relational Operations
COGROUP
CROSS
DISTINCT
FILTER
FOREACH
GROUP
JOIN
LIMIT
LOAD
MAPREDUCE
ORDER BY
SAMPLE
SPLIT
STORE
STREAM
UNION
most of these are pretty self-explanatory
Example Pig Script
01: REGISTER ./tutorial.jar;
02: raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
03: clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
04: clean2 = FOREACH clean1 GENERATE user, time,
org.apache.pig.tutorial.ToLower(query) as query;
05: houred = FOREACH clean2 GENERATE user,
org.apache.pig.tutorial.ExtractHour(time) as hour, query;
06: ngramed1 = FOREACH houred GENERATE user, hour,
flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
07: ngramed2 = DISTINCT ngramed1;
08: hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
09: hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0),
COUNT($1) AS count;
10: hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 as ngram,
$1 as hour, $2 as count;
11: hour00 = FILTER hour_frequency2 BY hour eq '00';
12: hour12 = FILTER hour_frequency3 BY hour eq '12';
13: same = JOIN hour00 BY $0, hour12 BY $0;
14: same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram as
ngram, $2 as count00, $5 as count12;
15: STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();
Taken from Pig tutorial on Pig wiki: The Temporal Query Phrase Popularity script processes a search query log file from the Excite search engine and compares the frequency of occurrence of search phrases across two time periods separated by twelve hours.
UDF’s
Now... imagine this equivalent in Java...
ZooKeeper
0 Centralized coordination service for use by distributed applications
> Configuration, naming, synchronization (locks), ownership (master election), etc.
0 Important system guarantees:
> Sequential consistency (great for locking)
> Atomicity – all or nothing at all
> Data consistency – all clients view same system state regardless of the server it connects to
ZooKeeper Service
Server Server Server Server Server
Leader!
Client Client Client Client Client Client Client Client
ZooKeeper
0 Hierarchical namespace of “znodes” (like directories)
0 Operations:
> create a node at a location in the tree
> delete a node
> exists - tests if a node exists at a location
> get data from a node
> set data on a node
> get children from a node
> sync - waits for data to be propagated
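The operations above act on a hierarchical namespace of znodes, each of which can hold a small blob of data. As a toy in-memory model (this is NOT the real ZooKeeper client API, just a map from slash-separated paths to byte arrays to make the tree shape concrete):

```java
import java.util.*;

// Toy model of ZooKeeper's znode namespace: paths like /locks/lock-0001,
// each znode carrying data, with create/exists/get/set/delete/getChildren.
public class ToyZnodeTree {
    private final NavigableMap<String, byte[]> nodes = new TreeMap<>();

    public ToyZnodeTree() { nodes.put("/", new byte[0]); }  // root znode

    public void create(String path, byte[] data) { nodes.put(path, data); }

    public boolean exists(String path) { return nodes.containsKey(path); }

    public byte[] getData(String path) { return nodes.get(path); }

    public void setData(String path, byte[] data) { nodes.put(path, data); }

    public void delete(String path) { nodes.remove(path); }

    // children = direct descendants only (exactly one extra path segment)
    public List<String> getChildren(String path) {
        String prefix = path.endsWith("/") ? path : path + "/";
        List<String> kids = new ArrayList<>();
        for (String p : nodes.keySet()) {
            if (p.startsWith(prefix) && !p.equals(path)
                    && p.indexOf('/', prefix.length()) < 0) {
                kids.add(p.substring(prefix.length()));
            }
        }
        return kids;
    }

    public static void main(String[] args) {
        ToyZnodeTree zk = new ToyZnodeTree();
        zk.create("/locks", new byte[0]);
        zk.create("/locks/lock-0001", "owner-a".getBytes());
        System.out.println(zk.getChildren("/locks")); // [lock-0001]
    }
}
```

The lock example hints at how master election works: contenders create child znodes under a well-known parent, and the lowest-numbered child wins.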
HBase
0 Sparse, non-relational, column-oriented distributed database built on top of Hadoop Core (HDFS + MapReduce)
0 Modeled after Google’s BigTable (http://bit.ly/fQ1NMA)
0 NoSQL
0 HBase also has:
> Strong consistency model
> In-memory operation
> LZO compression (optional)
> Live migrations
> MapReduce support for querying
Not Only SQL... ...not “SQL is terrible”
What HBase Is…
0 Good at fast/streaming writes
0 Fault tolerant
0 Good at linear horizontal scalability
0 Very efficient at managing billions of rows and millions of columns
0 Good at keeping row history
0 Good at auto-balancing
0 A complement to a SQL DB/warehouse
0 Great with non-normalized data
What HBase Is NOT…
0 Made for table joins
0 Made for splitting into normalized tables (see previous)
0 A complete replacement for a SQL relational database
0 A complete replacement for a SQL data warehouse
0 Great for storing small amounts of data
0 Great for storing gobs of large binary data
0 The best way to do OLTP
0 The best way to do live ad-hoc querying of any column
0 A replacement for a proper caching mechanism
0 ACID compliant (http://bit.ly/hhFXCS)
HBase Facts
0 Written in Java
0 Uses ZooKeeper to store metadata and -ROOT- region
0 Column-oriented store = flexible schema
> Can alter the schema simply by adding the column name and data on insert (“put”)
> No schema migrations!
0 Every column has a timestamp associated with it
> Same column with most recent timestamp wins
0 Can export metrics for use with Ganglia, or as JMX
0 hbase hbck
> Check for errors and fix them (like HDFS fsck)
HBase Client API’s
0 jRuby interactive shell (hbase shell)
> DDL/DML commands
> Admin commands
> Cluster commands
0 Java API (http://bit.ly/ij0MgF)
0 REST API
> Provided using Stargate
0 API for other languages (http://bit.ly/fLgCJC)
Column-Oriented?
0 Traditional RDBMSs use row-oriented storage, which stores entire rows sequentially on disk
0 Whereas column-oriented storage stores only the columns present for each row (or column-families) sequentially on disk
Row-oriented: [Row 1 – Cols 1-3] [Row 2 – Cols 1-3] [Row 3 – Cols 1-3]
Column-oriented: [Row 1 – Col 1] [Row 2 – Col 1] [Row 3 – Col 1] [Row 1 – Col 2] [Row 2 – Col 2] [Row 3 – Col 2] [Row 1 – Col 3] [Row 3 – Col 3]
Where’s Row 2 – Col 2? Not needed because columns are stored sequentially, so rows have flexible schema!
Think of HBase Tables As…
0 More like JSON
> And less like spreadsheets
{
"1" : {
"A" : { v: "x", ts: 4282 },
"B" : { v: "z", ts: 4282 }
},
"aaaaa" : {
"A" : { v: "y", ts: 4282 }
},
"xyz" : {
"address" : {
"line1" : { v: "hello", ts: 4282 },
"line2" : { v: "there", ts: 4282 },
"line2" : { v: "there", ts: 1234 }
},
"fooo" : { v: "wow!", ts: 4282 }
},
"zzzzz" : {
"A" : { v: "woot", ts: 4282 },
"B" : { v: "1337", ts: 4282 }
}
}
Modified from http://bit.ly/hbGWIG
column families allow grouping of columns (faster retrieval)
recent TS = default col value old TS
row id
columns
value & timestamp (TS)
flexible schema
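The JSON-like structure above is essentially nested maps: row id → column family:column → timestamp → value, where a plain read returns the value with the most recent timestamp. A toy model (not real HBase, and no column-family grouping on disk — just the data shape):

```java
import java.util.*;

// Toy model of an HBase table: rowId -> "family:qualifier" -> (ts -> value),
// sorted so the newest timestamp comes first and wins on a default read.
public class ToyHTable {
    private final Map<String, Map<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();  // rows sorted by row key, like HBase

    public void put(String row, String col, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(col, c -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);  // old versions are kept, keyed by timestamp
    }

    // default read: the most recent version of the cell wins
    public String get(String row, String col) {
        Map<String, NavigableMap<Long, String>> cols = rows.get(row);
        if (cols == null || !cols.containsKey(col)) return null;  // sparse: absent is fine
        return cols.get(col).firstEntry().getValue();
    }

    public static void main(String[] args) {
        ToyHTable t = new ToyHTable();
        t.put("xyz", "address:line2", 1234L, "there");
        t.put("xyz", "address:line2", 4282L, "THERE");
        System.out.println(t.get("xyz", "address:line2")); // THERE (newest ts)
    }
}
```

Note how a missing cell simply has no entry — nothing is stored for it, which is what "sparse" means in the BigTable model.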
HBase Overview
The Master server keeps track of the metadata for RegionServers and their containing Regions and stores it in ZooKeeper
Data is sent using the client
The HBase client communicates with the ZooKeeper cluster only to get Region information; no data is sent through the Master
The actual row “data” (bytes) is sent directly to and from the RegionServers
Therefore, neither the Master server nor the ZooKeeper cluster serves as a data bottleneck
Pretty diagrams from Lars George http://goo.gl/wRLJP & http://goo.gl/6ehnV
HBase Overview
HDFS breaks files into 64MB chunks and replicates the chunks N times (3 by default) to store on “actual” disk (giving HBase its fault tolerance)
All HBase data (HLog and HFiles) are stored on HDFS
Pretty diagrams from Lars George http://goo.gl/wRLJP
Understanding HBase
Tables are split into contiguous ranges of rows (split size is configurable) called Regions
HRegions Table
Regions are assigned to particular RegionServers by the Master server. The Master only contains region-location metadata and contains no “real” row data.
Pretty diagrams from Lars George http://goo.gl/wRLJP & http://goo.gl/6ehnV
Writing to HBase
1) HBase client gets the assigned RegionServers (and Regions) from the Master server for the particular keys (rows) in question and sends commands/data
2) Transaction is written to write-ahead-log on HDFS (disk) first
3) Same data is written to in-memory store for the assigned region (row group)
4) In-memory store is periodically flushed to HDFS (disk) when size reaches threshold
HDFS
HDFS
Pretty diagrams from Lars George http://goo.gl/wRLJP & http://goo.gl/6ehnV
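The log-first write path above can be sketched as a toy in plain Java: append to a write-ahead log, apply to an in-memory store, flush when a threshold is hit. Purely illustrative — real HBase appends the WAL and flushes HFiles to HDFS, and the threshold is bytes, not entry count:

```java
import java.util.*;

// Toy sketch of the HBase write path: WAL append first (so a crash can be
// replayed), then the in-memory store, then a periodic flush to an
// immutable "file" once the memstore reaches a size threshold.
public class ToyWritePath {
    private final List<String> wal = new ArrayList<>();                 // step 2: "disk" log
    private final SortedMap<String, String> memstore = new TreeMap<>(); // step 3: memory
    private final List<SortedMap<String, String>> flushedFiles = new ArrayList<>();
    private final int flushThreshold;  // entry count here; bytes in real HBase

    public ToyWritePath(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);   // log first...
        memstore.put(rowKey, value);     // ...then apply in memory
        if (memstore.size() >= flushThreshold) flush();  // step 4: periodic flush
    }

    private void flush() {
        flushedFiles.add(new TreeMap<>(memstore));  // immutable snapshot
        memstore.clear();
    }

    public int flushedFileCount() { return flushedFiles.size(); }
    public int memstoreSize() { return memstore.size(); }
    public int walSize() { return wal.size(); }
}
```

Because every edit hits the WAL before the memstore, losing a RegionServer loses no acknowledged writes — the log is replayed on reassignment.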
HBase Scalability
Additional RegionServers can be added to the live system. The master server will then rebalance the cluster to migrate Regions onto the new RegionServers
Moreover, additional HDFS DataNodes can be added to give more disk space to the HDFS cluster
Pretty diagrams from Lars George http://goo.gl/wRLJP & http://goo.gl/6ehnV
HBase Demonstration
Hive
0 Data warehouse infrastructure on top of Hadoop Core
> Stores data on HDFS
> Allows you to add custom MapReduce plugins
0 HiveQL
> SQL-like language pretty close to ANSI SQL
# Supports joins
> JDBC driver exists
0 Has interactive shell (like MySQL & PostgreSQL) to run interactive queries
Hive
0 When running a HiveQL query/script, in the background Hive creates and runs a series of MapReduce jobs to produce the results
> BigData means it can take a long time to run queries
0 Therefore, it’s good for offline BigETL, but not a good replacement for an OLTP/OLAP data warehouse (like Oracle)
0 Learn more from wiki: http://bit.ly/epauio
> SHOW TABLES;
> CREATE TABLE rating (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
> DESCRIBE rating;
Other useful utilities around Hadoop
0 Sqoop (http://bit.ly/eRfVEJ)
> Load SQL data from a table into HDFS or Hive
> Generates Java classes to interact with the loaded data
0 Oozie (http://bit.ly/eNLi3B)
> Orchestrates complex workflows around multiple MapReduce jobs
0 Mahout (http://bit.ly/hCXRjL)
> Algorithm library for collaborative filtering, clustering, classifiers, and machine learning
0 Cascading (http://bit.ly/gyZNiI)
> Data query abstraction layer similar to Pig
> Java API that sits on top of MapReduce framework
> Since it’s a Java API you can use it with any program that uses a JVM language: Groovy, Scala, Clojure, jRuby, jython, etc.
What about support?
0 Community, wikis, forums, IRC
0 Cloudera provides enterprise support
> Offerings:
# Cloudera Enterprise
# Support, professional services, training, management apps
> Cloudera Distribution of Hadoop (CDH)
# Tested and hardened version of Hadoop products plus some other goodies (oozie, flume, hue, sqoop, whirr)
~ Separate codebase, but patches are made to and from the Apache versions
# Packages: debian, redhat, EC2, VM
if you want to try Hadoop, CDH is probably the way to go.
I recommend this instead of downloading each project individually.
Who uses this stuff?
and many more
Where the heck can I use this stuff?
0 The hardest part is finding the right use-cases to apply Hadoop (and any NoSQL system)
> SQL databases are great for data that fits on one machine
> Lots of tooling support for SQL; not as much for Hadoop (yet)
0 A few questions to think about:
> How much data are you processing?
> Are you throwing away valuable data due to space?
> Are you processing data where steps aren’t interdependent?
0 Log storage, log processing, utility data, research data, biological data, medical records, events, mail, tweets, market data, financial data
The Law Of the Instrument
“It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” -Abraham Maslow
?’s