TRANSCRIPT
Driving New Value from Big Data Investments
An Introduction to Using R with Hadoop
Jeffrey Breen, Principal, Think Big Academy
[email protected] · http://www.thinkbigacademy.com/
February 2013
Building Modern Analytics Solutions to Monetize Big Data Investments
• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering
Leading Provider of Innovative Big Analytics Services
We Accelerate Your Time to Value
THINK BIG Analytics Methodology: experiment-driven short projects with nimble test-solution cycles
• Breaking down business and IT barriers
• Discrete projects with a beginning and an end
• Early releases to validate ROI and ensure long-term success
IMAGINE → ILLUMINATE → IMPLEMENT: Innovation and Value
Enable Your IT Staff with New Skills
• Expert training/courses, e.g. Hadoop Developer; HBase; Pig and Hive for Modelers
• Joint application development
• Side-by-side mentoring
[Diagram: role transitions: Data Architect → Big Data Architect; Database Administrator → Big Data Administrator; Business Analyst → Data Science Math Modeler; Developers → Big Data Engineering; Big Data Monitoring]
ILLUMINATE: Training and Education
• Build capabilities to manage the rapid innovation needed with big data
• Invest in and scale skills to create a data-driven organization
THINK BIG Analytics
Agenda
• Why R?
• What is Hadoop?
• Counting words with MapReduce
• Writing MapReduce jobs with RHadoop
• Data Warehousing with Hive
• Big Data ≠ Hadoop
• Want to learn more?
• Q&A
Why R?
Revolution Confidential
[Image slides: http://thebalancedguy.blogspot.com/2010/09/with-3-boys-and-having-been-cub-scout.html and the Wenger Giant Knife, http://www.wengerna.com/giant-knife-16999]
Number of R Packages Available
How many R packages are there now? At the R command line, enter:
> dim(available.packages())
Slide courtesy of John Versotek, organizer of the Boston Predictive Analytics Meetup
What is Hadoop?
2003: Google File System is the storage.
2004: MapReduce is the framework.
Enter Hadoop
About this time, Doug Cutting, the creator of Lucene, was working on Nutch.
Nutch Timeline
Year       Topics
2003       Google's GFS paper.
2004       Nutch Distributed File System (NDFS).
2004       Google's MapReduce paper.
2004-2005  Nutch MapReduce implementation.
Hadoop Timeline
Year  Topics
2006  NDFS and Nutch MapReduce extracted into the separate Apache Hadoop project.
2008  Hadoop becomes a top-level Apache project. Yahoo! announces a 10,000-core cluster.
Hadoop Design Goals
• Optimize disk I/O performance: minimize disk head seeks!
• Redundant data storage and processing to eliminate many kinds of data loss.
• Horizontal scalability.
• Run on commodity, server-class hardware.
[Slide from Jeff Dean, based on Peter Norvig's http://norvig.com/21-days.html]
What is Hadoop?
• An open source project designed to support large-scale data processing
• Inspired by Google's MapReduce-based computational infrastructure
• Comprised of several components:
  - Hadoop Distributed File System (HDFS)
  - MapReduce processing framework, job scheduler, etc.
  - Ingest/outgest services (Sqoop, Flume, etc.)
  - Higher-level languages and libraries (Hive, Pig, Cascading, Mahout)
• Written in Java, first opened up to alternatives through its Streaming API → if your language of choice can handle stdin and stdout, you can use it to write MapReduce jobs
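That stdin/stdout contract is easy to picture. As a sketch (not from the slides), here is roughly what a word-count job looks like as a raw Hadoop Streaming script in Python; Hadoop runs the mapper over the input, sorts its output by key, then runs the reducer:

```python
import sys
from itertools import groupby

def mapper(lines):
    # Map: emit one tab-separated "word<TAB>1" record per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop sorts and shuffles mapper output by key, so all records
    # for a given word arrive at the reducer contiguously.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Pick a phase by argument, e.g. to test the pipeline locally:
    #   cat input.txt | wordcount.py map | sort | wordcount.py reduce
    phase = mapper if sys.argv[1] == "map" else reducer
    for record in phase(sys.stdin):
        print(record)
```

The script name and the local `... | sort | ...` pipeline are illustrative; on a cluster the same two phases would be wired up with the hadoop-streaming jar's `-mapper` and `-reducer` options.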
Hadoop cluster components
[Diagram from Think Big Academy's Hadoop Developer Course. Client servers submit MapReduce jobs via Hive, Pig, cron+bash, Azkaban, etc.; ingest/outgest services (Sqoop, Scribe, ...) move data between SQL stores, logs, and the cluster, alongside monitoring and management tools. The primary master server runs the Job Tracker and Name Node, a secondary master server runs the Secondary Name Node, and each slave server runs a Task Tracker and a Data Node over its local disks.]
Hadoop's distributed file system
• Services: Name Node, Data Nodes
• 64 MB blocks
• 3x replication
[Same cluster diagram, highlighting the Name Node on the primary master server and the Data Nodes on the slave servers; from Think Big Academy's Hadoop Developer Course.]
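Those two numbers determine a file's physical footprint. A back-of-the-envelope sketch (my illustration, not from the course material), assuming the 64 MB block size and 3x replication above:

```python
import math

BLOCK_SIZE_BYTES = 64 * 1024 * 1024   # HDFS default block size cited above
REPLICATION = 3                       # default replication factor cited above

def hdfs_footprint(file_bytes):
    """Blocks allocated and raw bytes consumed for a single file."""
    blocks = math.ceil(file_bytes / BLOCK_SIZE_BYTES)
    raw_bytes = file_bytes * REPLICATION  # every block is stored 3 times
    return blocks, raw_bytes

# A 1 GB file splits into 16 blocks and consumes 3 GB of raw cluster disk.
blocks, raw_bytes = hdfs_footprint(1024 ** 3)
```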
Counting words with MapReduce
True confession: I was wrong about MapReduce
• When the Google paper was published in 2004, I was running a typical enterprise IT department
• Big hardware (Sun, EMC) + big applications (Siebel, PeopleSoft) + big databases (Oracle, SQL Server) = big licensing & support costs
• Loved the scalability, COTS components, and price, but missed the fact that keys (and values) could be compound & complex
• ... and examples like Wordcount didn't help!
Source: Hadoop: The Definitive Guide, Second Edition, p. 20
Copyright © 2011-2013, Think Big Analytics, All Rights Reserved

Hadoop uses MapReduce: we need to convert the Input into the Output.

Input (three lines of text):
  There is a Map phase
  Hadoop uses MapReduce
  There is a Reduce phase

Mappers receive each line keyed by its offset, (N, "…"), and emit a (word, 1) pair for every word:
  (there, 1), (is, 1), (a, 1), (map, 1), (phase, 1)
  (hadoop, 1), (uses, 1), (mapreduce, 1)
  (there, 1), (is, 1), (a, 1), (reduce, 1), (phase, 1)

[Image: http://blog.stackoverflow.com/wp-content/uploads/then-a-miracle-occurs-cartoon.png]

The Sort, Shuffle step partitions the pairs by key (here into the ranges 0-9,a-l; m-q; r-z) and groups the values for each key:
  (a, [1,1]), (hadoop, [1]), (is, [1,1])
  (map, [1]), (mapreduce, [1]), (phase, [1,1])
  (reduce, [1]), (there, [1,1]), (uses, [1])

Reducers then collapse each group into the Output:
  a 2, hadoop 1, is 2
  map 1, mapreduce 1, phase 2
  reduce 1, there 2, uses 1

Map: transform one input to 0-N outputs.
Reduce: collect multiple inputs into one output.

from Think Big Academy's Hadoop Developer Course
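The whole flow above can be simulated in a few lines of ordinary Python (an in-memory sketch of the idea, not Hadoop itself):

```python
from collections import defaultdict

def run_mapreduce(lines, mapper, reducer):
    # Map: each input record produces 0-N (key, value) pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Sort/Shuffle: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: collapse each key's value list into one output record.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

mapper = lambda line: [(word, 1) for word in line.lower().split()]
reducer = lambda word, counts: sum(counts)

lines = ["There is a Map phase",
         "Hadoop uses MapReduce",
         "There is a Reduce phase"]
counts = run_mapreduce(lines, mapper, reducer)
```

Swapping in a different mapper and reducer pair gives a different job; the `run_mapreduce` driver plays the role Hadoop plays for real data.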
Writing MapReduce jobs with RHadoop
Enter RHadoop
• RHadoop is an open source project sponsored by Revolution Analytics
• Package overview:
  - rmr2: all MapReduce-related functions
  - rhdfs: interaction with Hadoop's HDFS file system
  - rhbase: access to the NoSQL HBase database
• rmr2 uses Hadoop's Streaming API to allow R users to write MapReduce jobs in R: it handles all of the I/O and job submission for you (no while(<stdin>)-like loops!)
RHadoop Advantages
• Modular
  - Packages group similar functions
  - Only load (and learn!) what you need
  - Minimizes prerequisites and dependencies
• Open source
  - Cost: low (no) barrier to start using
  - Transparency: development, issue tracker, wiki, etc. hosted on GitHub at https://github.com/RevolutionAnalytics/RHadoop/
• Supported
  - Sponsored by Revolution Analytics
  - Training & professional services available
  - Support available with Revolution R Enterprise subscriptions
wordcount: code

library(rmr2)

map = function(k, lines) {
  words.list = strsplit(lines, '\\s')
  words = unlist(words.list)
  return( keyval(words, 1) )
}

reduce = function(word, counts) {
  keyval(word, sum(counts))
}

wordcount = function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text",
            map = map, reduce = reduce)
}

from Revolution Analytics' Getting Started with RHadoop course
wordcount: submit job and fetch results

Submit job:
> hdfs.root = 'wordcount'
> hdfs.data = file.path(hdfs.root, 'data')
> hdfs.out = file.path(hdfs.root, 'out')
> out = wordcount(hdfs.data, hdfs.out)

Fetch results from HDFS:
> results = from.dfs(out)
> results.df = as.data.frame(results, stringsAsFactors = F)
> colnames(results.df) = c('word', 'count')
> head(results.df)
       word count
1 greatness     2
2    damned     3
3       tis     5
4      jade     1
5  magician     1

from Revolution Analytics' Getting Started with RHadoop course
Code notes
• Scalable
  - Hadoop and MapReduce abstract away system details
  - Code runs on 1 node or 1,000 nodes without modification
• Portable
  - You write normal R code, interacting with normal R objects
  - RHadoop's rmr2 library abstracts away Hadoop details
  - All the functionality you expect is there, including Enterprise R's
• Flexible
  - Only the mapper deals with the data directly
  - All components communicate via key-value pairs
  - The key-value "schema" is chosen for each analysis rather than as a prerequisite to loading data into the system
rmr2 Function Overview
• Convenience
  - keyval(): creates a key-value pair from any two R objects; used to generate output from input formatters, mappers, reducers, etc.
• Input/output
  - from.dfs(), to.dfs(): read/write data from/to HDFS
  - make.input.format(): provides common file parsing (text, CSV) or wraps a user-supplied function
• Job execution
  - mapreduce(): submits the job and returns an HDFS path to the results if successful
rhdfs function overview
• File & directory manipulation
  - hdfs.ls(), hdfs.list.files()
  - hdfs.delete(), hdfs.del(), hdfs.rm()
  - hdfs.dircreate(), hdfs.mkdir()
  - hdfs.chmod(), hdfs.chown(), hdfs.file.info()
  - hdfs.exists()
• Copying, moving & renaming files to/from/within HDFS
  - hdfs.copy(), hdfs.move(), hdfs.rename()
  - hdfs.put(), hdfs.get()
• Reading files directly from HDFS
  - hdfs.file(), hdfs.read(), hdfs.write(), hdfs.flush()
  - hdfs.seek(), hdfs.tell(), hdfs.close()
  - hdfs.line.reader(), hdfs.read.text.file()
• Misc.
  - hdfs.init(), hdfs.defaults()
rhbase function overview
• Initialization
  - hb.init()
• Create and manage tables
  - hb.list.tables(), hb.describe.table()
  - hb.new.table(), hb.delete.table()
• Read and write data
  - hb.insert(), hb.insert.data.frame()
  - hb.get(), hb.get.data.frame(), hb.scan()
  - hb.delete()
• Administrative, etc.
  - hb.defaults(), hb.set.table.mode()
  - hb.regions.table(), hb.compact.table()
Big Data Warehousing with Hive
• Hive supplies a SQL-like query language: very familiar for those with relational database experience
• But Hive compiles, optimizes, and executes these queries as MapReduce jobs on the Hadoop cluster
• Can be used in conjunction with other Hadoop jobs, such as those written with rmr2
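To see why a SQL-style aggregation compiles naturally to MapReduce, consider a GROUP BY with COUNT(*): the grouping column becomes the map output key, and the aggregate becomes the reduce function. This toy Python sketch is my illustration of the idea, not Hive's actual planner:

```python
from collections import defaultdict

def group_by_count(rows, key_column):
    """Evaluate SELECT key, COUNT(*) ... GROUP BY key as a map/reduce pair."""
    # Map: emit (grouping key, 1) for every row.
    mapped = [(row[key_column], 1) for row in rows]
    # Shuffle: bring together all values for each key.
    groups = defaultdict(list)
    for key, one in mapped:
        groups[key].append(one)
    # Reduce: apply the aggregate per key (COUNT(*) is a sum of ones).
    return {key: sum(ones) for key, ones in groups.items()}

# Hypothetical rows standing in for a Hive table with a 'dept' column.
rows = [{"dept": "eng"}, {"dept": "sales"}, {"dept": "eng"}]
result = group_by_count(rows, "dept")
```

Other aggregates (SUM, MAX, AVG) follow the same shape with a different reduce step, which is why Hive can translate so much of SQL onto the cluster.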
Hive architecture & access
[Diagram: clients reach Hive through a terminal (CLI), a browser (HWI), or the Thrift Server, which exposes JDBC and ODBC for tools such as RODBC and RJDBC. The Hive Driver (compiles, optimizes, executes) consults the Metastore and runs the resulting jobs on the Hadoop master's Job Tracker, Name Node, and DFS.]
Accessing Hive via ODBC/JDBC

library(RJDBC)

# set the classpath to include the JDBC driver location, plus commons-logging
[...]
class.path = c(hive.class.path, commons.class.path)
drv = JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
           classPath = class.path, "`")

# make a connection to the running Hive Server:
conn = dbConnect(drv, "jdbc:hive://localhost:10000/default")

# setting the database name in the URL doesn't help,
# so issue a 'use databasename' command:
res = dbSendQuery(conn, 'use mydatabase')

# submit the query and fetch the results as a data.frame:
df = dbGetQuery(conn, 'SELECT name, sub FROM employees
                       LATERAL VIEW explode(subordinates) subView AS sub')
Other ways to use R and Hadoop
• HDFS
  - Revolution Enterprise R can read and write files directly on the distributed file system
  - Files can include ScaleR's XDF-formatted data sets
• MapReduce
  - Many other R packages have been written to use R and Hadoop together, including RHIPE, segue, Oracle's R Connector for Hadoop, etc.
• Hive
  - Hadoop Streaming is also available for Hive to leverage functionality external to Hadoop and Java
  - RHive leverages RServe to connect the two: http://cran.r-project.org/web/packages/RHive/
Big Data ≠ Hadoop
• NoSQL databases offer low-latency, random access to key-values:
  - HBase
  - Cassandra
  - CouchDB
  - MongoDB
  - Accumulo
• Next week, Think Big's Douglas Moore will be presenting at the Boston Storm Meetup: "Predictive Analytics with Storm, Hadoop, R and AWS", http://www.meetup.com/Boston-Storm-Users/events/103506142/
Want to learn more?
• Upcoming public Getting Started with RHadoop 1-day classes
  - Hands-on examples and exercises covering rhdfs, rhbase, and rmr2
  - Algorithms and data include wordcount, analysis of airline flight data, and collaborative filtering using structured and unstructured data from text, CSV files, and Twitter
  - February 25, 2013: Palo Alto, CA
  - March 13, 2013: Boston, MA
  - 25% off with "useR" discount code @ http://bit.ly/rhadoop0313
• Revolution Analytics Quick Start Program for Hadoop
  - Private Getting Started with RHadoop training
  - Onsite consulting assistance for initial use case
  - Revolution R for Hadoop licenses and support
  - More info @ http://bit.ly/rhadoopqs