Transcript
Page 1: Enabling R on Hadoop

© Hortonworks Inc. 2012

Enabling R on Hadoop July 11, 2013

Page 2: Enabling R on Hadoop

Your Presenters

Ravi Mutyala, Systems Architect

Paul Codding, Solutions Engineer

Page 3: Enabling R on Hadoop

Agenda

• A Brief History of R
• How R is Typically Used
• How R is Used with Hadoop
• Getting Started

Page 4: Enabling R on Hadoop

A Brief History of R

Page 5: Enabling R on Hadoop

History of R

S
• 1976: S created by John Chambers (Fortran)
• 1988: S V3 written in C & statistical models included
• 1998: S V4

R
• 1991: R created by Ross Ihaka & Robert Gentleman
• 1997: R Core Group formed
• 2000: R Version 1.0 released

Page 6: Enabling R on Hadoop

How R is Typically Used

Page 7: Enabling R on Hadoop

Main Uses of R

• Statistical Analysis & Modeling
  – Classification
  – Scoring
  – Ranking
  – Clustering
  – Finding relationships
  – Characterization
• Common Uses
  – Interactive Data Analysis
  – General Purpose Statistics
  – Predictive Modeling

Page 8: Enabling R on Hadoop

How R is Used with Hadoop

Page 9: Enabling R on Hadoop

Hadoop Components

[Diagram: Hortonworks Data Platform (HDP) stack]
• Layers: Platform Services, Hadoop Core, Data Services, Operational Services
• Captions: Manage & Operate at Scale; Store, Process and Access Data; Distributed Storage & Processing; Enterprise Readiness (HA, DR, Snapshots, Security, …)
• Components: HDFS, YARN (in 2.0), WebHDFS, MapReduce, HCatalog, Hive, Pig, HBase, Sqoop, Flume, Oozie, Ambari
• Deployment targets: OS, Cloud, VM, Appliance

Page 10: Enabling R on Hadoop

Hadoop Components & R

[Diagram: the HDP stack from the previous slide, highlighting the components R integrates with]

Data Service Components
• Hive
• HBase

Hadoop Core
• MapReduce
• HDFS

Page 11: Enabling R on Hadoop

Options for R on Hadoop

• Options
  – RODBC/RJDBC
  – RHive
  – RHadoop
• Analysis
  – Focus
  – Integration Ease
  – Benefits
  – Limitations

Page 12: Enabling R on Hadoop

RODBC/RJDBC

• Focus
  – SQL Access from R
• Integration Ease
  – Install Hortonworks Hive ODBC Driver
  – Install Hive libraries
• Benefits
  – Low impact on existing R scripts leveraging other DB packages
  – Not required to install Hadoop configuration/binaries on client machines
• Limitations
  – Parallelism limited to Hive
  – Result set size
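
The SQL-access route above might look roughly like the following sketch using RODBC. The DSN name `HiveDSN`, the table `web_logs`, and the column `status_code` are hypothetical, and the helpers `build_summary_query` and `query_hive` are introduced only for illustration. Because result set size is listed as a limitation, the sketch aggregates inside Hive and caps the number of rows returned to R.

```r
# Build an aggregate query so Hive does the heavy lifting and only a
# small summary comes back to the R client.
# Table and column names passed in are hypothetical examples.
build_summary_query <- function(table, group_col, max_rows = 1000L) {
  stopifnot(is.character(table), is.character(group_col), max_rows > 0)
  sprintf("SELECT %s, COUNT(*) AS n FROM %s GROUP BY %s LIMIT %d",
          group_col, table, group_col, as.integer(max_rows))
}

# Run a query through an ODBC data source configured for the Hive driver.
# The DSN must already exist on the client machine.
query_hive <- function(dsn, sql) {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("The RODBC package is not installed")
  }
  channel <- RODBC::odbcConnect(dsn)
  if (!inherits(channel, "RODBC")) stop("Could not connect to DSN: ", dsn)
  on.exit(RODBC::odbcClose(channel))
  RODBC::sqlQuery(channel, sql)
}

sql <- build_summary_query("web_logs", "status_code", max_rows = 100)
print(sql)
# With a configured DSN (hypothetical name):
# result <- query_hive("HiveDSN", sql)
```

Existing scripts that already call `sqlQuery` against another database would mostly need only a different DSN, which is the "low impact" benefit the slide refers to.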

Page 13: Enabling R on Hadoop

Deployment Considerations

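[Diagram: cluster layout with worker nodes each running TaskTracker (TT) and DataNode (DN), plus JobTracker (JT), NameNode (NN), and HiveServer (HS)]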

Page 14: Enabling R on Hadoop

RHive

• Focus
  – Broad access to Hive and HDFS
• Integration Ease
  – Requires Hadoop binaries, libraries, and configuration files on client machines
  – Uses Java DFS Client and HiveServer
• Benefits
  – Wide range of features expressed through HQL
  – rhive-apply: distributed R apply function using HQL
• Limitations
  – Requires heavy client deployment
  – Dependent on HiveServer; can’t be used with HiveServer2

Page 15: Enabling R on Hadoop

Deployment Considerations

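[Diagram: worker nodes each running TaskTracker (TT) and DataNode (DN); JobTracker (JT), NameNode (NN), and HiveServer (HS) as master services; R runs on an edge node]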

Page 16: Enabling R on Hadoop

RHadoop

• Focus
  – Tight integration with core Hadoop components
• Benefits
  – Ability to run R on a massively distributed system
  – Ability to work with full data sets instead of sample sets
• Additional Information
  – https://github.com/RevolutionAnalytics/RHadoop/wiki

Page 17: Enabling R on Hadoop

RHadoop Architecture

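[Diagram: an R client uses rhdfs to reach HDFS, rhbase to reach HBase through the HBase Thrift Gateway, and rmr2 to run MapReduce jobs through Hadoop Streaming, which launches R processes on the cluster]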

Page 18: Enabling R on Hadoop

rhdfs

• Access HDFS from R
• Read from HDFS to R data frame
• Write from R data frame to HDFS
• 1.0.6 adds support for Windows (using HDP)
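
As a rough illustration of reading HDFS content into a data frame, the sketch below separates the parsing step (plain base R, runnable anywhere) from the HDFS read. The helpers `raw_to_df` and `read_hdfs_csv` are hypothetical names introduced here. The rhdfs calls shown (`hdfs.init`, `hdfs.file`, `hdfs.read`, `hdfs.close`) are used as I understand them, and the sketch assumes the file is small enough to come back from a single read.

```r
# Parse raw bytes (as returned by an HDFS read) into a data frame.
# Pure base R, so it can be tried without a cluster.
raw_to_df <- function(bytes, header = TRUE) {
  read.csv(text = rawToChar(bytes), header = header,
           stringsAsFactors = FALSE)
}

# Read a CSV file from HDFS into a data frame using rhdfs.
# Assumes HADOOP_CMD is set and the file fits in one read buffer.
read_hdfs_csv <- function(path) {
  if (!requireNamespace("rhdfs", quietly = TRUE)) {
    stop("The rhdfs package is not installed")
  }
  rhdfs::hdfs.init()
  f <- rhdfs::hdfs.file(path, "r")
  on.exit(rhdfs::hdfs.close(f))
  raw_to_df(rhdfs::hdfs.read(f))
}

# Local demonstration with bytes that stand in for HDFS content.
demo_bytes <- charToRaw("id,score\n1,0.5\n2,0.9\n")
df <- raw_to_df(demo_bytes)
print(df)
```

Larger files would need a loop over successive reads, which is omitted here to keep the sketch short.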

Page 19: Enabling R on Hadoop

rhdfs

• Hadoop CLI commands and their rhdfs equivalents
• hadoop fs -ls /
  – hdfs.ls("/")
• hadoop fs -mkdir /user/rhdfs/ppt
  – hdfs.mkdir("/user/rhdfs/ppt")
• hadoop fs -put 1.txt /user/rhdfs/ppt/
  – localData <- system.file(file.path("unitTestData", "1.txt"), package = "rhdfs")
  – hdfs.put(localData, "/user/rhdfs/ppt/1.txt")
• hadoop fs -get /user/rhdfs/ppt/1.txt 1.txt
  – hdfs.get("/user/rhdfs/ppt/1.txt", "test")
• hadoop fs -rm /user/rhdfs/ppt/1.txt
  – hdfs.delete("/user/rhdfs/ppt/1.txt")

Page 20: Enabling R on Hadoop

rhbase

• Access and change data within HBase
• Uses Thrift API
• Command Examples
  – hb.new.table
  – hb.insert
  – hb.scan.ex
  – hb.scan

Page 21: Enabling R on Hadoop

rmr2

• Enables writing MapReduce jobs using R
• Ability to parallelize algorithms
• Ability to use big data sets without needing to sample data
• mapreduce(input, output, map, reduce, …)
• The reduce function takes a key and a collection of values, which can be a vector, list, data frame, or matrix
• 2.2.1 adds support for Windows (using HDP)

Page 22: Enabling R on Hadoop

Sample code - wordcount

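library(rmr2)

# The slide omitted the enclosing function; it is restored here so that
# input, output, and pattern are defined and the final brace is balanced.
wordcount =
  function(input, output = NULL, pattern = " ") {
    # Map: split each line into words and emit (word, 1)
    wc.map =
      function(., lines) {
        keyval(
          unlist(
            strsplit(
              x = lines,
              split = pattern)),
          1)}
    # Reduce: sum the counts collected for each word
    wc.reduce =
      function(word, counts) {
        keyval(word, sum(counts))}
    mapreduce(
      input = input,
      output = output,
      input.format = "text",
      map = wc.map,
      reduce = wc.reduce,
      combine = TRUE)}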
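
To make the data flow of the word count easier to follow, here is a single-process imitation of the map, shuffle, and reduce phases in base R. It does not use rmr2 or Hadoop at all: `local_keyval` and `local_mapreduce` are hypothetical stand-ins written only for this illustration, and they are not how rmr2 is implemented.

```r
# Local, single-process imitation of the map -> shuffle -> reduce flow.
# local_keyval and local_mapreduce are hypothetical helpers, not rmr2 API.
local_keyval <- function(key, val) {
  # Recycle val so that every key has a value, as in keyval(words, 1).
  data.frame(key = key, val = rep_len(val, length(key)),
             stringsAsFactors = FALSE)
}

local_mapreduce <- function(lines, map, reduce) {
  mapped <- map(NULL, lines)                     # map phase
  groups <- split(mapped$val, mapped$key)        # shuffle: group values by key
  reduced <- Map(reduce, names(groups), groups)  # reduce phase, once per key
  do.call(rbind, unname(reduced))
}

# Same shape as the slide's functions, with a fixed split pattern.
wc.map <- function(., lines) {
  local_keyval(unlist(strsplit(x = lines, split = " ")), 1)
}
wc.reduce <- function(word, counts) {
  local_keyval(word, sum(counts))
}

lines <- c("hadoop and r", "r on hadoop")
counts <- local_mapreduce(lines, wc.map, wc.reduce)
print(counts)
```

On a cluster the grouping step is performed by Hadoop between the map and reduce tasks; here `split` plays that role in memory.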

Page 23: Enabling R on Hadoop

More Sample Code

# Count group sizes locally with base R
groups = rbinom(32, n = 50, prob = 0.4)
tapply(groups, groups, length)

# The same count expressed as a MapReduce job with rmr2
groups = to.dfs(groups)
from.dfs(
  mapreduce(
    input = groups,
    map = function(., v) keyval(v, 1),
    reduce =
      function(k, vv)
        keyval(k, length(vv))))

Page 24: Enabling R on Hadoop

Deployment Considerations

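[Diagram: worker nodes each running TaskTracker (TT), DataNode (DN), and RegionServer (RS), with R installed on each worker; JobTracker (JT), NameNode (NN), and the HBase Thrift Gateway alongside them; R also on an edge node]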

Page 25: Enabling R on Hadoop

RHadoop

• Limitations
  – Requires installation of R on all TaskTracker nodes
  – Does not automatically parallelize algorithms
  – Different slot/memory configuration recommended to leave memory and CPU resources for R

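[Diagram: a node whose resources are split between the OS and MapReduce, compared with a node whose resources are split between the OS, MapReduce, and R]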
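
The slot/memory recommendation can be illustrated with some back-of-the-envelope arithmetic. This is a memory-only sketch that ignores CPU, and every number in it (48 GB nodes, 8 GB reserved, 1 GB per task JVM, 1 GB per R process) is a hypothetical example value rather than a recommendation from the presenters. `slots_per_node` is a helper invented for this illustration.

```r
# Rough per-node slot budget when each task JVM also launches an R process.
# Every number used below is a hypothetical example value.
slots_per_node <- function(node_mem_gb, reserved_gb, jvm_heap_gb, r_proc_gb) {
  stopifnot(node_mem_gb > reserved_gb, jvm_heap_gb > 0, r_proc_gb >= 0)
  usable <- node_mem_gb - reserved_gb
  floor(usable / (jvm_heap_gb + r_proc_gb))
}

# Without R: 48 GB node, 8 GB reserved for OS and daemons, 1 GB heap per task.
without_r <- slots_per_node(48, 8, jvm_heap_gb = 1, r_proc_gb = 0)
# With R: same node, plus 1 GB set aside for the R process of each task.
with_r <- slots_per_node(48, 8, jvm_heap_gb = 1, r_proc_gb = 1)
cat("slots without R:", without_r, "\nslots with R:", with_r, "\n")
```

Under these assumed numbers the slot count halves once R is accounted for, which is why the slide suggests a different configuration on nodes that run R tasks.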

Page 26: Enabling R on Hadoop

Getting Started

Page 27: Enabling R on Hadoop

Your Fastest On-ramp to Enterprise Hadoop™!

The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop: no data center, no cloud, and no internet connection needed. The Hortonworks Sandbox is:
• A free download: http://hortonworks.com/products/hortonworks-sandbox/
• A complete, self-contained virtual machine with Apache Hadoop pre-configured
• A personal, portable, and standalone Hadoop environment
• A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop

Page 28: Enabling R on Hadoop

Installation

• Install R on all nodes
• Install dependent packages
  – RJSONIO
  – itertools
  – digest
  – Rcpp
  – rJava
  – functional
  – RCurl
  – httr
  – plyr
• Download & Install RHadoop Packages
  – rmr2
  – rhdfs
  – rhbase (requires Thrift)

Page 29: Enabling R on Hadoop

Questions & Answers

TRY Download HDP at hortonworks.com

LEARN Applying Data Science using Apache Hadoop Training

FOLLOW twitter: @hortonworks Facebook: facebook.com/hortonworks

Further questions & comments: [email protected]

[email protected]

