Introduction of R on Hadoop

A-Tsai (Chung-Tsai Su), SPN, 2013/10/1

TRANSCRIPT

Page 1: Introduction of R on Hadoop

A-Tsai (Chung-Tsai Su) SPN

2013/10/1

Introduction of R on Hadoop

Page 2: Introduction of R on Hadoop

Agenda

•  When Should You Use R?
•  When Should You Consider Hadoop?
•  How to Use R on Hadoop?
   –  RHadoop
   –  R + Hadoop Streaming
   –  RHIPE

•  Demo

•  Conclusions

Page 3: Introduction of R on Hadoop

When Should You Use R?

[Excerpt from the textbook's preface, "What's New in the Second Edition?" (p. xiv), which lists additions such as ggplot2. (Page 16)]

Page 4: Introduction of R on Hadoop

http://3.bp.blogspot.com/-SbrlR5E0tks/UGCxeL_f5YI/AAAAAAAAL3M/lroU3yF-3_0/s1600/BigDataLandscape.png

Page 5: Introduction of R on Hadoop

https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png

Page 6: Introduction of R on Hadoop

When should you consider Hadoop?

[Excerpt from the textbook, Chapter 26 "R and Hadoop" (p. 554), introducing RHadoop. (Page 576)]

Page 7: Introduction of R on Hadoop

(Same content as Page 6.)

Page 8: Introduction of R on Hadoop

(Same content as Page 6.)

Page 9: Introduction of R on Hadoop

RHadoop

Page 10: Introduction of R on Hadoop

Packages of RHadoop

http://revolution-computing.typepad.com/.a/6a010534b1db25970b0154359c29bf970c-800wi

Page 11: Introduction of R on Hadoop

RHadoop

Page 12: Introduction of R on Hadoop

Installation (in textbook)

Installing RHadoop locally with the devtools package:

> library(devtools)
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz")
Installing rmr_1.3.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz
Installing rmr
Installing dependencies for rmr:
...
> # make sure to set HADOOP_HOME to the location of your HADOOP installation,
> # HADOOP_CONF to the location of your hadoop config files, and make sure
> # that the hadoop bin directory is on your path
> Sys.setenv(HADOOP_HOME="/Users/jadler/src/hadoop-0.20.2-cdh3u4")
> Sys.setenv(HADOOP_CONF=paste(Sys.getenv("HADOOP_HOME"),
+   "/conf", sep=""))
> Sys.setenv(PATH=paste(Sys.getenv("PATH"), ":", Sys.getenv("HADOOP_HOME"),
+   "/bin", sep=""))
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz")
Installing rhdfs_1.0.4.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz
Installing rhdfs
...
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz")
Installing rhbase_1.0.4.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz
Installing rhbase

An example RHadoop application

$ # get the file from the CDC
$ wget ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2009us.zip


(Refer to page 581)

Page 13: Introduction of R on Hadoop

Installation

http://blog.fens.me/rhadoop-rhadoop/

•  Download the RHadoop packages from https://github.com/RevolutionAnalytics/RHadoop/wiki
•  $ R CMD javareconf
•  $ R
   –  Install rJava, reshape2, Rcpp, iterators, itertools, digest, RJSONIO, functional, and bitops.
•  > q()
•  $ R CMD INSTALL rhdfs_1.0.6.tar.gz
•  $ R CMD INSTALL rmr2_2.2.2.tar.gz
•  Check whether the installation succeeded (see also the smoke test below):
   –  > library(rhdfs)
   –  > hdfs.init()
   –  > hdfs.ls("/user")
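A further smoke test (a common rmr2 sanity check, not from the slides; it assumes library(rhdfs) and hdfs.init() have already been run as above) pushes a tiny vector through a map-only job and reads it back:

> library(rmr2)
> small.ints <- to.dfs(1:10)
> result <- mapreduce(input = small.ints,
+                     map = function(k, v) keyval(v, v^2))
> from.dfs(result)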

Page 14: Introduction of R on Hadoop

First Example: WordCount
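The WordCount code itself appears in the deck only as a screenshot. A minimal rmr2 sketch of the same idea (the HDFS input path /user/spndc/wordcount/input.txt is hypothetical) would look roughly like this:

library(rmr2)
library(rhdfs)
hdfs.init()

# map: split each input line into words and emit (word, 1)
wc.map <- function(k, lines) {
  keyval(unlist(strsplit(lines, split = " ")), 1)
}

# reduce: sum the counts collected for each word
wc.reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

out <- from.dfs(mapreduce(input = "/user/spndc/wordcount/input.txt",
                          input.format = "text",
                          map = wc.map,
                          reduce = wc.reduce))
head(data.frame(word = keys(out), count = values(out)))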

Page 15: Introduction of R on Hadoop

Hadoop Portal

Page 16: Introduction of R on Hadoop

An example RHadoop application

•  Mortality Public Use File Documentation
   –  The dataset contains a record of every death in the United States, including the cause of death and demographic information about the deceased. (In 2009, the mortality data file was 1.1 GB and contained 2,441,219 records.)

$ wget ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2009us.zip
$ unzip mort2009us.zip
$ hadoop fs -mkdir mort09
$ hadoop fs -copyFromLocal VS09MORT.DUSMCPUB mort09
$ hadoop fs -ls mort09
Found 1 items
-rw-r--r-- 3 jadler supergroup 1196197310 2012-08-02 16:31 /user/jadler/mort09/VS09MORT.DUSMCPUB

Page 17: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09.R (1/3)

$ # unzip the file
$ unzip mort2009us.zip
$ # create a directory on hdfs
$ hadoop fs -mkdir mort09
$ # copy to that directory on hdfs
$ hadoop fs -copyFromLocal VS09MORT.DUSMCPUB mort09
$ # look at the files
$ hadoop fs -ls mort09
Found 1 items
-rw-r--r-- 3 jadler supergroup 1196197310 2012-08-02 16:31 /user/jadler/mort09/VS09MORT.DUSMCPUB

$ head -n 100 VS09MORT.DUSMCPUB > VS09MORT.DUSMCPUB.sample

Each record is a fixed-width line (the layout is the kind read.fwf is designed for). The schema below gives the width of every field; fields to be skipped are given names beginning with .X:

mort.schema <- c(
  .X0=19, ResidentStatus=1, .X1=40, Education1989=2, Education2003=1,
  EducationFlag=1, MonthOfDeath=2, .X2=2, Sex=1, AgeDetail=4,
  AgeSubstitution=1, AgeRecode52=2, AgeRecode27=2, AgeRecode12=2,
  AgeRecodeInfant22=2, PlaceOfDeath=1, MaritalStatus=1, DayOfWeekofDeath=1,
  .X3=16, CurrentDataYear=4, InjuryAtWork=1, MannerOfDeath=1,
  MethodOfDisposition=1, Autopsy=1, .X4=34, ActivityCode=1, PlaceOfInjury=1,
  ICDCode=4, CauseRecode358=3, .X5=1, CauseRecode113=3, CauseRecode130=3,
  CauseRecode39=2, .X6=1, Conditions=281, .X8=1, Race=2, BridgeRaceFlag=1,
  RaceImputationFlag=1, RaceRecode3=1, RaceRecode5=1, .X9=33,
  HispanicOrigin=3, .X10=1, HispanicOriginRecode=1)

> # according to the documentation, each line is 488 characters long
> sum(mort.schema)
[1] 488

The unpack.line function parses one record according to this schema, skipping the .X filler fields:

unpack.line <- function(data, schema) {
  filter.func <- function(x) { substr(x, 1, 2) != ".X" }
  data.pointer <- 1
  output.data <- list()
  for (i in 1:length(schema)) {
    if (filter.func(names(schema)[i])) {
      output.data[[names(schema)[i]]] <- type.convert(
        substr(data, data.pointer, data.pointer + schema[i] - 1),
        as.is = TRUE)
    }
    data.pointer <- data.pointer + schema[i]  # advance past this field
  }
  output.data
}
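To sanity-check the parser locally (a usage sketch, not on the slide), parse the first record of the 100-line sample created above:

> sample.lines <- readLines("VS09MORT.DUSMCPUB.sample")
> record <- unpack.line(sample.lines[1], mort.schema)
> record$Sex
> record$CauseRecode39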


Page 18: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09.R (2/3)

Page 19: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09.R (3/3)

Page 20: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09_1.R (1/4)

Page 21: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09_1.R (2/4)

Page 22: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09_1.R (3/4)

Page 23: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09_1.R (4/4)
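The remaining mort09.R and mort09_1.R listings (Pages 18 through 23) appear in the deck only as screenshots. As a rough sketch only, reusing mort.schema and unpack.line from Page 17 and assuming an age.decode helper like the one referenced in map.R on Page 26, the core rmr2 job that computes the average age at death per cause recode might look like this:

library(rmr2)
library(rhdfs)
hdfs.init()

# map: parse each fixed-width record and emit (cause recode, age at death)
mort.map <- function(k, lines) {
  parsed <- lapply(lines, unpack.line, schema = mort.schema)
  keyval(sapply(parsed, function(p) p[["CauseRecode39"]]),
         sapply(parsed, function(p) age.decode(p[["AgeDetail"]])))
}

# reduce: average the ages observed for each cause of death
mort.reduce <- function(cause, ages) {
  keyval(cause, mean(ages, na.rm = TRUE))
}

avg.age <- mapreduce(input = "mort09",
                     input.format = "text",
                     map = mort.map,
                     reduce = mort.reduce)
from.dfs(avg.age)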

Page 24: Introduction of R on Hadoop

R + Hadoop Streaming

Page 25: Introduction of R on Hadoop

Hadoop Streaming

http://biomedicaloptics.spiedigitallibrary.org/data/Journals/BIOMEDO/23543/125003_1_2.png

Page 26: Introduction of R on Hadoop

[The slide also shows a textbook excerpt on the rmr functions mapreduce and from.dfs; the from.dfs signature is: from.dfs(input, format = "native", to.data.frame = FALSE, vectorized = FALSE, structured = FALSE)]

Hadoop Streaming

#! /usr/bin/env Rscript

mort.schema <- ...

unpack.line <- ...

age.decode <- ...

con <- file("stdin", open="r")
while(length(line <- readLines(con, n=1)) > 0) {
  parsed <- unpack.line(line, mort.schema)
  write(paste(parsed[["CauseRecode39"]],
              age.decode(parsed[["AgeDetail"]]),
              sep="\t"),
        stdout())
}
close(con)


/home/spndc/src/Rhadoop/map.R

Page 27: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/reduce.R

#! /usr/bin/env Rscript

cause.decode <- ...

con <- file("stdin", open="r")

current.key <- NA
cumulative.age <- 0
count <- 0

print.results <- function(k, n, d) {
  write(paste(cause.decode(k), n/d, sep="\t"), stdout())
}

while(length(line <- readLines(con, n=1)) > 0) {
  parsed <- strsplit(line, "\t")
  key <- parsed[[1]][1]
  value <- type.convert(parsed[[1]][2], as.is=TRUE)

  if (is.na(current.key)) {
    current.key <- key
  } else if (current.key != key) {
    print.results(current.key, cumulative.age, count)
    current.key <- key
    cumulative.age <- 0
    count <- 0
  }

  if (!is.na(value)) {
    cumulative.age <- cumulative.age + value
    count <- count + 1
  }
}

close(con)
print.results(current.key, cumulative.age, count)

Make both the map and reduce scripts executable:

$ chmod +x map.R
$ chmod +x reduce.R
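Before submitting the job to the cluster, the two scripts can be exercised locally on a slice of the data (a usage sketch, not from the slides; the sort step stands in for Hadoop's shuffle):

$ head -n 1000 VS09MORT.DUSMCPUB | ./map.R | sort | ./reduce.R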


Page 28: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/streaming.sh

#!/bin/sh

/usr/java/hadoop-1.2.0/bin/hadoop jar \
  /usr/java/hadoop-1.2.0/contrib/streaming/hadoop-streaming-1.2.0.jar \
  -input mort09 \
  -output averagebycondition \
  -mapper map.R \
  -reducer reduce.R \
  -file map.R \
  -file reduce.R

Page 29: Introduction of R on Hadoop

Output

[spndc@localhost hadoop-1.2.0]$ bin/hadoop fs -text averagebycondition/part-00000
Tuberculosis    60.5
Malignant neoplasms of cervix uteri, corpus uteri and ovary    68.0631578947368
Malignant neoplasm of prostate    78.0705882352941
Malignant neoplasms of urinary tract    72.5656565656566
Non-Hodgkin's lymphoma    69.56
Leukemia    72.8674698795181
Other malignant neoplasms    66.8361581920904
Diabetes mellitus    68.2723404255319
Alzheimer's disease    85.419795221843
Hypertensive heart disease with or without renal disease    68.0833333333333
Ischemic heart diseases    72.1750619322874
Other diseases of heart    74.925
Essential    70.468085106383
Cerebrovascular diseases    76.0950639853748
Atherosclerosis    80.12
…
Malignant neoplasm of breast    67.3815789473684
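One way (not shown on the slide) to pull the result back into an R session for further analysis:

> results <- read.delim(pipe("hadoop fs -text averagebycondition/part-00000"),
+                       header = FALSE,
+                       col.names = c("cause", "mean.age"))
> head(results[order(-results$mean.age), ])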

Page 30: Introduction of R on Hadoop

RHIPE

Page 31: Introduction of R on Hadoop

RHIPE

http://www.datadr.org/index.html

Page 32: Introduction of R on Hadoop

Installation

Page 33: Introduction of R on Hadoop

API

RHIPE v0.65.3

Page 34: Introduction of R on Hadoop

Example

RHIPE v0.65.3

Page 35: Introduction of R on Hadoop

Recommendation System

Page 36: Introduction of R on Hadoop

Live Demo

Page 37: Introduction of R on Hadoop

Conclusions

•  RHadoop is a good way to scale out, but it might not be the best way.

•  RHadoop is still in a fast development cycle, so be aware of backward-compatibility issues.

•  So far, SPN has no plan to adopt RHadoop for data analysis.

•  One R enthusiast has suggested that using Pig together with R would be better than using RHadoop directly.

Page 38: Introduction of R on Hadoop

Reference

•  RHadoop Wiki
   –  https://github.com/RevolutionAnalytics/RHadoop/wiki

•  RHIPE
   –  http://www.datadr.org/

•  Rhadoop實踐系列文章 (RHadoop practice article series)
   –  http://blog.fens.me/series-rhadoop/

•  阿貝好威的實驗室 (A-Bei Hao Wei's lab)
   –  http://lab.howie.tw/2013/01/Big-Data-Analytic-Weka-vs-Mahout-vs-R.html

•  R and Hadoop 整合初體驗 (a first look at integrating R and Hadoop)
   –  http://michaelhsu.tw/2013/05/01/r-and-hadoop-%E5%88%9D%E9%AB%94%E9%A9%97/

Page 39: Introduction of R on Hadoop

Thank You

Page 40: Introduction of R on Hadoop

Backup

Page 41: Introduction of R on Hadoop
Page 42: Introduction of R on Hadoop