Introduction of R on Hadoop

A-Tsai (Chung-Tsai Su), SPN, 2013/10/1

TRANSCRIPT

Page 1: Introduction of R on Hadoop

A-Tsai (Chung-Tsai Su) SPN

2013/10/1

Introduction of R on Hadoop

Page 2: Introduction of R on Hadoop

Agenda

•  When Should You Use R?
•  When Should You Consider Hadoop?
•  How to Use R on Hadoop?
   –  RHadoop
   –  R + Hadoop Streaming
   –  RHIPE

•  Demo

•  Conclusions

Page 3: Introduction of R on Hadoop

When Should You Use R?

[Excerpt from the textbook's preface, "What's New in the Second Edition?" (p. xiv), which lists additions such as ggplot2. (Page 16)]

Page 4: Introduction of R on Hadoop

http://3.bp.blogspot.com/-SbrlR5E0tks/UGCxeL_f5YI/AAAAAAAAL3M/lroU3yF-3_0/s1600/BigDataLandscape.png

Page 5: Introduction of R on Hadoop

https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png

Page 6: Introduction of R on Hadoop

When should you consider Hadoop?

[Excerpt from the textbook, Chapter 26 "R and Hadoop" (p. 554), introducing RHadoop. (Page 576)]

Page 7: Introduction of R on Hadoop

(Same content as Page 6.)

Page 8: Introduction of R on Hadoop

(Same content as Page 6.)

Page 9: Introduction of R on Hadoop

RHadoop

Page 10: Introduction of R on Hadoop

Packages of RHadoop

http://revolution-computing.typepad.com/.a/6a010534b1db25970b0154359c29bf970c-800wi

Page 11: Introduction of R on Hadoop

RHadoop

Page 12: Introduction of R on Hadoop

Installation (in textbook)

Installing RHadoop locally with the devtools package:

> library(devtools)
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz")
Installing rmr_1.3.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.tar.gz
Installing rmr
Installing dependencies for rmr:
...
> # make sure to set HADOOP_HOME to the location of your HADOOP installation,
> # HADOOP_CONF to the location of your hadoop config files, and make sure
> # that the hadoop bin directory is on your path
> Sys.setenv(HADOOP_HOME="/Users/jadler/src/hadoop-0.20.2-cdh3u4")
> Sys.setenv(HADOOP_CONF=paste(Sys.getenv("HADOOP_HOME"),
+   "/conf", sep=""))
> Sys.setenv(PATH=paste(Sys.getenv("PATH"), ":", Sys.getenv("HADOOP_HOME"),
+   "/bin", sep=""))
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz")
Installing rhdfs_1.0.4.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz
Installing rhdfs
...
> install_url("https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz")
Installing rhbase_1.0.4.tar.gz from https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz
Installing rhbase

An example RHadoop application

$ # get the file from the CDC
$ wget ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2009us.zip


(Refer to page 581)

Page 13: Introduction of R on Hadoop

Installation

http://blog.fens.me/rhadoop-rhadoop/

•  Download the RHadoop packages from https://github.com/RevolutionAnalytics/RHadoop/wiki
•  $ R CMD javareconf
•  $ R
   –  Install rJava, reshape2, Rcpp, iterators, itertools, digest, RJSONIO, functional, and bitops.
•  > q()
•  $ R CMD INSTALL rhdfs_1.0.6.tar.gz
•  $ R CMD INSTALL rmr2_2.2.2.tar.gz
•  Check whether the installation succeeded (see also the smoke test below):
   –  > library(rhdfs)
   –  > hdfs.init()
   –  > hdfs.ls("/user")
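A further smoke test (a common rmr2 sanity check, not from the slides; it assumes library(rhdfs) and hdfs.init() have already been run as above) pushes a tiny vector through a map-only job and reads it back:

> library(rmr2)
> small.ints <- to.dfs(1:10)
> result <- mapreduce(input = small.ints,
+                     map = function(k, v) keyval(v, v^2))
> from.dfs(result)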

Page 14: Introduction of R on Hadoop

First Example: WordCount
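The WordCount code itself appears in the deck only as a screenshot. A minimal rmr2 sketch of the same idea (the HDFS input path /user/spndc/wordcount/input.txt is hypothetical) would look roughly like this:

library(rmr2)
library(rhdfs)
hdfs.init()

# map: split each input line into words and emit (word, 1)
wc.map <- function(k, lines) {
  keyval(unlist(strsplit(lines, split = " ")), 1)
}

# reduce: sum the counts collected for each word
wc.reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

out <- from.dfs(mapreduce(input = "/user/spndc/wordcount/input.txt",
                          input.format = "text",
                          map = wc.map,
                          reduce = wc.reduce))
head(data.frame(word = keys(out), count = values(out)))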

Page 15: Introduction of R on Hadoop

Hadoop Portal

Page 16: Introduction of R on Hadoop

An example RHadoop application

•  Mortality Public Use File Documentation
   –  The dataset contains a record of every death in the United States, including the cause of death and demographic information about the deceased. (In 2009, the mortality data file was 1.1 GB and contained 2,441,219 records.)

$ wget ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/DVS/mortality/mort2009us.zip
$ unzip mort2009us.zip
$ hadoop fs -mkdir mort09
$ hadoop fs -copyFromLocal VS09MORT.DUSMCPUB mort09
$ hadoop fs -ls mort09
Found 1 items
-rw-r--r-- 3 jadler supergroup 1196197310 2012-08-02 16:31 /user/jadler/mort09/VS09MORT.DUSMCPUB

Page 17: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09.R (1/3)

$ # unzip the file
$ unzip mort2009us.zip
$ # create a directory on hdfs
$ hadoop fs -mkdir mort09
$ # copy to that directory on hdfs
$ hadoop fs -copyFromLocal VS09MORT.DUSMCPUB mort09
$ # look at the files
$ hadoop fs -ls mort09
Found 1 items
-rw-r--r-- 3 jadler supergroup 1196197310 2012-08-02 16:31 /user/jadler/mort09/VS09MORT.DUSMCPUB

$ head -n 100 VS09MORT.DUSMCPUB > VS09MORT.DUSMCPUB.sample

Each record is a fixed-width line (the layout is the kind read.fwf is designed for). The schema below gives the width of every field; fields to be skipped are given names beginning with .X:

mort.schema <- c(
  .X0=19, ResidentStatus=1, .X1=40, Education1989=2, Education2003=1,
  EducationFlag=1, MonthOfDeath=2, .X2=2, Sex=1, AgeDetail=4,
  AgeSubstitution=1, AgeRecode52=2, AgeRecode27=2, AgeRecode12=2,
  AgeRecodeInfant22=2, PlaceOfDeath=1, MaritalStatus=1, DayOfWeekofDeath=1,
  .X3=16, CurrentDataYear=4, InjuryAtWork=1, MannerOfDeath=1,
  MethodOfDisposition=1, Autopsy=1, .X4=34, ActivityCode=1, PlaceOfInjury=1,
  ICDCode=4, CauseRecode358=3, .X5=1, CauseRecode113=3, CauseRecode130=3,
  CauseRecode39=2, .X6=1, Conditions=281, .X8=1, Race=2, BridgeRaceFlag=1,
  RaceImputationFlag=1, RaceRecode3=1, RaceRecode5=1, .X9=33,
  HispanicOrigin=3, .X10=1, HispanicOriginRecode=1)

> # according to the documentation, each line is 488 characters long
> sum(mort.schema)
[1] 488

The unpack.line function parses one record according to this schema, skipping the .X filler fields:

unpack.line <- function(data, schema) {
  filter.func <- function(x) { substr(x, 1, 2) != ".X" }
  data.pointer <- 1
  output.data <- list()
  for (i in 1:length(schema)) {
    if (filter.func(names(schema)[i])) {
      output.data[[names(schema)[i]]] <- type.convert(
        substr(data, data.pointer, data.pointer + schema[i] - 1),
        as.is = TRUE)
    }
    data.pointer <- data.pointer + schema[i]  # advance past this field
  }
  output.data
}
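To sanity-check the parser locally (a usage sketch, not on the slide), parse the first record of the 100-line sample created above:

> sample.lines <- readLines("VS09MORT.DUSMCPUB.sample")
> record <- unpack.line(sample.lines[1], mort.schema)
> record$Sex
> record$CauseRecode39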


Page 18: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09.R (2/3)

Page 19: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09.R (3/3)

Page 20: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09_1.R (1/4)

Page 21: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09_1.R (2/4)

Page 22: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09_1.R (3/4)

Page 23: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/mort09_1.R (4/4)
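The remaining mort09.R and mort09_1.R listings (Pages 18 through 23) appear in the deck only as screenshots. As a rough sketch only, reusing mort.schema and unpack.line from Page 17 and assuming an age.decode helper like the one referenced in map.R on Page 26, the core rmr2 job that computes the average age at death per cause recode might look like this:

library(rmr2)
library(rhdfs)
hdfs.init()

# map: parse each fixed-width record and emit (cause recode, age at death)
mort.map <- function(k, lines) {
  parsed <- lapply(lines, unpack.line, schema = mort.schema)
  keyval(sapply(parsed, function(p) p[["CauseRecode39"]]),
         sapply(parsed, function(p) age.decode(p[["AgeDetail"]])))
}

# reduce: average the ages observed for each cause of death
mort.reduce <- function(cause, ages) {
  keyval(cause, mean(ages, na.rm = TRUE))
}

avg.age <- mapreduce(input = "mort09",
                     input.format = "text",
                     map = mort.map,
                     reduce = mort.reduce)
from.dfs(avg.age)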

Page 24: Introduction of R on Hadoop

R + Hadoop Streaming

Page 25: Introduction of R on Hadoop

Hadoop Streaming

http://biomedicaloptics.spiedigitallibrary.org/data/Journals/BIOMEDO/23543/125003_1_2.png

Page 26: Introduction of R on Hadoop

[The slide also shows a textbook excerpt on the rmr functions mapreduce and from.dfs; the from.dfs signature is: from.dfs(input, format = "native", to.data.frame = FALSE, vectorized = FALSE, structured = FALSE)]

Hadoop Streaming

#! /usr/bin/env Rscript

mort.schema <- ...

unpack.line <- ...

age.decode <- ...

con <- file("stdin", open="r")
while(length(line <- readLines(con, n=1)) > 0) {
  parsed <- unpack.line(line, mort.schema)
  write(paste(parsed[["CauseRecode39"]],
              age.decode(parsed[["AgeDetail"]]),
              sep="\t"),
        stdout())
}
close(con)


/home/spndc/src/Rhadoop/map.R

Page 27: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/reduce.R

#! /usr/bin/env Rscript

cause.decode <- ...

con <- file("stdin", open="r")

current.key <- NA
cumulative.age <- 0
count <- 0

print.results <- function(k, n, d) {
  write(paste(cause.decode(k), n/d, sep="\t"), stdout())
}

while(length(line <- readLines(con, n=1)) > 0) {
  parsed <- strsplit(line, "\t")
  key <- parsed[[1]][1]
  value <- type.convert(parsed[[1]][2], as.is=TRUE)

  if (is.na(current.key)) {
    current.key <- key
  } else if (current.key != key) {
    print.results(current.key, cumulative.age, count)
    current.key <- key
    cumulative.age <- 0
    count <- 0
  }

  if (!is.na(value)) {
    cumulative.age <- cumulative.age + value
    count <- count + 1
  }
}

close(con)
print.results(current.key, cumulative.age, count)

Make both the map and reduce scripts executable:

$ chmod +x map.R
$ chmod +x reduce.R
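Before submitting the job to the cluster, the two scripts can be exercised locally on a slice of the data (a usage sketch, not from the slides; the sort step stands in for Hadoop's shuffle):

$ head -n 1000 VS09MORT.DUSMCPUB | ./map.R | sort | ./reduce.R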


Page 28: Introduction of R on Hadoop

/home/spndc/src/Rhadoop/streaming.sh

#!/bin/sh

/usr/java/hadoop-1.2.0/bin/hadoop jar \
  /usr/java/hadoop-1.2.0/contrib/streaming/hadoop-streaming-1.2.0.jar \
  -input mort09 \
  -output averagebycondition \
  -mapper map.R \
  -reducer reduce.R \
  -file map.R \
  -file reduce.R

Page 29: Introduction of R on Hadoop

Output

[spndc@localhost hadoop-1.2.0]$ bin/hadoop fs -text averagebycondition/part-00000
Tuberculosis    60.5
Malignant neoplasms of cervix uteri, corpus uteri and ovary    68.0631578947368
Malignant neoplasm of prostate    78.0705882352941
Malignant neoplasms of urinary tract    72.5656565656566
Non-Hodgkin's lymphoma    69.56
Leukemia    72.8674698795181
Other malignant neoplasms    66.8361581920904
Diabetes mellitus    68.2723404255319
Alzheimer's disease    85.419795221843
Hypertensive heart disease with or without renal disease    68.0833333333333
Ischemic heart diseases    72.1750619322874
Other diseases of heart    74.925
Essential    70.468085106383
Cerebrovascular diseases    76.0950639853748
Atherosclerosis    80.12
…
Malignant neoplasm of breast    67.3815789473684
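One way (not shown on the slide) to pull the result back into an R session for further analysis:

> results <- read.delim(pipe("hadoop fs -text averagebycondition/part-00000"),
+                       header = FALSE,
+                       col.names = c("cause", "mean.age"))
> head(results[order(-results$mean.age), ])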

Page 30: Introduction of R on Hadoop

RHIPE

Page 31: Introduction of R on Hadoop

RHIPE

http://www.datadr.org/index.html

Page 32: Introduction of R on Hadoop

Installation

Page 33: Introduction of R on Hadoop

API

RHIPE v0.65.3

Page 34: Introduction of R on Hadoop

Example

RHIPE v0.65.3

Page 35: Introduction of R on Hadoop

Recommendation System

Page 36: Introduction of R on Hadoop

Live Demo

Page 37: Introduction of R on Hadoop

Conclusions

•  RHadoop is a good way to scale out, but it might not be the best way.

•  RHadoop is still in a fast development cycle, so be aware of backward-compatibility issues.

•  So far, SPN has no plan to adopt RHadoop for data analysis.

•  One R enthusiast has suggested that using Pig together with R would be better than using RHadoop directly.

Page 38: Introduction of R on Hadoop

Reference

•  RHadoop Wiki
   –  https://github.com/RevolutionAnalytics/RHadoop/wiki

•  RHIPE
   –  http://www.datadr.org/

•  Rhadoop實踐系列文章 (RHadoop practice article series)
   –  http://blog.fens.me/series-rhadoop/

•  阿貝好威的實驗室 (A-Bei Hao Wei's lab)
   –  http://lab.howie.tw/2013/01/Big-Data-Analytic-Weka-vs-Mahout-vs-R.html

•  R and Hadoop 整合初體驗 (a first look at integrating R and Hadoop)
   –  http://michaelhsu.tw/2013/05/01/r-and-hadoop-%E5%88%9D%E9%AB%94%E9%A9%97/

Page 39: Introduction of R on Hadoop

Thank You

Page 40: Introduction of R on Hadoop

Backup

Page 41: Introduction of R on Hadoop
Page 42: Introduction of R on Hadoop