r belgium 20121116-awson-cloud-beamer

Upload: jean-baptiste-poullet

Post on 07-Aug-2015

Page 1: R belgium 20121116-awson-cloud-beamer

Getting started on Amazon cloud

Some concrete applications using Hadoop

About RBelgium

R on Amazon cloud

Jean-Baptiste Poullet (RBelgium Founder and Co-Organizer)

2012

Jean-Baptiste Poullet (RBelgium Founder and Co-Organizer) R on Amazon cloud

Page 2: R belgium 20121116-awson-cloud-beamer

Outline

1 Getting started on Amazon cloud

2 Some concrete applications using Hadoop

3 About RBelgium

Page 3: R belgium 20121116-awson-cloud-beamer

Basics on AWS

Register for an AWS account with EC2 and S3 (http://aws.amazon.com/)

Account Number, Access Key ID, Secret Access Key, X.509 Certificate

S3, EC2, EMR, ...

Lost, or need more information? See http://aws.amazon.com/documentation/gettingstarted/

http://www.bucketexplorer.com/documentation/amazon-s3--what-is-my-aws-access-and-secret-key.html

http://www.yusufhm.info/content/adding-x509-certificate-aws-iam-user-api-command-line-tools-0

...

Page 7: R belgium 20121116-awson-cloud-beamer

Why AWS?

Simple to use: just start up an instance from an AMI

Elastic: Auto-scaling groups (RAM, CPU) + Load balancing (I/O) + Elastic IPs

On demand: any time, whatever you want (limited to 20 EC2 instances unless you request more); normal, spot, reserved and EBS-optimized instances (see http://aws.amazon.com/ec2/)

Page 8: R belgium 20121116-awson-cloud-beamer

Which AMI(s)? (1/2)

Bioconductor on Amazon cloud: http://bioconductor.org/help/bioconductor-cloud-ami/

MPI cluster on Amazon:

Example

    library(Rmpi)
    # Spawn R worker processes on the MPI cluster
    mpi.spawn.Rslaves()
    # Apply the function in parallel, one task per slot in the MPI universe
    mpi.parLapply(1:mpi.universe.size(), function(x) x + 1)
    # Shut the workers down and leave MPI cleanly
    mpi.close.Rslaves()
    mpi.quit()

Listing 1: ’Rmpi’ on EC2

Page 9: R belgium 20121116-awson-cloud-beamer

Which AMI(s)? (2/2)

Parallel cluster on Amazon:

Example

    library(parallel)
    # Private IP addresses of the EC2 instances acting as workers
    cl <- makePSOCKcluster(c('10.68.155.30', '10.68.155.45', '10.68.155.65'))
    # Run myfunc with the same arguments on every worker (myfunc, arg1, arg2, ... are placeholders)
    clusterCall(cl, myfunc, arg1, arg2, ...)
    stopCluster(cl)

Listing 2: ’parallel’ on EC2
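A minimal local sketch of the same pattern, assuming two worker processes on the local machine rather than remote EC2 hosts (remote hosts would additionally need passwordless SSH and R installed on each node). The function `myfunc` and its arguments are hypothetical stand-ins for the slide's placeholders.

```r
library(parallel)

# Hypothetical worker function; stands in for the slide's `myfunc`.
myfunc <- function(a, b) a + b

# Two local workers instead of remote EC2 hosts (assumption for illustration).
cl <- makePSOCKcluster(2)

# clusterCall runs the same call on every worker and returns one result per worker.
same_on_all <- clusterCall(cl, myfunc, 1, 2)

# parLapply splits the input across the workers, like a parallel lapply.
squares <- parLapply(cl, 1:4, function(x) x^2)

stopCluster(cl)
```

Passing the function and its arguments separately to `clusterCall` makes the evaluation happen on the workers rather than on the master.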

Hadoop cluster on Amazon with RHadoop: https://github.com/RevolutionAnalytics/RHadoop/tree/master/rmr2/pkg/tools

Storm cluster on Amazon: https://github.com/nathanmarz/storm-deploy

SAP Hana (http://aws.amazon.com/sap/), Oracle R Enterprise (Hadoop for batch + NoSQL for real-time), etc.

Page 10: R belgium 20121116-awson-cloud-beamer

Using rmr2 in Hadoop framework (1/4)

Toy case

Solve X β = y (least squares) via the normal equations:

β = solve(t(X) %*% X, t(X) %*% y)

Example

    library(rmr2)
    # Write a 200 x 10 random matrix to the distributed file system
    X = to.dfs(matrix(rnorm(2000), ncol = 10))
    # Response vector, kept in local memory
    y = as.matrix(rnorm(200))

Listing 3: initializing variables
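Why this toy case suits MapReduce can be sketched as follows. The row-chunk notation $X_i$, $y_i$ is introduced here for illustration and is not on the slides: if $X$ is split by rows into chunks $X_i$ with matching pieces $y_i$ of $y$, both matrices in the normal equations are sums of per-chunk terms, so each chunk can be processed independently and the results added up.

```latex
\begin{aligned}
X^{\top}X\,\hat\beta &= X^{\top}y
\quad\Longrightarrow\quad
\hat\beta = (X^{\top}X)^{-1}X^{\top}y,\\
X^{\top}X &= \sum_i X_i^{\top}X_i,
\qquad
X^{\top}y = \sum_i X_i^{\top}y_i .
\end{aligned}
```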

Page 14: R belgium 20121116-awson-cloud-beamer

Using rmr2 in Hadoop framework (2/4)

Example

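    # Reducer (also used as combiner): sum the per-chunk partial products.
    # The slide leaves the reducer as a placeholder; summing is assumed here.
    Sum = function(k, YY) keyval(1, list(Reduce("+", YY)))
    tXX =
      values(
        from.dfs(
          mapreduce(
            input = X,
            # Map: each chunk Xi of rows contributes t(Xi) %*% Xi
            map = function(k, Xi) keyval(1, list(t(Xi) %*% Xi)),
            reduce = Sum,
            combine = TRUE)))[[1]]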

Listing 7: ’rmr2’ matrix multiplication

Page 15: R belgium 20121116-awson-cloud-beamer

Using rmr2 in Hadoop framework (3/4)

Example

    # Note: t(Xi) %*% y only conforms when Xi holds all 200 rows
    # (a single chunk), as is the case for this small toy matrix.
    tXy =
      values(
        from.dfs(
          mapreduce(
            input = X,
            map = function(k, Xi)
              keyval(1, list(t(Xi) %*% y)),
            combine = TRUE)))[[1]]
    # Solve the normal equations for beta
    solve(tXX, tXy)

Listing 8: ’rmr2’ solving
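The map and reduce steps of the two listings above can be checked without Hadoop. The following is a sketch in base R only, assuming a hypothetical split of the rows into four chunks of 50 (Hadoop chooses its own chunking) and pairing every chunk of X with the matching rows of y.

```r
set.seed(1)
X <- matrix(rnorm(2000), ncol = 10)   # same shape as the slide's toy data
y <- as.matrix(rnorm(200))

# Hypothetical chunking: four blocks of 50 consecutive rows.
chunks <- split(seq_len(nrow(X)), rep(1:4, each = 50))

# "Map": one partial product per chunk.
partXX <- lapply(chunks, function(i) t(X[i, , drop = FALSE]) %*% X[i, , drop = FALSE])
partXy <- lapply(chunks, function(i) t(X[i, , drop = FALSE]) %*% y[i, , drop = FALSE])

# "Reduce": sum the partial products.
tXX <- Reduce(`+`, partXX)
tXy <- Reduce(`+`, partXy)

# Solve the normal equations.
beta <- solve(tXX, tXy)
```

The chunked result agrees with the direct computation `solve(t(X) %*% X, t(X) %*% y)` up to floating-point rounding, which is what makes the per-chunk decomposition safe to distribute.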

Page 16: R belgium 20121116-awson-cloud-beamer

How to debug (4/4)

Debugging

rmr.str(varName)

Page 17: R belgium 20121116-awson-cloud-beamer

R on EMR with segue package

Example

    library(segue)
    # Replace the two strings with your own AWS credentials
    setCredentials("accessKey", "secretAccessKey")
    myCluster <- createCluster(numInstances = 1, masterInstanceType = "m1.small",
                               slaveInstanceType = "m1.small", location = "us-east-1a")
    # Apply myfunc to each element of dataList on the EMR cluster
    ResultList <- emrlapply(myCluster, dataList, myfunc)
    # Stop the cluster to avoid further charges
    stopCluster(myCluster)

Listing 9: R on EMR with ’segue’

Page 18: R belgium 20121116-awson-cloud-beamer

R on EMR using the API command (1/3)

Upload the numberList file (integers from 1 to 100, one integer per line) and the R scripts "mapper.r" and "reducer.r" to your AWS S3 bucket

Run the following command in bash:

Example

    ./elastic-mapreduce --create --stream \
        --input s3://yourbucket/numberList.txt \
        --mapper s3://yourbucket/mapper.r \
        --reducer s3://yourbucket/reducer.r \
        --output s3://emroutr1vv/myresults \
        --name EMRexampleR1 --num-instances 1

Listing 10: Running R on EMR

Page 19: R belgium 20121116-awson-cloud-beamer

R on EMR using the API command (2/3)

Example

    #! /usr/bin/env Rscript
    # Strip leading and trailing spaces from a line
    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
    con <- file("stdin", open = "r")
    # Read stdin line by line and emit "<number><TAB>" for each input line
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      line <- trimWhiteSpace(line)
      cat(as.numeric(line), "\t", "\n", sep = "")
    }

Listing 11: Running simple R scripts on EMR - mapper script

Page 20: R belgium 20121116-awson-cloud-beamer

R on EMR using the API command (2/3)

Example

    #! /usr/bin/env Rscript
    # Strip leading and trailing spaces from a line
    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
    con <- file("stdin", open = "r")
    x <- c()
    # Collect every number received from the mapper
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      x <- c(x, as.numeric(trimWhiteSpace(line)))
    }
    # Emit the mean of all collected numbers
    cat(mean(x))

Listing 12: Running simple R scripts on EMR - reducer script

Page 21: R belgium 20121116-awson-cloud-beamer

How to debug (4/4)

Debugging

First debug your R code locally from the command line:

    cat input.txt | R CMD BATCH --slave --no-timing mapper.r out.txt;
    cat out.txt | R CMD BATCH --slave --no-timing reducer.r 2>&1

Listing 13: Debugging R code before using EMR
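The same dry run can also be sketched inside a single R session, which is sometimes quicker than piping files through the shell. This is an illustration only: the functions `mapper` and `reducer` below are hypothetical in-memory versions of the two scripts (they take a character vector instead of reading stdin), and the input is generated rather than read from numberList.txt.

```r
# Mapper logic: trim each input line and emit "<number>\t" (key, empty value).
mapper <- function(lines) {
  trimmed <- gsub("^\\s+|\\s+$", "", lines)
  paste0(as.numeric(trimmed), "\t")
}

# Reducer logic: parse the mapper output and return the mean.
reducer <- function(lines) {
  trimmed <- gsub("^\\s+|\\s+$", "", lines)
  mean(as.numeric(trimmed))
}

# Stand-in for numberList.txt: integers 1 to 100, one per line.
input <- as.character(1:100)

result <- reducer(mapper(input))
print(result)
```

With a single reducer this matches what the EMR job computes; with several reducers each one would report the mean of its own partition only.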

Page 22: R belgium 20121116-awson-cloud-beamer

Tips with EMR

Be careful with s3 versus s3n: use one or the other, but not both. For more information about the differences between s3 and s3n, see http://stackoverflow.com/questions/10569455/difference-between-amazon-s3-and-s3n-in-hadoop (accessed on Nov 6, 2012).

The first line of the script must correctly name the interpreter (such as #! /usr/bin/env Rscript for R or #!/usr/bin/python for Python). This is not necessary for a file that is only called by another one (e.g. when an R script calls an R function defined in another file, the function file does not need to start with #! /usr/bin/env Rscript).

The output directory must NOT exist before you launch your EMR job, otherwise the job will always FAIL. Use s3://yourProjects/project1 instead of s3://project1.

Page 23: R belgium 20121116-awson-cloud-beamer

Projects in RBelgium

http://www.heritagehealthprize.com/c/hhp

Text Mining using real “text” data extracted from the database systems of a project-partner

Page 25: R belgium 20121116-awson-cloud-beamer

RBelgium members (1/3)

Page 26: R belgium 20121116-awson-cloud-beamer

RBelgium members (2/3)

Example

    mygroup <- "RBelgium"
    # libraries for communicating with meetup API
    # (library() loads one package per call)
    library(RJSONIO)
    library(RCurl)
    # library for plotting
    library(ggplot2)
    # get member data from meetup.com
    # (mykey must already hold your Meetup API key)
    domain.url <- paste("https://api.meetup.com/2/members?key=", mykey,
                        "&sign=true&group_urlname=", mygroup, collapse = "", sep = "")
    domain.get <- getURL(domain.url)
    domain.data <- fromJSON(domain.get)
    # displaying names
    print(unlist(lapply(domain.data$results, function(x) x$name)))

Page 27: R belgium 20121116-awson-cloud-beamer

RBelgium members (3/3)

Example

    # plotting graph
    joins <- unlist(lapply(domain.data$results, function(x) x$joined))
    orderedJoins <- joins[order(joins)]
    # joined is in milliseconds since 1970-01-01; POSIXct expects seconds
    lab = as.POSIXct(orderedJoins / 1000, origin = "1970-01-01")
    df <- data.frame(
      x = lab,
      y = 1:length(domain.data$results)
    )
    png("memberJoined.png")
    # print() makes sure the plot is drawn even when the script is sourced
    print(ggplot(df) +
      geom_point(aes(x = x, y = y)) +
      xlab("Date") +
      ylab("#members"))
    dev.off()
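The timestamp handling in the plotting code can be tried without calling the Meetup API. The sketch below uses three made-up join timestamps (hypothetical values, in milliseconds since the Unix epoch, mimicking the `joined` field) and builds the same data frame of cumulative member counts; the time zone is fixed to UTC here so the result does not depend on the local machine.

```r
# Hypothetical join timestamps in milliseconds since the Unix epoch
# (values are made up for illustration).
joins <- c(1325376000000, 1293840000000, 1309478400000)

orderedJoins <- sort(joins)

# POSIXct counts seconds, so divide the milliseconds by 1000.
lab <- as.POSIXct(orderedJoins / 1000, origin = "1970-01-01", tz = "UTC")

# Cumulative member count: the k-th earliest join brings the total to k.
df <- data.frame(x = lab, y = seq_along(lab))
print(df)
```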

Page 28: R belgium 20121116-awson-cloud-beamer

RBelgium on internet

Website: http://www.meetup.com/RBelgium/ (68 members)

Website: http://www.rbelgium.be

Twitter: twitter.com/rbelgium (5 followers)

LinkedIn: http://www.linkedin.com/groups/RBelgium-4223869?gid=4223869&trk=hb_side_g (7 members)

Google group: http://groups.google.com/group/rbelgium, [email protected]

Page 29: R belgium 20121116-awson-cloud-beamer

Questions?
