r belgium 20121116-awson-cloud-beamer
TRANSCRIPT
Getting started on Amazon cloudSome concrete applications using Hadoop
About RBelgium
R on Amazon cloud
Jean-Baptiste Poullet (RBelgium Founder and Co-Organizer)
2012
Outline
1 Getting started on Amazon cloud
2 Some concrete applications using Hadoop
3 About RBelgium
Basics on AWS
Register for an AWS account with EC2 and S3 (http://aws.amazon.com/)
Account Number, Access Key ID, Secret Access Key, X.509 Certificate
S3, EC2, EMR, ...
Lost, or want more information?
http://aws.amazon.com/documentation/gettingstarted/
http://www.bucketexplorer.com/documentation/amazon-s3--what-is-my-aws-access-and-secret-key.html
http://www.yusufhm.info/content/adding-x509-certificate-aws-iam-user-api-command-line-tools-0
...
Why AWS?
Simple to use: just start up an instance from an AMI
Elastic: Auto Scaling groups (RAM, CPU) + load balancing (I/O) + Elastic IPs
On demand: anytime, whatever you want (limited to 20 EC2 instances unless you request more); normal, spot, reserved and EBS-optimized instances (see http://aws.amazon.com/ec2/)
Which AMI(s)? (1/2)
Bioconductor on Amazon cloud: http://bioconductor.org/help/bioconductor-cloud-ami/
MPI cluster on Amazon:
Example
    library(Rmpi)
    # start one R slave per available MPI slot
    mpi.spawn.Rslaves()
    # apply the function in parallel across the slaves
    mpi.parLapply(1:mpi.universe.size(), function(x) x + 1)
    # shut the slaves down and leave MPI cleanly
    mpi.close.Rslaves()
    mpi.quit()
Listing 1: ’Rmpi’ on EC2
Which AMI(s)? (2/2)
Parallel cluster on Amazon:
Example
    library(parallel)
    # one worker per EC2 instance, addressed by private IP
    cl <- makePSOCKcluster(c('10.68.155.30', '10.68.155.45', '10.68.155.65'))
    # run myfunc with the same arguments on every worker
    clusterCall(cl, myfunc, arg1, arg2, ...)
Listing 2: 'parallel' on EC2
Hadoop cluster on Amazon with RHadoop: https://github.com/RevolutionAnalytics/RHadoop/tree/master/rmr2/pkg/tools
Storm cluster on Amazon: https://github.com/nathanmarz/storm-deploy
SAP Hana (http://aws.amazon.com/sap/), Oracle R Enterprise (Hadoop for batch + NoSQL for real-time), etc.
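Before starting EC2 instances, the 'parallel' listing can be rehearsed on a single machine. The sketch below is only a stand-in under stated assumptions: it launches two workers on localhost instead of the EC2 private IP addresses, and myfunc with its two arguments is a hypothetical placeholder for the real workload.

```r
library(parallel)

# Assumption: two local workers replace the EC2 hosts from the listing.
cl <- makePSOCKcluster(2)

# Hypothetical function standing in for myfunc(arg1, arg2, ...).
myfunc <- function(a, b) a + b

# clusterCall runs the same call once on every worker.
res_call <- clusterCall(cl, myfunc, 1, 2)

# parLapply splits the input across the workers.
res_lapply <- parLapply(cl, 1:4, function(x) x + 1)

# Always release the workers when done.
stopCluster(cl)
```

Once this behaves as expected locally, replacing the argument of makePSOCKcluster with a vector of host addresses should be the only change needed, provided the workers are reachable over SSH and have R installed.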
Using rmr2 in Hadoop framework (1/4)
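Toy case
Solve X β = y in the least-squares sense:
solve(t(X) %*% X, t(X) %*% y)
Example
    library(rmr2)
    # X: 200 x 10 random matrix, written to the distributed file system
    X = to.dfs(matrix(rnorm(2000), ncol = 10))
    # y: 200 x 1 random vector, kept in local memory
    y = as.matrix(rnorm(200))
Listing 3: initializing variables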
Using rmr2 in Hadoop framework (2/4)
Example
    # t(X) %*% X computed chunk by chunk with mapreduce
    tXX =
      values(
        from.dfs(
          mapreduce(
            input = X,
            map = function(k, Xi) keyval(1, list(t(Xi) %*% Xi)),
            # reduce = reducerFunction,
            combine = TRUE)))[[1]]
Listing 7: 'rmr2' matrix multiplication
Using rmr2 in Hadoop framework (3/4)
Example
    # t(X) %*% y computed with mapreduce
    tXy =
      values(
        from.dfs(
          mapreduce(
            input = X,
            map = function(k, Xi)
              keyval(1, list(t(Xi) %*% y)),
            combine = TRUE)))[[1]]
    # solve the normal equations for beta
    solve(tXX, tXy)
Listing 8: 'rmr2' solving
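The reason this toy case fits MapReduce is that t(X) %*% X and t(X) %*% y are both sums of per-chunk products over blocks of rows, so each chunk can be handled independently and the partial results added afterwards. The following base R sketch checks that numerically without Hadoop or rmr2. The chunking into four blocks of 50 rows and all variable names are assumptions made for illustration, and here y is split along the same rows as X.

```r
set.seed(1)
X <- matrix(rnorm(2000), ncol = 10)
y <- as.matrix(rnorm(200))

# Assumption: four row blocks of 50 rows stand in for the chunks
# that Hadoop would hand to each map call.
chunks <- split(seq_len(nrow(X)), rep(1:4, each = 50))

# "Map" step: one partial product per chunk.
partial_XtX <- lapply(chunks, function(i) {
  Xi <- X[i, , drop = FALSE]
  t(Xi) %*% Xi
})
partial_Xty <- lapply(chunks, function(i) {
  t(X[i, , drop = FALSE]) %*% y[i, , drop = FALSE]
})

# "Reduce" step: sum the partial products.
XtX <- Reduce(`+`, partial_XtX)
Xty <- Reduce(`+`, partial_Xty)

beta_chunked <- solve(XtX, Xty)
beta_direct  <- solve(t(X) %*% X, t(X) %*% y)
```

The two estimates agree up to floating-point rounding, which is what makes a summing reducer sufficient for this problem.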
How to debug (4/4)
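Debugging
Call rmr.str(varName) inside a map or reduce function to print the structure of a variable to the job's standard error.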
R on EMR with segue package
Example
    library(segue)
    # replace with your own AWS credentials
    setCredentials("accessKey", "secretAccessKey")
    # start a one-instance EMR cluster
    myCluster <- createCluster(numInstances = 1,
                               masterInstanceType = "m1.small",
                               slaveInstanceType = "m1.small",
                               location = "us-east-1a")
    # lapply-style call executed on the EMR cluster
    ResultList <- emrlapply(myCluster, dataList, myfunc)
    # stop the cluster to avoid further charges
    stopCluster(myCluster)
Listing 9: R on EMR with 'segue'
R on EMR using the API command (1/3)
Upload the numberList file (integers from 1 to 100, one integer per line) and the following R scripts, "mapper.r" and "reducer.r", to your AWS S3 bucket
Run the following command in bash:
Example
    ./elastic-mapreduce --create --stream \
      --input s3://yourbucket/numberList.txt \
      --mapper s3://yourbucket/mapper.r \
      --reducer s3://yourbucket/reducer.r \
      --output s3://emroutr1vv/myresults \
      --name EMRexampleR1 --num-instances 1
Listing 10: Running R on EMR
R on EMR using the API command (2/3)
Example
    #! /usr/bin/env Rscript
    # strip leading and trailing spaces
    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
    con <- file("stdin", open = "r")
    # read stdin line by line and emit each number followed by a tab
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      line <- trimWhiteSpace(line)
      cat(as.numeric(line), "\t", "\n", sep = "")
    }
Listing 11: Running simple R scripts on EMR - mapper script
R on EMR using the API command (3/3)
Example
    #! /usr/bin/env Rscript
    # strip leading and trailing spaces
    trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
    con <- file("stdin", open = "r")
    x <- c()
    # collect every number received from the mapper
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      x <- c(x, as.numeric(trimWhiteSpace(line)))
    }
    # emit the mean of all numbers
    cat(mean(x))
Listing 12: Running simple R scripts on EMR - reducer script
How to debug (4/4)
Debugging
First debug your R code locally from the command line:
    cat input.txt | R CMD BATCH --slave --no-timing mapper.r out.txt;
    cat out.txt | R CMD BATCH --slave --no-timing reducer.r 2>&1
Listing 13: Debugging R code before using EMR
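The mapper and reducer logic can also be checked inside an interactive R session before any script is run from the shell. The sketch below is an assumption-based rewrite of the two scripts as plain functions on character vectors (mapper, reducer and trim_white_space are hypothetical names); unlike the slide version, the trimming here also removes tabs.

```r
# Strip leading and trailing whitespace, including the tab the mapper emits.
trim_white_space <- function(line) {
  gsub("(^[[:space:]]+)|([[:space:]]+$)", "", line)
}

# Same logic as mapper.r: one "number<TAB>" record per input line.
mapper <- function(lines) {
  paste0(as.numeric(trim_white_space(lines)), "\t")
}

# Same logic as reducer.r: mean of all numbers received.
reducer <- function(lines) {
  mean(as.numeric(trim_white_space(lines)))
}

# numberList as described above: integers from 1 to 100, one per line.
input  <- as.character(1:100)
mapped <- mapper(input)
result <- reducer(mapped)
print(result)
```

For the numberList input the reducer should return 50.5, the mean of the integers from 1 to 100, which gives a reference value to compare against the EMR output.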
Tips with EMR
Be careful with s3 versus s3n: use one or the other, but not both. For more information about the differences between s3 and s3n, see http://stackoverflow.com/questions/10569455/difference-between-amazon-s3-and-s3n-in-hadoop (accessed on Nov 6, 2012).
The first line of the file must correctly name the interpreter (such as #! /usr/bin/env Rscript for R or #!/usr/bin/python for Python). This is not necessary for a file that is only called by another one (e.g. when an R script calls an R function from another file, the function file does not need to start with #! /usr/bin/env Rscript).
The output directory must NOT exist before you launch your EMR job, otherwise the job will always FAIL. Use s3://yourProjects/project1 instead of s3://project1.
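The shebang tip can be checked before a script is uploaded to S3. A minimal sketch in base R follows; has_shebang is a hypothetical helper written for this illustration, not part of any EMR tooling, and it only tests that the first line starts with #!, not that the interpreter path is valid.

```r
# Returns TRUE when the first line of the file starts with "#!".
has_shebang <- function(path) {
  first <- readLines(path, n = 1, warn = FALSE)
  length(first) == 1 && grepl("^#!", first)
}

# A script with the interpreter line recommended above.
script <- tempfile(fileext = ".r")
writeLines(c("#! /usr/bin/env Rscript", "cat(1 + 1)"), script)
ok <- has_shebang(script)

# A script without it, as would be acceptable only for a sourced file.
plain <- tempfile(fileext = ".r")
writeLines("cat(1 + 1)", plain)
missing_shebang <- !has_shebang(plain)
```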
Projects in RBelgium
http://www.heritagehealthprize.com/c/hhp
Text mining using real "text" data extracted from the database systems of a project partner
RBelgium members (1/3)
RBelgium members (2/3)
Example
    mygroup <- "RBelgium"
    # libraries for communicating with meetup API
    library(RJSONIO)
    library(RCurl)
    # library for plotting
    library(ggplot2)
    # get member data from meetup.com (mykey must hold your Meetup API key)
    domain.url <- paste("https://api.meetup.com/2/members?key=", mykey,
                        "&sign=true&group_urlname=RBelgium",
                        collapse = "", sep = "")
    domain.get <- getURL(domain.url)
    domain.data <- fromJSON(domain.get)
    # displaying names
    print(unlist(lapply(domain.data$results, function(x) x$name)))
RBelgium members (3/3)
Example
    # plotting graph
    joins <- unlist(lapply(domain.data$results, function(x) x$joined))
    orderedJoins <- joins[order(joins)]
    # join dates are in milliseconds since 1970-01-01
    lab = as.POSIXct(orderedJoins / 1000, origin = "1970-01-01")
    df <- data.frame(
      x = lab,
      y = 1:length(domain.data$results)
    )
    png("memberJoined.png")
    ggplot(df) +
      geom_point(aes(x = x, y = y)) +
      xlab("Date") +
      ylab("#members")
    dev.off()
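The data preparation behind the plot can be tried without an API key. In the sketch below, the results list is made-up illustration data that only mimics the shape of domain.data$results (a list of records with name and joined fields, joined being milliseconds since 1970-01-01). The member names and timestamps are assumptions, and the time zone is fixed to UTC to keep the output reproducible.

```r
# Hypothetical records shaped like the Meetup API results used above.
results <- list(
  list(name = "member A", joined = 1325376000000),
  list(name = "member B", joined = 1293840000000),
  list(name = "member C", joined = 1341100800000)
)

# Extract the join timestamps and sort them chronologically.
joins <- sort(vapply(results, function(x) x$joined, numeric(1)))

# Convert milliseconds to POSIXct and pair each date with the
# running member count.
df <- data.frame(
  x = as.POSIXct(joins / 1000, origin = "1970-01-01", tz = "UTC"),
  y = seq_along(joins)
)
print(df)
```

The resulting data frame has the same x and y columns as the one passed to ggplot in the listing, so the plotting code can be reused unchanged.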
RBelgium on internet
Website: http://www.meetup.com/RBelgium/ (68 members)
Website: http://www.rbelgium.be
Twitter: twitter.com/rbelgium (5 followers)
LinkedIn: http://www.linkedin.com/groups/RBelgium-4223869?gid=4223869&trk=hb_side_g (7 members)
Google group: http://groups.google.com/group/rbelgium, [email protected]
Questions?