Revolution R Enterprise 6.1
TRANSCRIPT
-
7/28/2019 Revolution r Enterprise 6.1
Revolution Confidential
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data
Presented by: Sue Ranney, VP Product Development
-
In today's webcast:
High Performance Analytics (HPA) with Revolution R Enterprise
Big Data Decision Trees
Revolution's HPA with Hadoop Data
Resources, Q&A
-
Revolution R Enterprise: What Gets Installed?
Latest stable version of Open-Source R
High performance math libraries
RevoScaleR package that adds:
  High performance big data capabilities to R
  Access to a variety of data sources (e.g., SAS, SPSS, text files, ODBC)
  Ability to compute in a variety of compute contexts (e.g., Windows/Linux workstation/server, Microsoft HPC Server cluster, Azure Burst, IBM Platform LSF cluster)
  High performance computing capabilities
Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE)
-
High Performance Analytics (HPA) in RevoScaleR
High Performance Computing + Data
Full-featured, fast, and scalable analysis functions
Same code works on small and big data, and a variety of data sources
Same code works on a variety of compute contexts - a laptop, server, cluster, or the cloud
Scales approximately linearly with the number of observations without increasing memory requirements
-
RevoScaleR: HPA Algorithms
Descriptive statistics (rxSummary)
Tables and cubes (rxCube, rxCrossTabs)
Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)
K means clustering (rxKmeans)
Linear regressions (rxLinMod)
Logistic regressions (rxLogit)
Generalized Linear Models (rxGlm)
Predictions (scoring) (rxPredict)
Decision Trees (rxDTree) NEW!
-
Decision Trees
Relatively easy-to-interpret models
Widely used in a variety of disciplines. For example,
  Predicting which patient characteristics are associated with high risk of, for example, heart attack.
  Deciding whether or not to offer a loan to an individual based on individual characteristics.
  Predicting the rate of return of various investment strategies
  Retail target marketing
Can handle multi-factor response easily
Useful in identifying important interactions
-
Decision Tree Types
Classification tree: predict what class or group an observation belongs in (dependent variable is a factor) for each terminal node or leaf
Regression tree: predict average value of dependent variable for each terminal node or leaf
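The difference between the two tree types comes down to how a leaf turns its observations into a prediction. The following is a minimal sketch in plain Python, not RevoScaleR code; the function names and the toy leaf contents are hypothetical and chosen only for illustration.

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """Classification tree: predict the most common class in the leaf."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """Regression tree: predict the average response in the leaf."""
    return mean(values)


# Hypothetical observations that ended up in one leaf.
leaf_classes = ["Phone", "Email", "Phone", "Phone", "Mail"]
leaf_values = [0, 0, 1, 0]  # a 0/1 indicator, so the mean is a proportion

print(classification_leaf(leaf_classes))
print(regression_leaf(leaf_values))
```

When the response of a regression tree is a 0/1 indicator, the leaf average can be read as the share of observations in that leaf with the value 1.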
-
Simple Example: Marketing Response
Data set containing the following information:
  Response: Was response to a phone call, email, or mailing?
  Age
  Income
  Marital status
  Attended college?
-
Simple Example: Specifying the model
treeOut
-
Simple Example: Basic Output
Information on the split, the number of observations in the node, the number that do not match the predicted y value, the predicted y value, and the y probabilities (* marks a terminal node)
1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
  2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
    4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
      8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
      9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
    5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
      10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
        20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
        21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
      11) marital=Married 1721 87 Email (0.02556653 0.94944800 0.02498547) *
  3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452)
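The numbers in each node line can be cross-checked against one another. The second count behaves as a misclassification count: at the root, 4069 of 10000 observations are not Email, which agrees with the Email probability of 0.5931. The sketch below, in plain Python, redoes that arithmetic for two nodes. The class order (Phone, Email, Mail) is an assumption inferred from which probability is largest in each node, and `node_summary` is a hypothetical helper, not part of any library.

```python
# Assumed order of the printed probabilities, inferred from the listing.
CLASSES = ("Phone", "Email", "Mail")


def node_summary(n, loss, yval, yprob, classes=CLASSES):
    """Relate a node's counts to the probability of its predicted class."""
    predicted_prob = yprob[classes.index(yval)]
    return {
        "matching": n - loss,        # observations equal to the prediction
        "error_rate": loss / n,      # share that differ from the prediction
        "predicted_prob": predicted_prob,
    }


root = node_summary(10000, 4069, "Email", (0.3326, 0.5931, 0.0743))
leaf8 = node_summary(2256, 77, "Phone", (0.96586879, 0.0, 0.03413121))

for name, node in (("root", root), ("node 8", leaf8)):
    print(name, node["matching"], round(node["error_rate"], 6),
          round(1 - node["predicted_prob"], 6))
```

In both nodes the error rate computed from the counts equals one minus the printed probability of the predicted class, up to rounding.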
-
Simple Example: Visual Representation
[Figure: tree diagram starting at Root, branching on No College, then on Age and Income, with leaves labeled Mail and Phone; the remaining node labels are not recoverable]
-
Scaling HPA with RevoScaleR
RevoScaleR functions can read from data sets on disk in chunks, so you can increase the number of observations in the data set beyond what can be analyzed in memory all at once
RevoScaleR analysis functions process chunks of data in parallel, taking greater advantage of your computing resources (Parallel External Memory Algorithms)
  Multiple cores on a desktop/server
  Cluster/grids have added advantage of more hard drives for storing & accessing data
    Windows HPC Server Cluster
    Burst computations to Azure in the cloud
    IBM Platform LSF Grid
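The chunk-at-a-time idea can be illustrated with a statistic whose per-chunk results are small and can be added together. The sketch below is plain Python, not RevoScaleR code. The in-memory list stands in for a file on disk, and all function names are hypothetical. It computes a mean and sample variance from per-chunk counts, sums, and sums of squares that worker threads produce independently.

```python
from concurrent.futures import ThreadPoolExecutor


def read_chunks(values, chunk_size):
    """Stand-in for reading a data set from disk one chunk at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def partial_stats(chunk):
    """Per-chunk summary: count, sum, and sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(a, b):
    """Summaries from different chunks simply add up."""
    return tuple(x + y for x, y in zip(a, b))


def chunked_mean_var(values, chunk_size=4, workers=2):
    total = (0, 0.0, 0.0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Note: pool.map submits every chunk up front; a real out-of-core
        # reader would limit how many chunks are in flight at once.
        for part in pool.map(partial_stats, read_chunks(values, chunk_size)):
            total = combine(total, part)
    n, s, ss = total
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance


m, v = chunked_mean_var(list(range(1, 11)), chunk_size=3)
print(m, v)
```

Only the three running totals are kept between chunks, so memory use does not grow with the number of observations. The sum-of-squares formula is used here for brevity; it is less numerically stable than an updating formula on large or badly scaled data.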
-
The Big Data Decision Tree Algorithm
Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data.
This sorting step becomes time and memory prohibitive when dealing with large data.
rxDTree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data
rxDTree partitions the data horizontally, processing in parallel different sets of observations
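The binning idea can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not rxDTree's actual implementation. It assumes equal-width bins over a known range, a numeric response, and a variance-reduction split criterion, and all names are hypothetical. Each chunk of rows produces a small histogram, the histograms are added together, and candidate cut points are evaluated only at bin boundaries, with no sorting.

```python
def bin_index(x, lo, hi, num_bins):
    """Map x in [lo, hi] to one of num_bins equal-width bins."""
    if x >= hi:
        return num_bins - 1
    return int((x - lo) / (hi - lo) * num_bins)


def chunk_histogram(chunk, lo, hi, num_bins):
    """Per-bin (count, sum of y) for one chunk of (x, y) rows."""
    counts, sums = [0] * num_bins, [0.0] * num_bins
    for x, y in chunk:
        b = bin_index(x, lo, hi, num_bins)
        counts[b] += 1
        sums[b] += y
    return counts, sums


def merge(h1, h2):
    """Histograms from different sets of observations add bin by bin."""
    return ([a + b for a, b in zip(h1[0], h2[0])],
            [a + b for a, b in zip(h1[1], h2[1])])


def best_split(hist, lo, hi):
    """Scan bin boundaries for the cut with the largest variance reduction."""
    counts, sums = hist
    num_bins = len(counts)
    n_total, s_total = sum(counts), sum(sums)
    best_cut, best_gain = None, 0.0
    n_left, s_left = 0, 0.0
    for b in range(num_bins - 1):
        n_left += counts[b]
        s_left += sums[b]
        n_right, s_right = n_total - n_left, s_total - s_left
        if n_left == 0 or n_right == 0:
            continue
        gain = (s_left ** 2 / n_left + s_right ** 2 / n_right
                - s_total ** 2 / n_total)
        if gain > best_gain:
            best_cut = lo + (b + 1) * (hi - lo) / num_bins
            best_gain = gain
    return best_cut, best_gain


# Toy data: the response is 1 for x >= 40, processed as two separate chunks.
rows = [(x, 1.0 if x >= 40 else 0.0) for x in range(5, 100, 10)]
hist = merge(chunk_histogram(rows[:5], 0, 100, 10),
             chunk_histogram(rows[5:], 0, 100, 10))
print(best_split(hist, 0, 100))
```

The cost of choosing a split depends on the number of bins rather than the number of observations, and the per-chunk histograms can be built in parallel because merging them is just addition.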
-
Useful rxDTree Arguments for Big Data
cp: complexity parameter. Increasing cp will decrease the number of splits attempted
maxDepth: the maximum depth of any tree node. The computations take much longer at greater depth, so lowering maxDepth can greatly speed up computation time.
maxNumBins: the maximum number of bins to use to cut numeric data. Decreasing maxNumBins will speed up computation time.
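A rough sense of why these three arguments affect run time can be had from some simple counting. The sketch below is plain Python with hypothetical helper names. It assumes a binary tree whose root is at depth 0, one candidate cut per bin boundary, and a cp rule in the style of the open-source rpart package, where a split is kept only if it improves the fit by at least cp relative to the root. The slides do not confirm that rxDTree applies cp in exactly this way.

```python
def max_leaves(max_depth):
    """Most terminal nodes a binary tree can have (root at depth 0)."""
    return 2 ** max_depth


def max_nodes(max_depth):
    """Most nodes in total, counting the root and all internal nodes."""
    return 2 ** (max_depth + 1) - 1


def candidate_cuts(num_bins):
    """Cut points available for one numeric variable after binning."""
    return num_bins - 1


def passes_cp(improvement, root_deviance, cp):
    """Assumed rpart-style rule: keep a split only if its improvement,
    relative to the root node's deviance, is at least cp."""
    return improvement / root_deviance >= cp


print(max_leaves(6), max_nodes(6))
print(candidate_cuts(101))
print(passes_cp(0.5, 1000.0, 1e-05), passes_cp(0.005, 1000.0, 1e-05))
```

Each extra level of depth can double the number of nodes to evaluate, and each node considers every candidate cut for every numeric variable, which is why lowering maxDepth or maxNumBins, or raising cp, reduces the work.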
-
Big Data Example
CDC Report in Jan. 2012
-
The U.S. Birth Data: 1985 - 2009
Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm
"These natality files are gigantic; they're approximately 3.1 GB uncompressed. That's a little larger than R can easily process" (Joseph Adler, R in a Nutshell)
I've imported key variables from each year into a single .xdf file with over 100 million observations.
-
Regression Tree: Multiple Births
Call:
rxDTree(formula = IsMultiple ~ DadAgeR8 + MAGER + FRACEREC + FHISP_REC + MRACEREC + MHISP_REC + DOB_YY,
    data = birthAllC, maxDepth = 6, cp = 1e-05, blocksPerRead = 10, verbose = 1)
File: C:\Revolution\Data\CDC\BirthUS.xdf
Number of valid observations: 100672041
Number of missing observations: 0
-
Leaves with Lowest Percent of Multiple Births
Mom is not black and under the age of 20
1.3%
Mom is Asian or Pacific Islander (and not Hispanic) and is between 22 and 28 years of age. The birth is before 1997
1.6%
Mom is black and under the age of 18
1.7%
-
Leaves with Highest Percent of Multiple Births
Mom is over 47 years old and the birth is after 1996
38.6%
Mom is white, non-Hispanic, is between 45 and 47 years old, and the birth is after 1996
28.1%
Mom is Hispanic, is between 45 and 47 years old, and the birth is after 1996
15.5%
-
Poll Question
Are you using Hadoop?
-
RevoScaleR with Hadoop Data Files NEW
The Hadoop Distributed File System (HDFS)
  is highly fault-tolerant and
  is designed to be deployed on low-cost hardware.
RevoScaleR supports accessing data in the HDFS file system for import or for direct analysis
-
RevoScaleR Data Sources
Data Sources can be used for import or directly for analysis
  External: delimited text, fixed format text, SAS, SPSS, ODBC connections
  Provided with RevoScaleR: efficient .xdf file format
Data Sources contain information about their file system
  Delimited text and .xdf data sources can both be used with the HDFS file system
Data sources are used as input to HPA functions
-
An Example Using Hadoop Data
Hadoop cluster in our office
  Five nodes of commodity hardware
  Red Hat Enterprise Linux (RHEL) operating system
  Cloudera's Hadoop (CDH3)
  Also has IBM Platform LSF workload management system installed (not required to use HDFS data)
My colleague, Dawn Kinsey, recorded a data analysis session
  22 comma delimited files stored in HDFS
  Contain information on U.S. flight arrivals, 1997-2008
-
Steps in Analysis
Set up a file system object and a data source object
Explore the HDFS airline data for the year 2000 directly
Extract variables of interest from all the files into an .xdf file in the native file system
Use R's great plotting capabilities on summary information
Perform a big logistic regression on an .xdf file stored in HDFS
-
Poll Question
What features of Revolution R Enterprise 6.1 are most interesting to you?
-
Thank You!
Download slides, replay from today's webinar: http://bit.ly/QJfR4A
Learn more about Revolution R Enterprise
  Overview: revolutionanalytics.com/products
  New feature videos: http://www.revolutionanalytics.com/products/new-features.php
Contact Revolution Analytics: http://bit.ly/hey-revo
November 29: Real-Time Big Data Analytics: from Deployment to Production
  David Smith, VP Marketing and Community, Revolution Analytics
  www.revolutionanalytics.com/news-events/free-webinars
-
The leading commercial provider of software and support for the popular open source R statistics language.
www.revolutionanalytics.com
+1 (650) 646 9545
Twitter: @RevolutionR