Revolution R Enterprise 6.1

Upload: erotemethinks8580

Post on 03-Apr-2018


  • 7/28/2019 Revolution r Enterprise 6.1

    Revolution Confidential

    New Advances in High Performance Analytics with R: 'Big Data' Decision Trees and Analysis of Hadoop Data

    Presented by: Sue Ranney, VP Product Development

    In today's webcast:

    High Performance Analytics (HPA) with Revolution R Enterprise

    Big Data Decision Trees

    Revolution's HPA with Hadoop Data

    Resources, Q&A

    Revolution R Enterprise: What Gets Installed?

    Latest stable version of Open-Source R

    High performance math libraries

    RevoScaleR package that adds:

      High performance big data capabilities to R

      Access to a variety of data sources (e.g., SAS, SPSS, text files, ODBC)

      Ability to compute in a variety of compute contexts (e.g., Windows/Linux workstation/server, Microsoft HPC Server cluster, Azure Burst, IBM Platform LSF cluster)

      High performance computing capabilities

    Integrated Development Environment based on Visual Studio technology (for Windows): the R Productivity Environment (RPE)

    High Performance Analytics (HPA) in RevoScaleR

    High Performance Computing + Data

    Full-featured, fast, and scalable analysis functions

    Same code works on small and big data, and a variety of data sources

    Same code works on a variety of compute contexts - a laptop, server, cluster, or the cloud

    Scales approximately linearly with the number of observations without increasing memory requirements

    RevoScaleR: HPA Algorithms

    Descriptive statistics (rxSummary)

    Tables and cubes (rxCube, rxCrossTabs)

    Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)

    K-means clustering (rxKmeans)

    Linear regressions (rxLinMod)

    Logistic regressions (rxLogit)

    Generalized Linear Models (rxGlm)

    Predictions (scoring) (rxPredict)

    Decision Trees (rxDTree) NEW!

    Decision Trees

    Relatively easy-to-interpret models

    Widely used in a variety of disciplines. For example,

      Predicting which patient characteristics are associated with high risk of, for example, heart attack.

      Deciding whether or not to offer a loan to an individual based on individual characteristics.

      Predicting the rate of return of various investment strategies

      Retail target marketing

    Can handle multi-factor response easily

    Useful in identifying important interactions

    Decision Tree Types

    Classification tree: predict what class or group an observation belongs in (dependent variable is a factor) for each terminal node or leaf

    Regression tree: predict average value of dependent variable for each terminal node or leaf
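The difference between the two tree types can be sketched in base R without fitting any tree. This is only an illustration under stated assumptions: the toy data frame and its column names (`response`, `spend`) are made up and stand for the observations that happen to land in a single leaf.

```r
# Hypothetical observations that ended up in the same terminal node (leaf).
leaf <- data.frame(
  response = factor(c("Phone", "Email", "Phone", "Mail", "Phone")),
  spend    = c(12.0, 30.5, 18.0, 7.5, 22.0)
)

# Classification tree: the leaf predicts the most frequent class.
classCounts <- table(leaf$response)
classPrediction <- names(classCounts)[which.max(classCounts)]

# The class proportions play the role of the per-leaf class probabilities.
classProbs <- prop.table(classCounts)

# Regression tree: the leaf predicts the average of the dependent variable.
regressionPrediction <- mean(leaf$spend)

print(classPrediction)
print(round(classProbs, 2))
print(regressionPrediction)
```

In a fitted tree the same two summaries are computed for every leaf, using whichever observations the splits route there.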

    Simple Example: Marketing Response

    Data set containing the following information:

      Response: Was response to a phone call, email, or mailing?

      Age

      Income

      Marital status

      Attended college?

    Simple Example: Specifying the model

    treeOut

    Simple Example: Basic Output

    Information on the split, the number of observations in the node, the number that do not match the y value, the predicted y value, and the y probabilities

    1) root 10000 4069 Email (0.33260000 0.59310000 0.07430000)
      2) college=No College 5074 2378 Phone (0.53133622 0.38943634 0.07922743)
        4) age>=39.5 2518 330 Phone (0.86894361 0.00000000 0.13105639)
          8) age< 64.5 2256 77 Phone (0.96586879 0.00000000 0.03413121) *
          9) age>=64.5 262 9 Mail (0.03435115 0.00000000 0.96564885) *
        5) age< 39.5 2556 580 Email (0.19874804 0.77308294 0.02816901)
          10) marital=Single 835 371 Phone (0.55568862 0.40958084 0.03473054)
            20) income>=29.5 472 14 Phone (0.97033898 0.00000000 0.02966102) *
            21) income< 29.5 363 21 Email (0.01652893 0.94214876 0.04132231) *
          11) marital=Married 1721 87 Email (0.02556653 0.94944800 0.02498547) *
      3) college=College 4926 971 Email (0.12789281 0.80288266 0.06922452)
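A quick arithmetic check helps when reading these node lines. The sketch below copies three nodes from the output and computes one minus the second count divided by the first; the result should agree with the probability printed for the node's predicted class. The data frame and its column names (`n`, `loss`, `yval`, `matchShare`) are labels chosen here for illustration, and the probability ordering (Phone, Email, Mail) is inferred from these matches, not stated on the slide.

```r
# Node lines copied from the printed tree: n = observations in the node,
# loss = observations that do NOT match the predicted y value.
nodes <- data.frame(
  node = c(1, 2, 9),
  n    = c(10000, 5074, 262),
  loss = c(4069, 2378, 9),
  yval = c("Email", "Phone", "Mail"),
  stringsAsFactors = FALSE
)

# Share of observations in each node that match the predicted class.
nodes$matchShare <- 1 - nodes$loss / nodes$n

print(nodes)
```

For the root, for example, 1 - 4069/10000 = 0.5931, which is the second of the three printed probabilities and corresponds to Email.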

    Simple Example: Visual Representation

    [Figure: tree diagram of the marketing response model, starting at Root and branching on college attendance (No College), age, and income, with leaves labeled by response type such as Phone and Mail]

    Scaling HPA with RevoScaleR

    RevoScaleR functions can read from data sets on disk in chunks, so you can increase the number of observations in the data set beyond what can be analyzed in memory all at once

    RevoScaleR analysis functions process chunks of data in parallel, taking greater advantage of your computing resources (Parallel External Memory Algorithms)

      Multiple cores on a desktop/server

      Cluster/grids have added advantage of more hard drives for storing & accessing data

        Windows HPC Server Cluster

        Burst computations to Azure in the cloud

        IBM Platform LSF Grid
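The chunk-at-a-time idea can be sketched in base R. This is not RevoScaleR code: the in-memory vector `x` is a stand-in for a data set on disk, and `chunkSize`, `partials`, and `totals` are names introduced here for illustration. Each chunk is reduced to a small intermediate result, and the intermediate results are then combined.

```r
# Stand-in for a data set on disk: 1,000 values we pretend cannot all be
# loaded at once.
set.seed(42)
x <- rnorm(1000, mean = 50, sd = 10)

chunkSize <- 100
starts <- seq(1, length(x), by = chunkSize)

# Step 1: reduce each chunk to a small intermediate result (sum and count).
# lapply() runs sequentially here; because the chunks are independent, the
# same step could be handed to parallel workers.
partials <- lapply(starts, function(s) {
  chunk <- x[s:min(s + chunkSize - 1, length(x))]
  c(sum = sum(chunk), n = length(chunk))
})

# Step 2: combine the intermediate results into the final statistic.
totals <- Reduce(`+`, partials)
chunkedMean <- totals[["sum"]] / totals[["n"]]

print(chunkedMean)
print(all.equal(chunkedMean, mean(x)))
```

Memory use depends on the chunk size and the size of the intermediate results, not on the total number of observations, which is why the number of rows can grow without increasing memory requirements.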

    The Big Data Decision Tree Algorithm

    Classical algorithms for building a decision tree sort all continuous variables in order to decide where to split the data.

    This sorting step becomes time and memory prohibitive when dealing with large data.

    rxDTree bins the data rather than sorting, computing histograms to create empirical distribution functions of the data

    rxDTree partitions the data horizontally, processing in parallel different sets of observations
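To make binning instead of sorting concrete, here is a generic histogram-based split search for one numeric predictor and a numeric response, written in base R. It is a sketch under stated assumptions and not rxDTree's actual implementation: the simulated data, the one-year bin width, the chunk size, and the helper names (`binTotals`, `sse`) are all made up for illustration.

```r
set.seed(1)
n <- 10000
age <- runif(n, 18, 80)
# Hypothetical response: jumps from about 1 to about 3 once age passes 40.
y <- ifelse(age >= 40, 3, 1) + rnorm(n, sd = 0.5)

# Fixed bin edges chosen up front (62 bins, each one year wide).
breaks <- seq(18, 80, by = 1)
nBins <- length(breaks) - 1

# Per-bin total of v, returned for every bin, including empty ones.
binTotals <- function(v, bin) {
  vapply(seq_len(nBins), function(b) sum(v[bin == b]), numeric(1))
}

# Accumulate per-bin count, sum, and sum of squares one chunk at a time.
# Only these short vectors are kept, never a sorted copy of the data.
cnt <- numeric(nBins); s1 <- numeric(nBins); s2 <- numeric(nBins)
chunkId <- ceiling(seq_len(n) / 1000)
for (k in unique(chunkId)) {
  rows <- which(chunkId == k)
  bin <- cut(age[rows], breaks, include.lowest = TRUE, labels = FALSE)
  cnt <- cnt + tabulate(bin, nbins = nBins)
  s1 <- s1 + binTotals(y[rows], bin)
  s2 <- s2 + binTotals(y[rows]^2, bin)
}

# Candidate splits are the interior bin edges. The sum of squared errors
# (SSE) on each side of a candidate follows from cumulative bin totals.
sse <- function(n, s, ss) ss - s^2 / n
leftN  <- cumsum(cnt)[-nBins]
leftS  <- cumsum(s1)[-nBins]
leftSS <- cumsum(s2)[-nBins]
totalSSE <- sse(leftN, leftS, leftSS) +
  sse(sum(cnt) - leftN, sum(s1) - leftS, sum(s2) - leftSS)

candidates <- breaks[-c(1, length(breaks))]
bestSplit <- candidates[which.min(totalSSE)]
print(bestSplit)
```

Because each chunk contributes only additive per-bin totals, different chunks could be processed by different workers and their totals summed afterwards. The cost is that splits can only fall on bin edges, which is the trade-off the number of bins controls.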

    Useful rxDTree Arguments for Big Data

    cp: complexity parameter. Increasing cp will decrease the number of splits attempted

    maxDepth: the maximum depth of any tree node. The computations take much longer at greater depth, so lowering maxDepth can greatly speed up computation time.

    maxNumBins: the maximum number of bins to use to cut numeric data. Decreasing maxNumBins will speed up computation time.

    Big Data Example

    CDC Report in Jan. 2012

    The U.S. Birth Data: 1985 - 2009

    Public-use data sets containing information on all births in the United States for each year from 1985 to 2009 are available to download: http://www.cdc.gov/nchs/data_access/Vitalstatsonline.htm

    "These natality files are gigantic; they're approximately 3.1 GB uncompressed. That's a little larger than R can easily process" - Joseph Adler, R in a Nutshell

    I've imported key variables from each year into a single .xdf file with over 100 million observations.

    Regression Tree: Multiple Births

    Call:
    rxDTree(formula = IsMultiple ~ DadAgeR8 + MAGER + FRACEREC + FHISP_REC + MRACEREC + MHISP_REC + DOB_YY,
        data = birthAllC, maxDepth = 6, cp = 1e-05, blocksPerRead = 10, verbose = 1)

    File: C:\Revolution\Data\CDC\BirthUS.xdf

    Number of valid observations: 100672041

    Number of missing observations: 0

    Leaves with Lowest Percent of Multiple Births

    Mom is not black and under the age of 20: 1.3%

    Mom is Asian or Pacific Islander (and not Hispanic) and is between 22 and 28 years of age. The birth is before 1997: 1.6%

    Mom is black and under the age of 18: 1.7%

    Leaves with Highest Percent of Multiple Births

    Mom is over 47 years old and the birth is after 1996: 38.6%

    Mom is white, non-Hispanic, is between 45 and 47 years old, and the birth is after 1996: 28.1%

    Mom is Hispanic, is between 45 and 47 years old, and the birth is after 1996: 15.5%

    Poll Question

    Are you using Hadoop?

    RevoScaleR with Hadoop Data Files (NEW)

    The Hadoop Distributed File System (HDFS)

      is highly fault-tolerant and

      is designed to be deployed on low-cost hardware.

    RevoScaleR supports accessing data in the HDFS file system for import or for direct analysis

    RevoScaleR Data Sources

    Data Sources can be used for import or directly for analysis

      External: delimited text, fixed format text, SAS, SPSS, ODBC connections

      Provided with RevoScaleR: efficient .xdf file format

    Data Sources contain information about their file system

      Delimited text and .xdf data sources can both be used with the HDFS file system

    Data sources are used as input to HPA functions

    An Example Using Hadoop Data

    Hadoop cluster in our office

      Five nodes of commodity hardware

      Red Hat Enterprise Linux (RHEL) operating system

      Cloudera's Hadoop (CDH3)

      Also has IBM Platform LSF workload management system installed (not required to use HDFS data)

    My colleague, Dawn Kinsey, recorded a data analysis session

      22 comma delimited files stored in HDFS

      Contain information on U.S. flight arrivals, 1997-2008

    Steps in Analysis

    Set up a file system object and a data source object

    Explore the HDFS airline data for the year 2000 directly

    Extract variables of interest from all the files into an .xdf file in the native file system

    Use R's great plotting capabilities on summary information

    Perform a big logistic regression on an .xdf file stored in HDFS

    Poll Question

    What features of Revolution R Enterprise 6.1 are most interesting to you?

    Thank You!

    Download slides, replay from today's webinar: http://bit.ly/QJfR4A

    Learn more about Revolution R Enterprise

      Overview: revolutionanalytics.com/products

      New feature videos: http://www.revolutionanalytics.com/products/new-features.php

    Contact Revolution Analytics: http://bit.ly/hey-revo

    November 29: Real-Time Big Data Analytics: from Deployment to Production

      David Smith, VP Marketing and Community, Revolution Analytics

      www.revolutionanalytics.com/news-events/free-webinars


    The leading commercial provider of software and support for the popular open source R statistics language.

    www.revolutionanalytics.com

    +1 (650) 646 9545

    Twitter: @RevolutionR