Upload: jorge-martinez-de-salinas

Post on 07-Aug-2015

1

End-to-end Machine Learning Pipelines with HP Vertica and Distributed R

Jorge Martinez May 20th, 2015

2

About me

FPGAs, Barcelona, 2009

Embedded software, GPUs, Barcelona, 2011

Distributed systems and ML, SF, 2013

@jorgemarsal, http://jorgemarsal.github.io

3

The data explosion

4

Horizontal scaling

The shift from BI to Data Science

Happens!

https://www.youtube.com/watch?v=vbb-AjiXyh0

5

Predictive analytics applications

Marketing

Sales

Logistics

Risk

Customer support

Human resources

Healthcare

Consumer financial

Retail

Insurance

Life sciences

Travel

6

Predictive analytics workflow

Build Models

Evaluate Models

Deploy Models (In-DB or Web)

BI Integration

1 Ingest and prepare data by leveraging HP Vertica Analytics Platform (SQL DB)

2 Build and evaluate predictive models on large datasets using Distributed R

3 Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications. Alternatively, deploy the model as a web service.

© Copyright 2014 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Training predictive models

9

R is …

“The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” - Bo Cowgill, Google

10

R is …

• Popular
• Open source
• Flexible
• Extensible

• Not scalable
• No parallel algorithms
• Limited pre/post processing

11

Horizontal scaling

Functional programming and big data

Scale-out

12

Horizontal scaling

“The future has arrived, it’s just not evenly distributed yet” - William Gibson

Ship code to data, Functional Programming

13

Distributed R

The Next Generation Platform for Predictive Analytics

14

Distributed R

A New Enterprise-class predictive analytics platform

A scalable, high-performance platform for the R language
• Implemented as an R package
• Open source

Use familiar GUIs and packages

Analyze data too large for vanilla R

Leverage multiple nodes for distributed processing

Vastly improved performance

15

Distributed R: architecture

Master
• Schedules tasks across the cluster
• Sends commands/code to workers

Workers
• Hold data partitions
• Apply functions to data partitions in parallel
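The master/worker split above can be illustrated with a small Python sketch. This is not Distributed R code: the `Master` and `Worker` classes, the round-robin placement, and the use of threads in place of cluster nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Holds data partitions and applies a function to each one it holds."""

    def __init__(self, name):
        self.name = name
        self.partitions = {}  # partition id -> data held by this worker

    def run(self, func):
        return {pid: func(part) for pid, part in self.partitions.items()}


class Master:
    """Places partitions on workers and sends them the code to run."""

    def __init__(self, workers):
        self.workers = workers

    def distribute(self, partitions):
        # Assumed placement policy: round-robin over the workers.
        for pid, part in enumerate(partitions):
            self.workers[pid % len(self.workers)].partitions[pid] = part

    def execute(self, func):
        # One thread per worker stands in for one node of the cluster.
        with ThreadPoolExecutor(max_workers=len(self.workers)) as pool:
            per_worker = list(pool.map(lambda w: w.run(func), self.workers))
        results = {}
        for worker_result in per_worker:
            results.update(worker_result)
        return [results[pid] for pid in sorted(results)]


master = Master([Worker("w0"), Worker("w1")])
master.distribute([[1, 2], [3, 4], [5, 6]])
sums = master.execute(sum)  # one partial sum per partition, in partition order
```

Here the master never touches the data values itself; it only decides where partitions live and which function each worker runs.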

16

Distributed R: Distributed data structures

darray

• Relies on user-defined partitioning
• Also supports distributed data frames and lists
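To make user-defined partitioning concrete, here is a minimal Python sketch of a row-partitioned array. The `PartitionedArray` class and its methods are hypothetical and only mimic the idea of a distributed array; they are not the darray API.

```python
class PartitionedArray:
    """A toy row-partitioned 2-D array; the caller chooses rows per partition."""

    def __init__(self, n_rows, n_cols, rows_per_partition, fill=0.0):
        if rows_per_partition <= 0:
            raise ValueError("rows_per_partition must be positive")
        self.shape = (n_rows, n_cols)
        self.partitions = []
        for start in range(0, n_rows, rows_per_partition):
            # The last partition may be smaller than the others.
            rows = min(rows_per_partition, n_rows - start)
            self.partitions.append([[fill] * n_cols for _ in range(rows)])

    def n_partitions(self):
        return len(self.partitions)

    def gather(self):
        """Reassemble every partition into a single list of rows."""
        return [row for part in self.partitions for row in part]


arr = PartitionedArray(n_rows=10, n_cols=3, rows_per_partition=4)
partition_sizes = [len(part) for part in arr.partitions]
```

With 10 rows and 4 rows per partition, the array is cut into three pieces, which in a real cluster could live on different workers.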

17

Distributed R: Distributed code

foreach

• Express computations over partitions
• Execute across the cluster

f (x)
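The idea of expressing a computation f(x) over partitions can be sketched in Python as follows. The helper `foreach_partition` and the example function `center` are hypothetical names, and a thread pool stands in for the cluster; this is not the Distributed R `foreach` construct itself.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach_partition(partitions, f):
    """Apply f to every partition concurrently; return updated partitions in order."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        return list(pool.map(f, partitions))


def center(partition):
    """f(x): subtract the partition's own mean from each of its values."""
    mean = sum(partition) / len(partition)
    return [value - mean for value in partition]


partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
centered = foreach_partition(partitions, center)
```

Each call to `center` sees only its own partition, so the work needs no coordination between partitions.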

18

Distributed R: basic concepts

19

Distributed R: Built-in distributed algorithms

• Similar signatures and accuracy to R packages
• Scalable and high performance
• E.g., regression on billions of rows in a couple of minutes

Algorithm | Use cases
Linear Regression (GLM) | Risk Analysis, Trend Analysis, etc.
Logistic Regression (GLM) | Customer Response modeling, Healthcare analytics (Disease analysis)
Random Forest | Customer churn, Market campaign analysis
K-Means Clustering | Customer segmentation, Fraud detection, Anomaly detection
Page Rank | Identify influencers

20

March Madness example

21

Predicting March Madness results using ML

• Load data from Vertica/HDFS/Local FS

• Optionally add additional data using IDOL OnDemand APIs

• Train model using Random Forest in Distributed R

• Deploy model to Vertica or as a web service

22

Model training

• Use team and opponent features to train a model (Blocks, steals, assists …)

• Learn what’s important in a basketball game (using Random Forest) and use that knowledge to predict game results.

23

Decision Trees

24

Parallel Random Forest

• Random Forest – building an ensemble of deep decision trees.

• Each tree is created with a different subset of the data and different features to generalize better.

• E.g. build 100 decision trees on 4 machines. Each machine builds 25 decision trees.
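The 100-trees-on-4-machines example can be sketched in plain Python. To stay short, this sketch makes several assumptions: it trains one-split trees (decision stumps) rather than deep trees, uses threads in place of machines, works on a made-up toy dataset, and all function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_stump(rows, labels, rng, n_features_to_try):
    """Fit a one-split tree on a bootstrap sample, using a random subset of features."""
    n = len(rows)
    sample = [rng.randrange(n) for _ in range(n)]  # bootstrap: rows drawn with replacement
    features = rng.sample(range(len(rows[0])), n_features_to_try)
    best = None
    for f in features:
        for i in sample:
            threshold = rows[i][f]
            for high_label in (0, 1):
                # Rule: predict high_label when value > threshold, else the other label.
                errors = sum(
                    (high_label if rows[j][f] > threshold else 1 - high_label) != labels[j]
                    for j in sample
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, high_label)
    _, f, threshold, high_label = best
    return (f, threshold, high_label)


def build_trees(rows, labels, n_trees, seed):
    """Work done by one machine: build its share of the forest."""
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    return [train_stump(rows, labels, rng, n_try) for _ in range(n_trees)]


def build_forest(rows, labels, total_trees=100, n_machines=4):
    """Split the tree count evenly (any remainder is dropped) and build the shares in parallel."""
    trees_each = total_trees // n_machines
    with ThreadPoolExecutor(max_workers=n_machines) as pool:
        shares = list(
            pool.map(lambda seed: build_trees(rows, labels, trees_each, seed), range(n_machines))
        )
    return [tree for share in shares for tree in share]


def predict(forest, row):
    """Majority vote over all trees in the forest."""
    votes_for_one = sum(high if row[f] > t else 1 - high for f, t, high in forest)
    return 1 if 2 * votes_for_one > len(forest) else 0


# Toy data (made up): the label depends only on the first feature.
data_rng = random.Random(0)
rows = [[data_rng.random(), data_rng.random()] for _ in range(40)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]

forest = build_forest(rows, labels)  # 4 "machines" x 25 trees each
prediction = predict(forest, [0.9, 0.1])
```

Because every tree is trained independently on its own bootstrap sample, the shares need no communication until the trees are collected into one forest.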

25

Game result prediction

• Group games by teams and get the average of each team’s features

• Predict the result of a game using the model

• Fill out the bracket by predicting one game at a time
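The prediction steps above can be sketched in Python. Everything specific here is an assumption for illustration: the team names, the feature values, and the stand-in `predict_winner` rule, which simply compares averaged features instead of calling a trained Random Forest. The bracket helper also assumes the number of teams is a power of two.

```python
from collections import defaultdict


def average_team_features(games):
    """Group per-game rows by team and average each feature.

    games: iterable of (team, {feature_name: value}) pairs, one per team per game.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for team, features in games:
        counts[team] += 1
        for name, value in features.items():
            sums[team][name] += value
    return {
        team: {name: total / counts[team] for name, total in features.items()}
        for team, features in sums.items()
    }


def fill_bracket(teams, team_features, predict_winner):
    """Fill out a single-elimination bracket by predicting one game at a time."""
    rounds = []
    while len(teams) > 1:
        winners = []
        for a, b in zip(teams[0::2], teams[1::2]):
            winners.append(predict_winner(a, team_features[a], b, team_features[b]))
        rounds.append(winners)
        teams = winners
    return rounds


def predict_winner(a, a_features, b, b_features):
    """Stand-in for the trained model: the team with the higher feature total wins."""
    return a if sum(a_features.values()) >= sum(b_features.values()) else b


games = [
    ("A", {"assists": 10, "steals": 4}),
    ("A", {"assists": 14, "steals": 6}),
    ("B", {"assists": 8, "steals": 5}),
    ("C", {"assists": 15, "steals": 7}),
    ("D", {"assists": 9, "steals": 3}),
]
averages = average_team_features(games)
rounds = fill_bracket(["A", "B", "C", "D"], averages, predict_winner)
```

Swapping `predict_winner` for a call into a trained model would leave the grouping and bracket logic unchanged.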

Demo

27

28

Distributed R: summary

• Regression on billions of rows in minutes
• Graph algorithms on 10B edges
• Load 400GB+ data from database to R in < 10 minutes
• Open source!

29

That’s cool… what can I do with it?

• Collaborate
  • GitHub (report issues, send PRs): https://github.com/vertica/DistributedR
  • Standardization with R-core: http://www.r-bloggers.com/enhancing-r-for-distributed-computing/

• Get the SW + docs: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/

• Buy commercial support

Publishing the model as a Web Service

31

Steps

• Expose R as a Web Service using OpenCPU (https://www.opencpu.org/)

• Create a Web App that makes predictions using that service.
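As a rough sketch of how a web app might call such a service, the Python snippet below builds an HTTP request in the style of the OpenCPU API, where an R function in an installed package is called with a POST to `/ocpu/library/<package>/R/<function>` and a `/json` suffix asks for the result as JSON. The package name `marchmadness`, the function `predict_game`, its arguments, and the server address are all hypothetical.

```python
import json
import urllib.request


def build_opencpu_request(base_url, package, function, args):
    """Build (but do not send) a POST request that calls an R function via OpenCPU."""
    url = f"{base_url.rstrip('/')}/ocpu/library/{package}/R/{function}/json"
    body = json.dumps(args).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical package, function, and arguments; the address assumes a local server.
request = build_opencpu_request(
    "http://localhost:5656",
    "marchmadness",
    "predict_game",
    {"team": "A", "opponent": "B"},
)
```

Passing `request` to `urllib.request.urlopen` would send it, assuming an OpenCPU server with that package installed is listening at the given address.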

Demo

33

Another example: Income Prediction

http://15.126.194.41/public/index.html

Conclusions

35

Haven Big Data Platform

Turn 100% of your data into action.

Human Data

Business Data

Machine Data

Powering Big Data Analytics to Applications

Insight

Haven OnDemand

• Vertica OnDemand

• IDOL OnDemand

Haven Enterprise

• Vertica Enterprise

• IDOL Enterprise

• Vertica for SQL on Hadoop

• Vertica Distributed R

• KeyView

HP Haven Big Data Platform

36

HP Haven Ecosystem for Developers

Haven OnDemand

• IDOLOnDemand.com

• VerticaOnDemand.com

Developer Community

• Ask questions

• Learn through code tutorials, events…

• Share ideas

• Code libraries and quick-starts

• Let us know if something is broken

HP Haven Marketplace

• Downloads for apps, extensions, plugins, widgets for HP Haven

• HP and 3rd Party developer apps

• Promote your app

• New revenue streams

37

“The future has already arrived, it’s just not evenly distributed yet”- William Gibson

Thank you

http://www8.hp.com/us/en/software-solutions/big-data-analytics-software.html

http://github.com/vertica