Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Upload: rapidminer

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Best Practices for Using Predictive Analytics to Extract Value from Hadoop

©2016 RapidMiner, Inc. All rights reserved. - 1 -

Best Practices for Extracting Value from Hadoop

Page 2: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


90% of Deployed Data Lakes are “USELESS”

“Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases.”

• SKILLS GAP is a major adoption inhibitor, cited by 57%

• How to EXTRACT VALUE from Hadoop, cited by 49%

Page 3: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Different Approaches to Big Data Analytics

Sampling

Grid Computing

Native Distributed Algorithms

Page 4: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Approach 1: Sampling

Page 5: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Sampling: Data Movement & Processing

• Data Movement: Pulls sample data from HDFS/Hive/Impala

• Data Processing: In the analytics tool

[Diagram: pieces of data are pulled out of Hadoop into the Analytics Tool, which performs the calculations]
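The slide does not say how the sample is drawn. One common way to pull a fixed-size uniform sample from a table that is too large to load is reservoir sampling; the sketch below is an illustration under that assumption, not the method used by any particular tool. The function name `reservoir_sample` and the idea of iterating over rows as a stream are hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length.

    Only k rows are held in memory at any time, so the source can be
    a cursor over a very large table (hypothetical usage).
    """
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i replaces a kept row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Usage: sample 5 row ids from a stream of one million.
sample = reservoir_sample(range(1_000_000), 5, seed=7)
print(sample)
```

If the source has fewer than k rows, all of them are returned.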

Page 6: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Sampling: Pros & Cons

• Pros
  + Simple and easy to start with
  + Usually works well for data exploration and early prototyping
  + Some ML models would not benefit from more data anyway

• Cons
  – Many ML models would benefit from more data
  – Cannot be used when large scale data preparation is needed
  – Hadoop is used as a data repository only


Page 7: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Sampling: Best Practices

• When to use it
  + Only data exploration / data understanding
  + Early prototyping on prepared and clean data
  + Machine Learning modeling with very few and basic patterns (e.g. only a handful of columns and binary prediction target)

• When NOT to use it
  – Large number of columns in the data
  – Need to blend large data sets (e.g. large-scale joins)
  – Complex Machine Learning models
  – Looking for rare events

• Horror stories
  – Important decisions made based on biased samples
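The warning about rare events can be checked with a little arithmetic. The sketch below assumes simple random sampling with independent draws from a very large table; the event rate and sample sizes are made-up illustration values, not figures from the presentation.

```python
def expected_rare_rows(sample_size, event_rate):
    """Expected number of rare-event rows in a simple random sample."""
    return sample_size * event_rate


def prob_sample_misses_event(sample_size, event_rate):
    """Probability that the sample contains no rare-event row at all.

    Assumes independent draws, which is a close approximation when the
    table is much larger than the sample.
    """
    return (1.0 - event_rate) ** sample_size


# Hypothetical case: 1 row in 10,000 is an event of interest.
rate = 1e-4
for n in (1_000, 10_000, 100_000):
    print(n, expected_rare_rows(n, rate), prob_sample_misses_event(n, rate))
```

With these assumed numbers, a 10,000-row sample is expected to hold about one event row and misses the event entirely more than a third of the time, which is too little signal to train or validate a model on.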


Page 8: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Different Approaches to Big Data Analytics

Sampling

Grid Computing

Native Distributed Algorithms

Data Visualization, Programming

Page 9: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Approach 2: Grid Computing

Page 10: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Approach 2: Grid Computing

• Data Movement
  • Only results are moved, data remains in Hadoop

• Data Processing
  • Custom single-node application running on multiple Hadoop nodes

• Pros & Cons
  + Hadoop is used for parallel processing in addition to being used as a data source
  – Only works if data subsets can be processed independently
  – Only as good as the single-node engine, no benefit from fast-evolving Hadoop innovations

[Diagram: the Analytics Tool sends the Application to Hadoop; several App instances perform the Calculations there; Results are returned to the tool]
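A minimal sketch of the grid pattern, with a thread pool standing in for the Hadoop nodes. The same single-node function runs on each independent subset, and only the small per-subset results travel back. The names `single_node_app` and `run_on_grid` and the `amount` field are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def single_node_app(partition):
    """Stand-in for the single-node application: summarizes one subset.

    It sees only its own partition, which is why the approach requires
    subsets that can be processed independently.
    """
    values = [row["amount"] for row in partition]
    return {"rows": len(values), "total": sum(values)}


def run_on_grid(partitions, max_workers=4):
    """Run the same app on every partition in parallel and collect results.

    Results come back in the same order as the partitions.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(single_node_app, partitions))


# Usage: two independent subsets, e.g. one per customer region.
partitions = [
    [{"amount": 10.0}, {"amount": 5.0}],
    [{"amount": 2.5}],
]
print(run_on_grid(partitions))
```

Nothing in `single_node_app` can look across partitions, so any task that needs such a view has to be stitched together by the caller afterwards.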

Page 11: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Grid computing best practices

• When to use it
  + Task can be performed on smaller, independent data subsets
  + Compute-intensive data processing

• When NOT to use it
  – Data-intensive data processing
  – Complex Machine Learning models
  – Lots of interdependencies between data subsets

• Horror stories
  – Grid computing job called in huge loops to manage dependencies and intermediate results

Page 12: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Approach 2: Grid Computing

Legacy single-machine analytics engines

Page 13: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Approach 3: Native Distributed Algorithms

Page 14: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Approach 3: Native distributed algorithms

• Data Movement
  • Only results are moved, data remains in Hadoop

• Data Processing
  • Executed by native Hadoop tools: Hive, Spark, H2O, Pig, MapReduce, etc.

• Pros & Cons
  + Holistic view of all data and patterns
  + Highly scalable distributed processing optimized for Hadoop
  – Limited set of algorithms available, very hard to develop new algorithms

[Diagram: the Analytics Tool pushes instructions to Hadoop, where the Calculations run; Results are returned to the tool]
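What "holistic view" means in practice can be sketched in plain Python, imitating the map and reduce steps that engines such as Spark or MapReduce distribute across nodes. Each partition emits a small partial aggregate, and the partials are merged into one global statistic. The function names are hypothetical, and a real engine would also handle scheduling, shuffling and fault tolerance.

```python
from functools import reduce


def map_partition(values):
    """Per-partition step: emit (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Merge two partial aggregates; associative, so merge order is free."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def global_mean_variance(partitions):
    """Mean and population variance over ALL partitions, not per subset."""
    n, s, ss = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("no data")
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance


# Usage: the data stays split, only three numbers per partition are moved.
print(global_mean_variance([[1.0, 2.0], [3.0, 4.0]]))
```

The difficulty the slide mentions comes from this requirement: an algorithm has to be expressed as mergeable partial results before it can run this way, and many algorithms do not decompose that easily.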

Page 15: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Native distributed algorithms best practices

• When to use it
  + Complex Machine Learning models needed
  + Lots of interdependencies inside the data (e.g. graph analytics)
  + Need to blend and cleanse large data sets (e.g. large-scale joins)

• When NOT to use it
  – Data is not that large
  – Sample would reveal all interesting patterns

• Horror stories
  – Complex ML model developed in 3 months defeated by a prototype model created in an afternoon

Page 16: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Approach 3: Native Distributed Algorithms

Hadoop ecosystem projects

Page 17: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Different Approaches to Big Data Analytics

Sampling

Grid Computing

Native Distributed Algorithms

Which one to use for a given use case?

Page 18: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Typical projects need all three to succeed

Sampling

Grid Computing

Native Distributed Algorithms
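Because a project usually mixes all three, the choice is made per task rather than per project. The "when to use it" rules from the earlier slides can be written down as a rough helper. This is a simplification for illustration: the field names and the order in which the rules are checked are assumptions, not part of the presentation.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of one step in an analytics project."""
    needs_large_scale_joins: bool = False
    needs_complex_model: bool = False
    has_interdependent_subsets: bool = False
    subsets_are_independent: bool = False
    compute_intensive: bool = False


def suggest_approach(task):
    """Map a task to one of the three approaches (assumed rule order)."""
    # Large-scale blending, complex models or interdependent data need
    # a view across all data, which only the native approach provides.
    if (task.needs_large_scale_joins
            or task.needs_complex_model
            or task.has_interdependent_subsets):
        return "native distributed algorithms"
    # Independent, compute-heavy subsets fit the grid pattern.
    if task.subsets_are_independent and task.compute_intensive:
        return "grid computing"
    # Exploration and early prototyping default to a sample.
    return "sampling"


# Usage: three steps of one project can land on three approaches.
print(suggest_approach(Task()))
print(suggest_approach(Task(subsets_are_independent=True, compute_intensive=True)))
print(suggest_approach(Task(needs_large_scale_joins=True)))
```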

Page 19: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


RapidMiner Predictive Analytics Platform

Page 20: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Single Analytics Platform to support all three

Sampling

Grid Computing

Native Distributed Algorithms

Page 21: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


DEMO

Page 22: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


ACCELERATES Time-to-Value

• DATA PREP: Speed & optimize ALL data exploration, blending & cleansing tasks

• OPERATIONALIZE: Easily deploy & maintain models and embed analytic results

• MODEL & VALIDATE: Rapidly prototype and confidently validate predictive models

• CONNECT TO ANY DATA SOURCE, ANY FORMAT, AT ANY SCALE

• SUPPORT FOR ALL MAJOR BI, DATA VISUALIZATION & BUSINESS APPLICATIONS

Page 23: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Single Analytics Platform to support all three

Page 24: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Example RapidMiner Execution on Hadoop

The user designs the predictive analytics process in RapidMiner Studio. RapidMiner Radoop translates each component of the process into the appropriate Hadoop language and pushes these instructions into Hadoop, where the analytics are calculated. This lets RapidMiner use the full computational power of Hadoop and extract insights from across the entire cluster, for a more complete, holistic view. Results are delivered back to RapidMiner, where they are visible in the Studio UI or can be operationalized in business applications via the RapidMiner Server deployment tier.
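The idea of translating process components into a Hadoop language can be illustrated with a toy translator that turns a list of steps into one HiveQL statement. This is not Radoop's actual translation logic or output. The step format, the function name `translate_to_hiveql`, and the `sales` table with its columns are all hypothetical.

```python
def translate_to_hiveql(table, steps):
    """Turn a tiny, made-up process description into a HiveQL query string.

    Supported steps (hypothetical format):
      ("filter", condition)
      ("aggregate", group_column, aggregate_expression)
    """
    conditions = []
    group_col = None
    agg_expr = None
    for step in steps:
        kind = step[0]
        if kind == "filter":
            conditions.append(step[1])
        elif kind == "aggregate":
            _, group_col, agg_expr = step
        else:
            raise ValueError(f"unsupported step: {kind}")

    select = f"{group_col}, {agg_expr}" if group_col else "*"
    sql = f"SELECT {select} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(f"({c})" for c in conditions)
    if group_col:
        sql += f" GROUP BY {group_col}"
    return sql


# Usage: the query text is what gets pushed to the cluster, not the data.
process = [("filter", "amount > 100"), ("aggregate", "region", "AVG(amount)")]
print(translate_to_hiveql("sales", process))
```

The sketch builds the query by plain string concatenation, which is acceptable for an illustration but not for untrusted input.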

Page 25: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Extracting Value from Hadoop

RapidMiner Radoop extends predictive analytics to Hadoop and Spark

• We speak Hadoop so you don’t have to
  Translates predictive analytics into native Hadoop, so you concentrate on creating analytics, not Hadoop programming

• COMPLETE insights into your Big Data
  Pushes analytic instructions into Hadoop for computation, so you can analyze the full breadth and variety of your Big Data

• Use your favorite Hadoop scripts, too!
  Incorporates SparkR, PySpark, Pig and HiveQL

• Safe and sound
  Integrates with Kerberos authentication and supports data access authorization: seamless for users, easy administration for IT

Page 26: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


RapidMiner is #1 Open Source Data Science Platform

• Leader (2016, 2015 & 2014): Gartner Magic Quadrant for Advanced Analytics Platforms

• Strong Performer (2015): Forrester Wave on Big Data Predictive Analytics

• Innovation Winner (2015): Wisdom of Crowds for Advanced & Predictive Analytics, Big Data Analytics & End-User Data Preparation

• #1 Open-Source Platform (2015, 2014, 2013): Data Mining & Analytics Software Poll

Page 27: Best Practices for Using Predictive Analytics to Extract Value from Hadoop


Q&A

Download RapidMiner for free today

rapidminer.com