Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD

Presenter

Proven Performance Since 1995

TDWI helps business and IT professionals gain insight about data warehousing, BI, and analytics:

•  High-quality, vendor-neutral educational offerings
•  Independent analyst research staff and thought leadership
•  Trusted sources of emerging information and trends
•  Ability to bring together qualified BI/DW professionals and solution providers

Premium Membership, conferences, seminars, research, publications, topical portals, whitepaper library, and numerous online programs

www.tdwi.org

Agenda

•  Introduction to big data and predictive analytics

•  Popular predictive analytics methodologies – Examples – Guidelines

•  Deployment models

Big Data

[Figure labels: Social, M2M/IoT, Text, Mobile/Location, Volume, Formats]

A Confusion of Words

Big Data Analytics

•  Text analytics
•  Predictive analytics
•  Slicing and dicing
•  Visual discovery
•  Link analysis
•  Stream mining
•  Etc.

Predictive Analytics

A statistical or data mining solution consisting of algorithms and techniques that can be used on both structured and unstructured data to determine outcomes

A Lot of It Is Used to Predict Behavior

•  People – Churn – Marketing – Fraud detection

•  Machine – Operations maintenance

•  And much, much more!

Good source for use cases

Of Course, It Isn’t Just About Modeling

CRISP-DM Lifecycle

A Vast Array of Techniques

Source: TDWI BPR on Predictive Analytics, 2014; n=242

Supervised

•  Use it when you know outcomes of interest – Leave vs. stay – Revenue prediction

•  Need enough data for training, testing, validation
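The need for separate training, validation, and test data can be sketched as a simple shuffled split. This is a minimal illustration, not something taken from the slides: the 60/20/20 proportions, the `split_dataset` name, and the toy churn records are all assumptions.

```python
import random


def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test lists."""
    if not 0 < train_frac + valid_frac < 1:
        raise ValueError("train_frac + valid_frac must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out for testing
    return train, valid, test


# Hypothetical labeled records: (customer_id, known outcome)
records = [(i, "leave" if i % 4 == 0 else "stay") for i in range(100)]
train, valid, test = split_dataset(records)
print(len(train), len(valid), len(test))
```

The model is fit on the training rows, tuned against the validation rows, and scored once on the test rows, which is why a supervised approach needs enough labeled data to populate all three.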

Unsupervised

•  Does not include target information
•  Looks for commonalities/hidden structures in data
•  May not produce useful insight
•  Is it prediction?

Techniques

•  Supervised
–  Classification
–  Regression
–  Neural networks
•  Unsupervised
–  Clustering
–  Association

•  Supervised
–  Deep learning, auto-encoders
–  Decision trees, random forests, gradient boosting
–  Support vector machines, Bayesian classifiers, principal component, discriminant analysis
•  Unsupervised
–  Nearest-neighbor mapping, k-means clustering, self-organizing maps
–  Factor analysis, link analysis

Decision Trees

Good for classification and prediction with known, discrete outcomes
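To make the idea of a tree rule concrete, here is a rough sketch of how a single split might be chosen for a discrete outcome. It assumes Gini impurity as the split criterion (one common choice; the slides do not name one), and the tenure data and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) for the best 'value <= threshold' rule."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical data: months as a customer and the known outcome
months_as_customer = [2, 3, 5, 8, 12, 20, 24, 36]
outcome = ["leave", "leave", "leave", "leave", "stay", "stay", "stay", "stay"]
threshold, impurity = best_split(months_as_customer, outcome)
print(f"Rule: months <= {threshold} -> leave (impurity {impurity})")
```

A full decision tree repeats this search recursively on each side of the split, which is what produces readable rules rather than an equation.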

Linear Regression

Used to predict a continuous variable from independent variables
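As an illustration, a one-variable least-squares fit can be written in a few lines. This is a sketch under stated assumptions: a single independent variable, made-up spend and revenue figures, and a hypothetical `fit_line` helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: marketing spend (x) and revenue (y), both in thousands
spend = [1, 2, 3, 4, 5]
revenue = [3, 5, 7, 9, 11]
slope, intercept = fit_line(spend, revenue)
predicted = slope * 6 + intercept  # predict revenue for a spend of 6
print(slope, intercept, predicted)
```

With several independent variables the same idea applies, but the fit is usually delegated to a statistics library rather than written by hand.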

Artificial Neural Networks (1)

 Biological to Mathematical

Source: agh.edu

Artificial Neural Networks (2)

Source: Commonsenseatheism.com

Can be used on a range of problems; good for classification and estimation
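The step from biological neuron to mathematical unit can be sketched as a weighted sum of inputs passed through an activation function. The sigmoid activation, the weights, and the bias below are assumptions chosen for illustration; the slides do not specify any of them.

```python
import math


def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # squashes z into the range (0, 1)


# Hypothetical inputs and weights
output = neuron(inputs=[1.0, 0.0], weights=[2.0, -1.0], bias=-2.0)
print(output)  # weighted sum is 0, so the sigmoid returns 0.5
```

A network stacks many such units in layers and learns the weights from training data, which is where both the flexibility and the black-box character come from.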

Clustering

Source: Babelomics

Used to group observations by perceived similarity
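Grouping by similarity can be sketched with k-means, one widely used clustering method. The version below is a simplified illustration: it assumes 2-D points, squared Euclidean distance, caller-supplied starting centroids, and a fixed number of iterations instead of a convergence check. The data is invented.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Hypothetical observations forming two loose groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignments = kmeans(points, centroids=[(0, 0), (10, 10)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Note that the number of clusters is fixed by the number of starting centroids, so it has to be chosen before the algorithm runs.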

Association Rule Mining

Transaction   Items
1             milk, lettuce
2             lettuce, diapers, beer, cookies
3             milk, diapers, beer, plastic bags
4             lettuce, milk, diapers, beer
5             lettuce, milk, diapers, plastic bags

Diapers -> Beer

Two concepts: support and confidence

Used to find relationships
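Support and confidence for the rule Diapers -> Beer can be worked out directly from the five transactions above. The sketch below uses the standard definitions (support is the share of all transactions containing every item in the rule; confidence is the share of transactions with the antecedent that also contain the consequent); the function names are illustrative.

```python
transactions = [
    {"milk", "lettuce"},
    {"lettuce", "diapers", "beer", "cookies"},
    {"milk", "diapers", "beer", "plastic bags"},
    {"lettuce", "milk", "diapers", "beer"},
    {"lettuce", "milk", "diapers", "plastic bags"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions with the antecedent, the fraction that also have the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)  # 0.6 0.75
```

Diapers and beer appear together in 3 of 5 transactions (support 0.6), and 3 of the 4 transactions containing diapers also contain beer (confidence 0.75).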

Quick Quiz

•  How much revenue will this customer bring? –  Regression

•  Who is going to take a certain action? –  Classification

•  What are my customer segments? –  Clustering

•  If a customer buys X, what else might it buy? –  Association rules

Strengths & Weaknesses: Decision Trees

Strengths

•  Easy to understand
–  Rules vs. equations
•  Easy to explain
•  Not a black box
•  Data doesn’t have to follow any distribution
•  Can handle interactions between variables

Weaknesses

•  Continuous value predictions
•  Can be computationally expensive to train
•  Can have problems if many classes and few training examples
•  Overfitting

Strengths & Weaknesses: Regression

Strengths

•  Simple to use
•  Easy to explain through independent variables

Weaknesses

•  Relationship needs to be linear
•  Hard-to-handle categorical variables or variables that interact
•  Outliers hard to model

Strengths & Weaknesses: Neural Networks

Strengths

•  Good for a specific class of problems
•  May be easy to implement
•  Non-linear/interaction variables

Weaknesses

•  Hard-to-explain output (black box)
•  Output might be unpredictable
•  Training can take a long time

Strengths & Weaknesses: K Means

Strengths

•  Good for large datasets
•  Simple
•  Efficient

Weaknesses

•  Need to specify K upfront
•  Sensitive to outliers, which may result in incorrect cluster boundaries
•  Needs a mean (categorical data?)

Strengths & Weaknesses: Association Rules

Strengths

•  Simple
•  Text data (categorical)

Weaknesses

•  Can be computationally expensive
•  Potential for spurious patterns
•  Rules do not mean causality

Ensemble Modeling

•  Multiple models are combined to solve a problem
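One simple way to combine models is majority voting, sketched below. This is only one of several ensemble strategies, and the three rule-based "models", their thresholds, and the customer record are hypothetical stand-ins for trained models.

```python
from collections import Counter


def majority_vote(models, record):
    """Ask every model for a prediction and return the most common answer."""
    votes = [model(record) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical stand-in models, each a simple rule
def by_tenure(customer):
    return "leave" if customer["months"] < 6 else "stay"


def by_complaints(customer):
    return "leave" if customer["complaints"] >= 3 else "stay"


def by_usage(customer):
    return "leave" if customer["logins_per_week"] < 1 else "stay"


customer = {"months": 4, "complaints": 1, "logins_per_week": 0}
prediction = majority_vote([by_tenure, by_complaints, by_usage], customer)
print(prediction)  # two of three models vote "leave"
```

Using an odd number of models avoids ties in a two-class vote; techniques named earlier in the deck, such as random forests and gradient boosting, combine many decision trees in more elaborate ways.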

Vendors Are Offering a Range of Options for Predictive Analytics

•  UI easier to use: visual vs. code based
•  Automation
•  Collaboration/interactivity
•  Cloud options
•  Operationalizing and embedding advanced analytics

Operationalizing

An example:

TDWI Big Data Maturity Model

QUESTIONS?