simplifyd predictive analytics primer

Simplified Analytics Predictive Analytics: Primer Sep 27, 2011

Upload: ramkumar-ravichandran

Post on 14-Dec-2014


Category:

Data & Analytics



DESCRIPTION

High-level overview of Predictive Analytics techniques: Decision Trees, Regressions, Time Series Forecasting, Exponential Smoothing, etc. It was put together to train friends and mentees. It is based on personal learnings and research, contains no proprietary information, and makes no claim of 100% accuracy. Every institution, organization, or team uses its own steps and methodologies, so please use the one relevant to you and treat this as training material only.

TRANSCRIPT

Page 1: Simplifyd predictive analytics primer

Simplified Analytics

Predictive Analytics: Primer

Sep 27, 2011

Page 2: Simplifyd predictive analytics primer

CONTENTS

What is Predictive Analytics?

Various ways of doing it
• Forecasting Techniques
• Decision Trees
• Regression

How to find out if a method works?

How to deploy them in the real world?

When to do Predictive Analytics vs. not?

References

Intended for Knowledge Sharing only.

Page 3: Simplifyd predictive analytics primer


What is Predictive Analytics?

Prediction of the future value of a variable of interest (the ‘predicted’ variable) from past values of either itself or other explanatory variables (the ‘predictors’), e.g., stock price movements, credit card default rates, inventory management, etc.

Concepts of Time Windows..

• Development window (Jan’08 – Jun’10)
  • Observe the predicted variable (stock price, default rate, etc.) and/or get the relationship with predictor variables
• Validation window (Jul’10 – Dec’10)
  • Check if prediction accuracy is within acceptable limits
  • If not, improve the prediction framework
• Prediction window (Jan’11 – May’11)
  • Use the predictive method to get the projections
  • Strategize business actions based on projections

Other time components..

• Trend – long-term organic growth
• Seasonality – specific fluctuations repeating at certain time points (months, days) every year

[Chart: time series with a value axis from 0 to 120,000,000, divided into Development Window, Validation Window, and Prediction Window]
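The three windows can be sketched in code as a simple split of a monthly series by date. This is a minimal illustration only: the function name `split_windows` and the placeholder values are assumptions, not part of the original deck, and in practice the prediction window would hold no actuals yet.

```python
from datetime import date

def split_windows(series, dev_end, val_end):
    """Split (month, value) pairs into development, validation and prediction windows."""
    development = [(m, v) for m, v in series if m <= dev_end]
    validation = [(m, v) for m, v in series if dev_end < m <= val_end]
    prediction = [(m, v) for m, v in series if m > val_end]
    return development, validation, prediction

# Hypothetical monthly series covering Jan'08 to May'11 (41 months).
# The values are placeholders, not real data.
months = [date(2008 + i // 12, i % 12 + 1, 1) for i in range(41)]
series = [(m, 100.0 + i) for i, m in enumerate(months)]

# Window boundaries taken from the slide: development ends Jun'10, validation ends Dec'10.
dev, val, pred = split_windows(series, date(2010, 6, 1), date(2010, 12, 1))
print(len(dev), len(val), len(pred))
```

The method is built on `dev`, checked against `val`, and only then used to project the months in `pred`.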

Page 4: Simplifyd predictive analytics primer


Various ways of doing it

All methods can be grouped into three broad categories..
• Simple Forecasting Techniques
• Decision Trees
• Regression

Simple Forecasting Techniques:
• Moving Averages – average over the last ‘x’ months
• Decomposition Method – tease out the Trend and Seasonality components for use in predictions
• Holt Exponential Smoothing Techniques – apply Trend and Seasonality to Exponential Averages. Exponential Averages assign progressively smaller weights to older observations.

Decision Trees:
• Break the population down into smaller buckets and predict for each bucket. They typically yield higher prediction accuracy than simple forecasting techniques.

Regression:
• Establishes a mathematical relationship between the ‘predicted’ and the ‘predictor’, which can then be used to predict future values from known values of the ‘predictor’.

Page 5: Simplifyd predictive analytics primer


Simple Forecasting Techniques

These are the simplest forecasting methods, but they cannot explain why they predict a certain value...

Moving Averages:

Prediction(t) = Average(Values from t-x to t-1)
For the next month, shift the averaging window forward by 1 month, and so on.
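A minimal sketch of the moving-average forecast follows. The function name and the sample numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def moving_average_forecast(values, window):
    """Forecast the next period as the mean of the last `window` observations."""
    if window <= 0 or len(values) < window:
        raise ValueError("need at least `window` observations")
    return sum(values[-window:]) / window

history = [100, 110, 120, 130]  # hypothetical monthly actuals
next_month = moving_average_forecast(history, window=3)  # mean of 110, 120, 130

# When the next actual arrives, append it; the window shifts forward by one month.
history.append(125)
following = moving_average_forecast(history, window=3)  # mean of 120, 130, 125
```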

Decomposition Method:

Prediction(t) = Trended value (T) * Seasonality Index (SI)
-> T = Actual(t-1) * (1 + Growth rate), where Actual(t-1) is the last available month and Growth rate = (Actual(t-1) – Actual(t-2)) / Actual(t-2)
-> SI = Average of all Januaries / Average of all months (shown here for January). SI has to be calculated separately for each of the 12 months, and the SI of the month being predicted is then applied.
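The decomposition forecast can be sketched as below. Two assumptions are made here: the growth factor is read as 1 plus the growth rate between the last two available periods, and quarterly data (`period=4`) is used instead of monthly data to keep the example short. All names and numbers are hypothetical.

```python
def seasonality_indices(values, period=12):
    """SI for each position in the cycle: mean of that position / mean of all values."""
    overall = sum(values) / len(values)
    indices = []
    for pos in range(period):
        same_pos = values[pos::period]  # e.g. every January when period=12
        indices.append((sum(same_pos) / len(same_pos)) / overall)
    return indices

def decomposition_forecast(values, period=12):
    """Forecast the next period as trended value * seasonality index."""
    # Assumption: growth factor = 1 + growth rate over the last two periods.
    growth_rate = (values[-1] - values[-2]) / values[-2]
    trended = values[-1] * (1 + growth_rate)
    next_pos = len(values) % period  # cycle position of the period being predicted
    return trended * seasonality_indices(values, period)[next_pos]

# Hypothetical quarterly data over two years; sales[0] is a first quarter.
sales = [80, 100, 120, 100, 88, 110, 132, 110]
si = seasonality_indices(sales, period=4)
forecast = decomposition_forecast(sales, period=4)
print([round(s, 2) for s in si], round(forecast, 2))
```

Because the growth rate here comes from two adjacent periods, it also picks up seasonal movement. A common variation is to estimate the trend on deseasonalized values or over a full year.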

Holt Exponential Smoothing:

Prediction(t) = (Smoothed series + Trend (T)) * SI
-> Smoothed series = Smoothing Factor * Actual for last month + (1 – Smoothing Factor) * Smoothed value for last month, and so on
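The smoothed-series recursion can be sketched as follows. This covers only the smoothing step, not the trend and seasonality terms. Seeding the series with the first actual is an assumption, and the names and numbers are hypothetical.

```python
def exponential_smoothing(values, alpha):
    """Smoothed series: s[0] = values[0]; s[i] = alpha * values[i] + (1 - alpha) * s[i-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [values[0]]  # assumption: seed with the first actual
    for actual in values[1:]:
        smoothed.append(alpha * actual + (1 - alpha) * smoothed[-1])
    return smoothed

actuals = [100, 120, 110, 130]  # hypothetical monthly actuals
smoothed = exponential_smoothing(actuals, alpha=0.5)

# Implied weight on the observation k months back: alpha * (1 - alpha) ** k.
# Older observations get progressively smaller weights.
weights = [0.5 * (1 - 0.5) ** k for k in range(3)]
print(smoothed, weights)
```

A larger smoothing factor makes the smoothed series react faster to recent actuals. A smaller one gives a flatter series.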

[Chart: International airline passengers ('000), Jan-60 to Dec-60, value axis from 350 to 700. Series: Actuals, Moving Averages (12 months), Decomposition Method, Holt-Winters]

Page 6: Simplifyd predictive analytics primer

Decision Trees

Higher prediction accuracy and explainability, since prediction is done at the member sub-group level…

• Begins with the entire population and splits it on the ‘predicted’ variable (e.g., default rate) by a predictor variable, e.g., Customer type – Sub-prime or Premium
• Checks if the difference in the ‘predicted’ (default rate) between the resulting groups is statistically significant, using a Chi-square test or t-test
  • If the difference is significant, it goes on to split the nodes* by other variables
  • If not, it goes back and tries to ‘significantly’ split the population by another variable

How long does it keep splitting?
• As long as it finds significant splits based on the Chi-square or t-tests
• Until it hits the maximum number of nodes* (a manageable number for business actions)
• Until the count in the lowest nodes falls below 5% of the population

*Each subgroup resulting from a split is called a node

Example tree (each entry is a node):

• All Credit Card holders – Default rate: 2%
  • Sub-prime – Default rate: 5%
    • FICO < 250 – Default rate: 8%
    • FICO 250 to 400 – Default rate: 6%
    • FICO > 400 – Default rate: 4%
  • Premium – Default rate: 1%
    • Monthly spend < $500 – Default rate: 0.5%
    • Monthly spend > $500 – Default rate: 1.5%
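The significance check behind a single split can be sketched with a Pearson Chi-square test on a 2x2 table (two groups, default vs. no default). The customer counts are hypothetical, chosen only to reproduce the 5%, 1% and 2% default rates in the example tree. The critical value 3.841 is the standard Chi-square threshold for 1 degree of freedom at the 5% level.

```python
def chi_square_2x2(defaults_a, total_a, defaults_b, total_b):
    """Pearson chi-square statistic for default vs. non-default across two groups."""
    observed = [
        [defaults_a, total_a - defaults_a],
        [defaults_b, total_b - defaults_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [observed[0][0] + observed[1][0], observed[0][1] + observed[1][1]]
    grand_total = total_a + total_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

CRITICAL_5PCT_1DF = 3.841  # chi-square critical value: 1 degree of freedom, 5% level

# Hypothetical counts: 125 defaults among 2,500 sub-prime customers (5%)
# and 75 defaults among 7,500 premium customers (1%); 2% overall.
stat = chi_square_2x2(125, 2500, 75, 7500)
split_is_significant = stat > CRITICAL_5PCT_1DF
print(round(stat, 2), split_is_significant)
```

A tree-building routine would run a check like this for each candidate predictor and keep only the splits that pass it.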

Page 7: Simplifyd predictive analytics primer

Regression

Typically the highest prediction accuracy and explainability, since prediction is done at the individual member level…

• Estimates the degree of relationship between the “predicted” variable and the “predictor” variables, e.g., Credit Card default = intercept + b1*bankruptcy + b2*payment to income ratio
-> intercept – the part of the “predicted” not explained by the predictors, i.e., its value when all “predictor” values are zero
-> b1, b2 – strength of relationship: how much the “predicted” (default probability) changes with a unit change in a “predictor” value (bankruptcy or payment to income ratio)

What are the various types of regression?

Regression Methods

Linear
• When should it be used? To predict the value of a variable, e.g., credit card spend, inventory quantity
• Inherent assumption: the predicted variable (strictly, its error around the fitted line) follows a "normal distribution", meaning most members of the population have about average values, with lower counts towards the extremes

Logistic
• When should it be used? To predict the probability of a certain event happening, e.g., credit card default, inventory shortage
• Inherent assumption: the probability of the event happening follows a "binomial distribution", meaning the probability of observing 'x' defaulters when picking 'N' members is highest if the proportion of defaulters in the population is (x/N)

ARIMA
• When should it be used? To predict future values from historical figures, e.g., future stock price from past figures
• Inherent assumption: a ‘stationary time series’, i.e., the structure of the time series doesn’t change significantly over time, e.g., no increase in volatility and no change in the growth rate itself
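A minimal sketch of how the intercept and coefficient are estimated is given below for the simplest case: linear regression with a single predictor, fitted by least squares. The data and names are hypothetical. The `logistic` helper shows the transformation logistic regression applies to turn the same kind of linear score into a probability between 0 and 1.

```python
import math

def fit_simple_linear(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def logistic(score):
    """Map a linear score (intercept + b1*x1 + ...) to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-score))

# Hypothetical data: payment to income ratio vs. monthly card spend.
ratio = [0.1, 0.2, 0.3, 0.4]
spend = [900, 800, 700, 600]
intercept, slope = fit_simple_linear(ratio, spend)

# Predict spend for a new customer with a ratio of 0.25.
predicted_spend = intercept + slope * 0.25
print(round(intercept, 2), round(slope, 2), round(predicted_spend, 2))
```

Real models use several predictors and a statistics package rather than hand-rolled formulas. The interpretation of the intercept and coefficients is the same.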

Page 8: Simplifyd predictive analytics primer


How to find out if a method works?

Various measurement diagnostics can be used to check prediction accuracy…

• Root Mean Square Error (RMSE): Typical size of the difference between actual and predicted values. RMSE = square root of the average of square(actual – predicted)

• Error rate (%): Tells how large the error is relative to the actual values of the predicted variable. Error rate (%) = 100 * RMSE / average of actuals
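Both diagnostics can be sketched in a few lines, taking RMSE in its standard form as the square root of the mean squared error. The function names and sample values are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def error_rate_pct(actual, predicted):
    """RMSE expressed as a percentage of the average actual value."""
    return 100 * rmse(actual, predicted) / (sum(actual) / len(actual))

# Hypothetical validation-window values.
actual = [100, 200, 300]
predicted = [110, 190, 310]
print(rmse(actual, predicted), error_rate_pct(actual, predicted))
```

Computed on the validation window, these two numbers are what gets compared against the acceptable accuracy limits.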

Decision Trees and Regression models have more sophisticated diagnostics…

• R-square: Tells how much of the variance in the “predicted” variable is captured by the model.

• Rank Order: Checks if the predicted values correlate with the actual values. Steps:
  • Sort the population by predicted values
  • Split into groups with an equal number of observations, generally ten groups (deciles)
  • Get the average of both actual and predicted values for each group
  • Check if both averages decrease gradually from the top group to the bottom group

• Gains Chart: Useful mostly for logistic regression models. Tells if most of the defaulters are captured in the top groups. If not, the model isn’t giving the highest probabilities to the actual defaulters and needs to be revisited.

• Akaike Information Criterion (AIC): Helps in selecting the most “parsimonious” regression model, i.e., maximum information capture with the fewest predictors.
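The Rank Order steps and the Gains Chart calculation can be sketched as follows. The function names and the ten sample accounts are hypothetical, and five groups are used instead of deciles to keep the example small.

```python
def ranked_groups(actual, predicted, groups=10):
    """Sort by predicted value (highest first) and cut into equal-sized groups."""
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    size = len(ranked) // groups
    chunks = [ranked[g * size:(g + 1) * size] for g in range(groups - 1)]
    chunks.append(ranked[(groups - 1) * size:])  # last group takes any remainder
    return chunks

def group_averages(actual, predicted, groups=10):
    """(average predicted, average actual) for each group, top group first."""
    return [
        (sum(p for p, _ in chunk) / len(chunk), sum(a for _, a in chunk) / len(chunk))
        for chunk in ranked_groups(actual, predicted, groups)
    ]

def is_rank_ordered(averages):
    """True if average actuals never increase from the top group to the bottom."""
    actuals = [avg_actual for _, avg_actual in averages]
    return all(upper >= lower for upper, lower in zip(actuals, actuals[1:]))

def cumulative_gains(actual, predicted, groups=10):
    """Cumulative share of all events (actual == 1) captured, top group first."""
    total_events = sum(actual)
    gains, captured = [], 0
    for chunk in ranked_groups(actual, predicted, groups):
        captured += sum(a for _, a in chunk)
        gains.append(captured / total_events)
    return gains

# Hypothetical scored accounts: predicted default probability and actual outcome.
predicted = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
averages = group_averages(actual, predicted, groups=5)
gains = cumulative_gains(actual, predicted, groups=5)
print(is_rank_ordered(averages), gains)
```

In this made-up sample the top two of five groups capture three of the four defaulters, which is the pattern a Gains Chart is meant to reveal.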

Page 9: Simplifyd predictive analytics primer


How to deploy them in real world?

Simple Forecasting Techniques are used to predict at the portfolio level only, e.g., predictions for an Auto-Lease portfolio’s loss rates,

but both Decision Trees and Regression Models require separate infrastructure to be deployed for real-time or non-real-time predictions…

A Decision Tree is used as a “rule engine”. Every customer falls into one of the nodes, and the prediction for that node is used to act on that customer’s request, e.g., a Sub-prime customer with FICO < 250 will be targeted as soon as he is just 1 payment due, whereas a Premium customer will be given leeway up to 4 payments due.

A Regression Model gives “account level” estimates, which are then used to act on a customer’s request, e.g., fraud models. The model has to run every time a customer transacts.
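A decision tree deployed as a “rule engine” can be sketched as a lookup from customer attributes to a node, and from the node to an action threshold. The node definitions follow the example tree in this deck. The thresholds of 1 and 4 payments due come from the text above. All other thresholds, the field names, and the treatment of a monthly spend of exactly $500 are assumptions made for illustration.

```python
def assign_node(customer):
    """Map a customer record (dict) to a leaf node of the example tree."""
    if customer["segment"] == "sub-prime":
        if customer["fico"] < 250:
            return "sub-prime, FICO < 250"
        if customer["fico"] <= 400:
            return "sub-prime, FICO 250 to 400"
        return "sub-prime, FICO > 400"
    # Assumption: a spend of exactly $500 is grouped with the higher-spend node.
    if customer["monthly_spend"] < 500:
        return "premium, spend < $500"
    return "premium, spend >= $500"

# Number of payments due at which the customer is contacted.
# 1 and 4 follow the text; the values of 2 are hypothetical placeholders.
PAYMENTS_DUE_BEFORE_ACTION = {
    "sub-prime, FICO < 250": 1,
    "sub-prime, FICO 250 to 400": 2,
    "sub-prime, FICO > 400": 2,
    "premium, spend < $500": 4,
    "premium, spend >= $500": 4,
}

def should_contact(customer, payments_due):
    """Apply the node's rule to decide whether to act on this customer."""
    return payments_due >= PAYMENTS_DUE_BEFORE_ACTION[assign_node(customer)]

risky = {"segment": "sub-prime", "fico": 200, "monthly_spend": 300}
safe = {"segment": "premium", "fico": 700, "monthly_spend": 800}
print(should_contact(risky, 1), should_contact(safe, 1), should_contact(safe, 4))
```

Because the rules are a fixed lookup, they can run in a lightweight system. A regression model instead has to be scored afresh on each account or transaction.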

Page 10: Simplifyd predictive analytics primer

When to do Predictive Analytics vs. not?

[Flow chart with columns labeled Constraints, Explanation, and Questions. Constraints and the question asked for each:
• Opportunity sizing – Is ROI acceptable?
• Min count requirements – Is minimum count available?
• Required accuracy of prediction – Prediction accuracy unsatisfactory?
Other labels in the chart: Month 1, Month 2, Month 3; FT 1, FT 2; MMF, No MMF.]

Page 11: Simplifyd predictive analytics primer


REFERENCES

Simple Forecasting Techniques: http://itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm

Binomial Distributions: http://www.itl.nist.gov/div898/handbook/eda/section3/eda366i.htm

Exponential Smoothing: http://forecasters.org/pdfs/foresight/free/Issue19_goodwin.pdf

Decision Trees: http://www.salford-systems.com/resources/whitepapers/index.html

Linear Regression: http://faculty.chass.ncsu.edu/garson/PA765/regress.htm

Logistic Regression: http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm

ARIMA Regression (also called the Box-Jenkins methodology): http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc445.htm