
Page 1: Data analytics for marketing decision support

IBM Research

© 2007 IBM Corporation

Data Analytics for Marketing Decision Support

Saharon Rosset, Naoki Abe*

IBM T.J. Watson Research Center

*Acknowledgements to: Andrew Arnold, Chid Apte, John Langford, Rick Lawrence, Srujana Merugu,

Edwin Pednault, Claudia Perlich, Rikiya Takahashi and Bianca Zadrozny

Page 2: Data analytics for marketing decision support


Tutorial outline

Challenges of marketing analytics (SR)

– Integrating the “marketing” and “data mining” approaches

– Customizing data mining approaches to the challenges of marketing decision support

Survey of some useful ML methodologies (NA)

– Bayesian network modeling

– Utility-based classification (Cost-sensitive learning)

– Reinforcement learning and Markov decision processes

Detailed analysis and case studies:

– Customer lifetime value modeling (NA)

– Customer wallet estimation (SR)

Page 3: Data analytics for marketing decision support


The grand challenges of marketing

Maximize profits (duh)

Initiate, maintain and improve relationships with customers:

– Acquire customers

– Create loyalty, prevent churn

– Improve profitability (lifetime value)

Optimize use of resources:

– Sales channels

– Advertising

– Customer targeting

Page 4: Data analytics for marketing decision support


Some of the concrete modeling problems

Channel optimization

Cross/up-sell (customer targeting)

New customer acquisition

Churn analysis

Product life-cycle analysis

Customer lifetime value modeling

– Effect of marketing actions on LTV?

Advertising allocation

RFM (Recency, Frequency, Monetary) analysis

...

Page 5: Data analytics for marketing decision support


Data analytics for decision support: grand challenge

Beyond “modeling” the current situation, we need to offer insight about the effect or potential of possible actions and decisions:

– How would different channels / incentives affect LTV of our customers?

– How much more money could this customer be spending with us (customer wallet)?

– Can we predict the effects of new actions that have never been tried in historical data? What if they have been tried on a non-representative set?

– Can we be confident our results are actionable? Can we differentiate causality from correlation in our models?

Page 6: Data analytics for marketing decision support


Tutorial outline

Challenges of marketing analytics

– Integrating the “marketing” and “data mining” approaches

– Customizing data mining approaches to the challenges of marketing decision support

Survey of some useful ML methodologies

– Bayesian network modeling

– Utility-based classification (Cost-sensitive learning)

– Reinforcement learning and Markov decision processes

Detailed analysis and case studies:

– Customer lifetime value modeling

– Customer wallet estimation

Page 7: Data analytics for marketing decision support


Typical marketing analytics vs. data mining

CRM analytics:

– Relies on primary research (=surveys) to understand needs and wants

– Relies on (more or less) detailed models of customer behavior (usually parametric statistical models)

– Often estimates customer-level parameters

Data mining:

– Typically relies on data in Data Warehouse /Mart

– Uses minimum of parametric assumptions

– Often attempts to fit problem into “standard” modeling framework: classification, regression, clustering...


Page 8: Data analytics for marketing decision support


Comparison of approaches

Criterion                                                                  Marketing   DM
Parametric models formalize knowledge of domain and problems                   +        -
Robust against incorrect assumptions about domain and problems                 -        +
Actively collect the data to estimate model quantities (active learning)       +        -
Rely on existing, abundant data in Corporate Data Warehouses                   -        +
Integrate expert input from managers and customers ("wants and needs")         +        -
Use data to learn new, surprising patterns about customer behavior             -        +

Page 9: Data analytics for marketing decision support


Example 1: modeling and improving LTV

Rust, Lemon and Zeithaml (2004), “Return on Marketing: Using Customer Equity to Focus Marketing Strategy”, J. of Marketing

Modeling customer equity / lifetime value

– Combine several previous approaches

– Model the brand “switching matrix” as a function of customer preference, history and product properties

– Want to identify drivers of satisfaction (levers)

– Calculate effect (ROI) of marketing actions – pulling levers

Mostly relies on primary research collected specifically for this study

– Interviews with managers

– Survey of consumer preferences


Page 10: Data analytics for marketing decision support


Simplified version of paper’s business model

[Diagram: marketing investment drives "pulling levers", which produces increased equity; the investment's costs are weighed against the increased equity to compute the return on marketing investment.]

Main goals:

Identify relevant levers

Quantify their effect

Page 11: Data analytics for marketing decision support


Analytic setup (main components only)

logit(p_ijk) = β_0k LAST_ijk + x_ikᵗ β_k

– p_ijk is the probability that customer i buys item k given they bought item j previously

– LAST is a dummy variable for "inertia"

– x_ik is a feature vector for customer i, product k

This is used to compute the brand switching matrix {p_ijk}, and customer lifetime value is calculated as:

CLV_ij = Σ_t PROF_ij B_ijt

– PROF is a profit measure considering discounting, price & cost (assumed known)

– B_ijt is the probability that customer i buys product j at time t, calculated from the stochastic matrix {p_ijk}

Page 12: Data analytics for marketing decision support


Data definitions

Potential drivers (marketing activities) are reflected in the components of xi

– Price

– Quality of service

etc.

The data to estimate the logit model is based on:

– Expert (manager) input

– Questionnaires of customers

– Corporate data warehouse (not implemented in their case study...)

Page 13: Data analytics for marketing decision support


Results: important drivers for airline industry?

Driver        Coefficient   Std. error   Z score (coeff/std)
Inertia       .849          .075         11.34
Quality       .441          .041         10.87
Price         .199          .020          9.86
Convenience   .609          .093          6.56
...           ...           ...           ...

Etc. (all factors deemed important)

Page 14: Data analytics for marketing decision support


What would a data miner do?

Count more (or only) on historical data in data warehouse

– Variables would have different meaning

– Identify correlations, not necessarily drivers

Could use same analytic formulation, but also try alternative approaches

– Relate LTV directly to variables observed?

– Model transaction sizes in addition to switching?

– Use non-parametric modeling tools?

Etc.

Page 15: Data analytics for marketing decision support


Example 2: the segmentation approach

Common practice in marketing:

Define static, fixed customer segments

– Supposed to capture “true essence” of customers’ behaviors, needs and wants

– Often given catchy names: “Upwardly mobile businessmen” representing the “average” profile

Make marketing decisions at segment level, based on understanding of needs and wants


Page 16: Data analytics for marketing decision support


A market segmentation methodology

Based on Kotler (2000). Marketing Management. Prentice-Hall

1. Survey stage: primary research to capture motivations, attitudes, behaviors

2. Analysis stage: factor analysis, then clustering of the survey data to identify segments

3. Profiling stage: analyze segments and give them names

Additional stage often taken is to assign all customers to the defined segments:

4. Assignment stage: build classification model to assign all customers to learned segments

Page 17: Data analytics for marketing decision support


What would a data-miner do?

Option 1: clustering

– Replace primary research by warehouse data

– Cluster all customers

– Lose the “needs and wants” aspect

Option 2: supervised learning

– Treat each decision problem as a separate modeling task. E.g., find "positive" and "negative" examples for each binary decision and learn a model

– Advantage: customized

– Disadvantages:

• May not have right data to model decisions we want to make

• Past correlations may not be indicative of future outcomes

Page 18: Data analytics for marketing decision support


Comparison of approaches

Criterion                                                                  Marketing   DM
Parametric models formalize knowledge of domain and problems                   +        -
Robust against incorrect assumptions about domain and problems                 -        +
Actively collect the data to estimate model quantities (active learning)       +        -
Rely on existing, abundant data in Corporate Data Warehouses                   -        +
Integrate expert input from managers and customers ("wants and needs")         +        -
Use data to learn new, surprising patterns about customer behavior             -        +

Page 19: Data analytics for marketing decision support


An integrated approach

Count on historical data as much as possible

Avoid complex parametric models

– Let the data guide us

– Still want to integrate domain knowledge

Analyze and understand the special aspects of marketing modeling problems

– Importance of long-term relationship (lifetime value, loyalty)

– Effects of competition (customer wallet vs. customer spending)

Modify existing, or develop new, data analytics approaches to address problems properly


Page 20: Data analytics for marketing decision support


Tutorial outline

Challenges of marketing analytics

– Integrating the “marketing” and “data mining” approaches

– Customizing data mining approaches to the challenges of marketing decision support

Survey of some useful ML methodologies

– Bayesian network modeling

– Utility-based classification (Cost-sensitive learning)

– Reinforcement learning and Markov decision processes

Detailed analysis and case studies:

– Customer lifetime value modeling

– Customer wallet estimation

Page 21: Data analytics for marketing decision support


Moving beyond revenue modeling

To really understand the profitability and potential of our customers, we need to move beyond modeling their short-term revenue contribution

Revenue over time: Lifetime Value modeling

– How much can we expect to gain from customer over time?

– Incorporates loyalty/churn, prediction of future customer revenue

– LTV = ∫ S(t) v(t) D(t) dt  (S(t) is the customer survival function, v(t) the customer value over time, D(t) the discounting factor)

Potential revenue: Customer Wallet Estimation

– How much revenue could we be generating from this customer?

– Incorporates competition, brand switching etc.

Page 22: Data analytics for marketing decision support


LTV and Wallet: beyond standard modeling

[Figure: modeling tasks laid out by time (now → next year → the future) and by revenue type (actual vs. potential sales). Sales/revenue modeling covers actual sales now; sales forecasting covers actual sales next year; LTV modeling extends actual sales into the future; wallet estimation covers potential sales.]

Page 23: Data analytics for marketing decision support


Types of decision support

Passive decision support

– Understand more about problems and causes

– Identify areas of need, under-performance etc.

– Help in making better decisions

Active decision support

– Model the effect of actions

– Actively help in deciding between alternative actions

Active decision support is typically more challenging in terms of data needed to learn models

Page 24: Data analytics for marketing decision support


Depth and actionability of insights

[Figure: insights arranged by depth (from basic concepts/correlation to real insight/causality) and by actionability (passive to active). Revenue modeling and revenue forecasting sit at the passive, correlation end; lever identification, LTV modeling, and wallet estimation lie in between; understanding the effect of potential actions on LTV and wallet attainment sits at the active, causality end.]

Page 25: Data analytics for marketing decision support


The causality challenge

Predictive models discover correlation

– Example: linear regression. A significant t-statistic for a coefficient implies a significant association with the response, not that the variable is actually causing it

For active decision support we need to identify levers to pull to affect outcome

– Only works with causality

Causality is difficult to find or prove from observational data

– If we have knowledge about causality, we can formalize it as (say) Bayesian network and use in our models

– We can get closer to causality by case-control experiments

Page 26: Data analytics for marketing decision support


Illustration: predictive power is not causality

Assume we observe, for some companies, X = the company's marketing budget and Y = the company's sales, and we want to understand how to affect Y by controlling X

Assume we find that X is very "predictive" of Y

Possible scenarios:

– X → Y: causality; we have successfully identified a "lever"
– Y → X: a fixed percent of revenue allocated to marketing?
– X ← Z → Y: Z = company size independently determining both quantities?

Page 27: Data analytics for marketing decision support


Some other challenges

Modeling effects of new/unobserved actions

– Critical for active support, often difficult or impossible

– Even for established actions, they may have been applied in a different context than our planned campaign

Integrating expert knowledge into process

– Can be done formally via graphical models

Handling data issues: matching, leaks, cleaning

– Always critical

Delivering solutions and results

Page 28: Data analytics for marketing decision support


Example: Telecom Churn Management

A cell phone company has a set of customers; some leave (churn) every month

The goals of a Churn Management system:

Analyze the process of churn

– Causes

– Dynamics

– Effects on company

Design policies and actions to improve the situation

– Marketing campaigns

– Incentive allocation (offer new features or presents)

– Change in plans to contend with competition

Page 29: Data analytics for marketing decision support


First step: understand the current situation

Who is likely to churn (predictive patterns)?

– Phones features / plans

– Usage patterns

– Demographics

Tools: segmentation, classification, etc.

Which of these patterns are causal? Tools: expert knowledge, Bayesian networks, etc.

Which causal effects are not in the data? Competition, the economy, etc.

Which of these customers are profitable?

– Short term: customer value

– Long term: lifetime value

– Growth potential: customer wallet

Page 30: Data analytics for marketing decision support


Second step: design actions

Can we affect causal churn patterns?

– For example, by improving customer service

Given possible incentives and marketing actions, what effect will they have?

– Loyalty and relationship

– Current customer value and wallet attainment

– Customer lifetime value

– Cost to company

How can we optimize use of our marketing resources?

– Identify segments we want to retain

– Identify effective marketing actions

Page 31: Data analytics for marketing decision support


Tutorial outline

Challenges of marketing analytics

– Integrating the “marketing” and “data mining” approaches

– Customizing data mining approaches to the challenges of marketing decision support

Survey of some useful ML methodologies

– Bayesian network modeling

– Utility-based classification (Cost-sensitive and active learning)

– Reinforcement learning and Markov decision processes

Detailed analysis and case studies:

– Customer lifetime value modeling

– Customer wallet estimation

Page 32: Data analytics for marketing decision support


Survey of Useful Methodologies

Bayesian Networks

– Motivation: need to address causality vs. correlation issue; need to formalize domain knowledge about relationships in data

– Example domain: Customer wallet estimation

Utility-based classification* (Cost-sensitive Learning)

– Motivation: need to handle utility of decision and cost of data acquisition in marketing decision problems

– Example domains: Targeted marketing, Brand switch modeling

Markov Decision Processes (MDP) and Reinforcement Learning

– Motivation: need to consider long term profit maximization

– Example domain: Customer lifetime value modeling

*c.f. Utility-Based Data Mining Workshop at KDD’05 and KDD’06

Page 33: Data analytics for marketing decision support


Bayesian Network, a.k.a. Graphical Model

A Bayesian network is a directed acyclic graphical model and defines a probability model. Here is a simple example:

[Graph: Economy → Marketing, Economy → Competition; Marketing and Competition → Revenue]

P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C)

P(E)  ¬P(E)
0.3   0.7

E | P(M)  ¬P(M)
F | 0.3   0.7
T | 0.9   0.1

E | P(C)  ¬P(C)
F | 0.4   0.6
T | 0.7   0.3

M C | P(R)  ¬P(R)
F F | 0.3   0.7
T F | 0.9   0.1
F T | 0.2   0.8
T T | 0.6   0.4
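To make the factorization concrete, here is a minimal Python sketch (not part of the original tutorial) that encodes the CPTs above and computes the joint and a marginal by brute-force enumeration:

```python
from itertools import product

# CPTs from the tables above (probability that each variable is True).
p_E = 0.3
p_M = {True: 0.9, False: 0.3}                  # P(M=T | E)
p_C = {True: 0.7, False: 0.4}                  # P(C=T | E)
p_R = {(True, True): 0.6, (True, False): 0.9,  # P(R=T | M, C)
       (False, True): 0.2, (False, False): 0.3}

def pr(p_true, value):
    """Probability that a boolean variable takes `value`."""
    return p_true if value else 1.0 - p_true

def joint(m, e, c, r):
    # P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C)
    return pr(p_E, e) * pr(p_M[e], m) * pr(p_C[e], c) * pr(p_R[(m, c)], r)

# Sanity check: the joint sums to 1; marginal P(R=T) by enumeration.
total = sum(joint(*v) for v in product([True, False], repeat=4))
p_r_true = sum(joint(m, e, c, True) for m, e, c in product([True, False], repeat=3))
print(total, p_r_true)
```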

Page 34: Data analytics for marketing decision support


Bayesian Network as a General Unifying Framework

Bayesian Network provides a general framework that subsumes numerous known classes of probabilistic models, e.g.

– Naïve Bayes Classification

– Clustering (Mixture models)

– Auto regressive models

– Hidden Markov models, etc.

Bayesian Network provides a framework for discussing modeling, inference, causality, hidden variables, etc

[Figure: three models drawn as Bayesian networks. Naïve Bayes classification: a class node pointing to variables 1..N. Clustering/mixture: the same structure with the class node unobserved. Hidden Markov model: a chain of unobserved state nodes, each emitting a symbol.]

Page 35: Data analytics for marketing decision support


Estimation and Inference Problems for Bayesian Networks

Parameter estimation from data given structure

– Given a graphical structure as input, and a model class, estimate the parameters of the models

Inference given model

– Given a full Bayesian network (i.e. graph and model parameters) and partial information on the realized values, infer the unknown values

– Useful for business scenario analyses

Latent variable estimation given structure

– Given a full Bayesian network and data for the observed variables, infer the values for the unobserved (latent) variables

Bayesian network structure learning from data

– Given data only, infer the best Bayesian network, including both the graphical structure and the model parameters

Inferring causal structure from data

– Given data only, infer not only the underlying Bayesian Network but the causality between variables

Page 36: Data analytics for marketing decision support


Parameter Estimation (for Linear Gaussian Models)

Parameter estimation given graph structure reduces to standard estimation problem (e.g. maximum likelihood estimation) for the underlying model class

For example, for linear Gaussian models, it is solvable by linear regression

– P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M,C)

– M = β₁E + ε₁,  ε₁ ~ N(0, σ₁²)

– C = β₂E + ε₂,  ε₂ ~ N(0, σ₂²)

– R = β₃M + β₄C + ε₃,  ε₃ ~ N(0, σ₃²)

There is active research for many other underlying model classes

[Graph: Economy → Marketing, Economy → Competition; Marketing and Competition → Revenue]
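As a sketch of this reduction, the snippet below simulates data from an assumed linear Gaussian version of the network above and recovers each node's coefficients by ordinary least squares on its parents; the coefficients and sample size are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
E = rng.normal(size=n)
M = 1.5 * E + rng.normal(size=n)             # M = b1*E + eps1
C = -0.8 * E + rng.normal(size=n)            # C = b2*E + eps2
R = 2.0 * M + 1.0 * C + rng.normal(size=n)   # R = b3*M + b4*C + eps3

def ols(y, *parents):
    # One linear regression per node, given its parents in the graph.
    X = np.column_stack(parents)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef.round(2)

print(ols(M, E), ols(C, E), ols(R, M, C))    # ~[1.5], [-0.8], [2.0, 1.0]
```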

Page 37: Data analytics for marketing decision support


Inference in a Given Model

Given an estimated model and realized values for a subset of the variables, it may be possible to compute the most likely values for unknown variables.

Inference in unrestricted Bayesian Networks is intractable (#P-hard)

For restricted classes, it is possible to efficiently perform inference: e.g. dependency trees

– P(M,E,C,R) = P(E) P(M|E) P(C|E) P(R|M)

– P(M|E,C,R) ∝ P(M|E,C) P(R|M) = P(M|E) P(R|M)

– Simplified due to the conditional independence (d-separation) between M and C implied by the graph structure

But there are considerable challenges for graph structures including undirected cycles (e.g. original graph for P(M,E,C,R))

[Graph: the dependency-tree version: Economy → Marketing, Economy → Competition, Marketing → Revenue]

Page 38: Data analytics for marketing decision support


Structure Learning Given Data

For unrestricted classes, structure learning is known to be intractable.

– Even for the class of poly-trees, robust estimation (i.e. when the true distribution may not be in the target class) is NP-hard

For some restricted classes, structure learning is efficient

– Dependency Trees can be efficiently robustly learned

– Poly-trees can be efficiently learned, given the assumption that the true model is in the target class

There is active research on proving conditions under which variable selection methods in regression (e.g. Lasso) can provably learn structure of general graphs

[Figure: two example graphs over Economy, Marketing, Competition, and Revenue: a dependency tree (each node has at most one parent) and a poly-tree (no undirected cycles, though a node may have several parents)]

Page 39: Data analytics for marketing decision support


Inferring Causality from Data

Case 1 (a fork): Economy → Marketing, Economy → Revenue, with

P(M,E,R) = P(E) P(M|E) P(R|E), implying M ⊥ R | E

But the Markov-equivalent factorizations P(M,E,R) = P(M) P(E|M) P(R|E) and P(M,E,R) = P(R) P(E|R) P(M|E) encode exactly the same independences: the causal structure cannot be determined from data!

Case 2 (a v-structure): Marketing → Revenue ← Competition, with

P(M,C,R) = P(M) P(C) P(R|M,C), implying M ⊥ C (but not M ⊥ C | R)

No other structure encodes these independences: the causal structure can be determined from data! It can be inferred that Marketing can be a "lever" for controlling Revenue!

Cf. P. Spirtes, C. Glymour, and R. Scheines (2000)

Page 40: Data analytics for marketing decision support


Summary: Estimation and Inference with Bayesian Networks

Parameter estimation from data given structure

– It is efficiently solvable for many model classes

Inference given model

– Exact inference is known to be NP-complete for subclasses that include undirected cycles

– It is efficiently solvable for tree structures and many models used in practice

Latent variable estimation, given structure

– Local optimum estimation is often possible via EM-algorithms

Bayesian network structure learning from data

– It is known to be “intractable” for general classes

– It is even NP-complete to estimate “polytrees” robustly

Inferring causal structure from data

– Sometimes possible but in general not

Given these facts, determining network structure using domain knowledge and using it to do parameter estimation and inference is common practice


Page 41: Data analytics for marketing decision support


Tutorial outline

Challenges of marketing analytics

– Integrating the “marketing” and “data mining” approaches

– Customizing data mining approaches to the challenges of marketing decision support

Survey of some useful ML methodologies

– Bayesian network modeling

– Utility-based classification (Cost-sensitive learning)

– Reinforcement learning and Markov decision processes

Detailed analysis and case studies:

– Customer lifetime value modeling

– Customer wallet estimation

Page 42: Data analytics for marketing decision support


Cost-sensitive Learning for Marketing Decision Support

Use of Basic Machine Learning (e.g. Classification and Regression) in Marketing Decision Support is well accepted

– Example applications include: targeted marketing, credit rating, and others

– But are they the best we have to offer?

Regression is an inherently harder problem than is required

– One does not necessarily need to predict the business outcome, customer behavior, etc.; one is merely required to make business decisions

– Regression may fail to detect significant patterns, especially when data is noisy

Classification is an oversimplification

– By mapping to classification, one loses information on the degree of goodness/badness of a business decision in the past data

Cost-sensitive classification provides the desired middle ground

– It simplifies the problem almost to classification and thus allows discovery of significant patterns;

– Yet retains and exploits the information on the degree of goodness of business decisions, in a way that is motivated by Utility theory

Page 43: Data analytics for marketing decision support


Cost-sensitive Learning a.k.a. Utility-based Classification

In regression: given (x, r) ∈ X × ℝ generated from a sampling distribution, find F such that F(x) ≈ r
– E.g., r = profit obtained by targeting customer x

In classification: given (x, y) ∈ X × {0,1} generated from a sampling distribution, find F such that F(x) ≈ y
– E.g., y = 1 if customer x is "good", 0 otherwise

In utility-based classification: given a (stochastic) utility function U and (x, y) ∈ X × {0,1} generated from a sampling distribution, find F such that E[U(x, y, F(x))] is maximized (or equivalently, E[C(x, y, F(x))] is minimized)
– E.g., U(x, 1, 1) = Profit(x) = profit obtained by targeting customer x, when x is indeed a "good" customer.

Page 44: Data analytics for marketing decision support


Example Cost and Utility Functions

Simple formulations (cost/benefit matrices):

Classification utility matrix:
          Predicted 0   Predicted 1
True 0        1             0
True 1        0             1

Misclassification cost matrix:
          Predicted 0   Predicted 1
True 0        0             1
True 1        1             0

More realistic formulations (utility/cost dependent on individuals):

"Credit rating" utility:
           Predicted bad   Predicted good
True bad        0           - Default Amt
True good       0           Interest

"Targeted marketing" utility:
           Predicted bad   Predicted good
True bad        0           - C
True good       0           Profit - C

Page 45: Data analytics for marketing decision support


Bayesian Approach with Regression

For each example x, choose the class that minimizes the expected cost:

i*(x) = argmin_i Σ_j P(j|x) C(x, j, i)

where the conditional class probabilities P(j|x) need to be estimated!

Problem: requires conditional density estimation and regression to solve a classification problem
– The price is high computational and sample complexity

Merit: more flexibility and general applicability
– Business constraints
– Variability in fixed costs
– But is it necessary?

Page 46: Data analytics for marketing decision support


A Classification Approach: Reducing cost-sensitive learning to weighted classification via "Costing" [ZLA'03]

• If Y is {0,1}, then minimizing cost is equivalent to minimizing

E_{X,Y}[ I(h(x) ≠ y) · w(x, y) ],  where  w(x, y) = |C(x, y, 1) − C(x, y, 0)|

• Even though we have a 2 × 2 cost matrix, its minimization can be done using one weight per labeled example

• Given a distributional assumption on w(x, y), minimizing the above weighted error on the training data will generalize to unseen test instances!

• The "Costing" algorithm repeatedly performs weighted rejection sampling with w(x, y) to obtain an ensemble of hypotheses

• Similar approaches have been applied to class probability estimation ("probing" [LZ'05]) and quantile regression ("quanting" [LOZ'06])

Page 47: Data analytics for marketing decision support


Empirical Evaluation with Targeted Marketing data sets

Test set profits. Columns (per the original annotations): Costing (200) uses rejection sampling; Transparent Box feeds the weights to the learner; Resampling (100k) resamples with replacement; No weight is plain classification.

KDD-98: Charity donation

Method      Costing (200)   Transparent Box   Resampling (100k)   No weight
NB          $13163          $12367            $12026              $0.24
Boosted NB  $14714          $14489            $13135              -$1.36
C4.5        $15016          -$118             $2259               $0
SVMLight    $13152          $13683            $12808              $0

DMEF-2: Targeted marketing

Method      Costing (200)   Transparent Box   Resampling (100k)   No weight
NB          $37629          $32608            $12026              $16462
Boosted NB  $37891          $36381            $13135              $121
C4.5        $37500          $478              $2259               $0
SVMLight    $35290          $36443            $12808              $0

Costing shows exceptionally strong performance across base learners.

*Costing is state-of-the-art, but is restricted to 2-class problems

Page 48: Data analytics for marketing decision support


A Closer Look: "Costing" (cost-based bagging) [ZLA'03]

Costing(Learner A, Data S, count T)
(1) For all (x, y) ∈ S, set w_{x,y} = |C(x, y, 1) − C(x, y, 0)|   (the same weight is used in every iteration)
(2) For t = 1 to T:
      Let S_t = a rejection sample drawn from S with acceptance probability proportional to w_{x,y}
      Let h_t = A(S_t)
(3) Output H(x) = argmax_y Σ_{t=1}^T I(h_t(x) = y)   (a majority vote; it only makes sense for 2-class problems)
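A minimal sketch of the costing loop, assuming a 2-class task, example weights w as defined above, and a decision-tree base learner standing in for the paper's learners; the toy data and weight values are invented:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def costing(X, y, w, n_rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    z = w.max()
    ensemble = []
    for _ in range(n_rounds):
        # Accept each example with probability w / max(w) (rejection sampling).
        keep = rng.random(len(w)) < w / z
        ensemble.append(DecisionTreeClassifier().fit(X[keep], y[keep]))
    return ensemble

def predict(ensemble, X):
    # Majority vote over the hypotheses, as in step (3) of the algorithm.
    votes = np.mean([h.predict(X) for h in ensemble], axis=0)
    return (votes >= 0.5).astype(int)

# Toy usage: "good" customers are rare but carry a large weight.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=2000) > 1.2).astype(int)
w = np.where(y == 1, 20.0, 1.0)   # stand-in for |C(x,y,1) - C(x,y,0)|
print(predict(costing(X, y, w), X[:10]))
```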

Page 49: Data analytics for marketing decision support


A Multi-class Extension: Cost-sensitive boosting algorithm [AZL 2004]

Define the "expanded sample" S' as: S' = {((x, y'), y) | (x, y) ∈ S, y' ∈ Y}

GBSE(Learner A, Expanded data S', count T)
(1) For all (x, y), initialize H_0(x, y) = 1/|Y|
(2) For all (x, y) ∈ S', initialize the weight w_{x,y} = E_{y~H_0}[C(x, y)] − C(x, y)
(3) For t = 1 to T:
      (a) For all (x, y) in S', update the weight w_{x,y} = E_{y~H_{t−1}}[C(x, y)] − C(x, y)
          (the weight is updated in each iteration: the difference between the average cost under the current ensemble and the cost of y)
      (b) Let T' = {((x, y), I(w_{x,y} > 0)) | (x, y) ∈ S'}
      (c) Let h_t = A(T', |w|)
      (d) f_t = Stochastic(h_t)
      (e) F_t = (1 − α) F_{t−1} + α f_t
(4) Output h(x) = argmax_y F_T(x, y)

Page 50: Data analytics for marketing decision support


Gradient Boosting with Stochastic Ensembles: Illustration

[Figure: at learning iterations t and t+1, the cost C(x, y) of each candidate label is compared to the average cost E[C(x, y)] under the current ensemble; labels below the average receive positive training labels, those above receive negative ones.]

• The difference between the current average cost and the cost associated with a particular label is the boosting weight

• The sign of the weight, E[C(x,y)] − C(x,y), is the training label

Page 51: Data analytics for marketing decision support


Cost-sensitive boosting outperforms existing methods of cost-sensitive learning as well as classification and regression

Average test set cost (± SE); Bagging, AvgCost, and MetaCost are existing methods:

Data Set    Bagging       AvgCost    MetaCost     GBSE
Annealing   1059 ± 174    127 ± 12   207 ± 42     34 ± 4
Solar       5403 ± 397    237 ± 38   5317 ± 390   48 ± 10
KDD-99      319 ± 42      42 ± 8     49 ± 9       2 ± 1
Letter      151 ± 3       92 ± 1     130 ± 2      85 ± 2
Splice      64 ± 5        61 ± 4     50 ± 3       58 ± 4
Satellite   190 ± 10      108 ± 6    104 ± 6      93 ± 6

Page 52: Data analytics for marketing decision support


Tutorial outline

Challenges of marketing analytics

– Integrating the “marketing” and “data mining” approaches

– Customizing data mining approaches to the challenges of marketing decision support

Survey of some useful ML methodologies

– Utility-based classification (Cost-sensitive learning)

– Reinforcement learning and Markov decision processes

– Bayesian network modeling

Detailed analysis and case studies:

– Customer lifetime value modeling

– Customer wallet estimation

Page 53: Data analytics for marketing decision support


Sequential Cost-sensitive Decision Making by Reinforcement Learning

Cost-sensitive classification provides an adequate framework for a single marketing decision

– Real-world marketing decisions are rarely made in isolation; they are made sequentially

– Need to address the sequential dependency in decision making

Cost-sensitive classification

– Maximizes E[U(x, h(x))]

We now wish to

– Maximize Σ_t E[U(x_t, h(x_t))], where x_t may depend on earlier decisions…

This is nothing but Reinforcement Learning, if we view x as the "state"

– Maximize Σ_t E[U(s_t, π(s_t))], where s_t is determined stochastically according to a transition probability determined by s_{t−1} and π(s_{t−1}).

Page 54: Data analytics for marketing decision support


Review: Markov Decision Process (MDP)

At any given time t, the agent is in some state s.

It takes an action a, and makes a transition to the next state s’, dictated by transition probability T(s,a)

It then receives a “reward”, or utility U(s,a), which also depends on state s and action a.

The goal of a reinforcement learner in MDP is to learn a policy, namely π: S → A, mapping states to actions, so as to maximize the cumulative discounted reward:

R = Σ_{t=0}^∞ γ^t U(s_t, a_t)
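For concreteness, here is a toy value-iteration sketch in Python; the three customer states, two actions, transition probabilities, and rewards are invented placeholders, not from the tutorial:

```python
import numpy as np

states, actions, gamma = ["one-timer", "repeater", "loyal"], ["no-mail", "mail"], 0.9

# T[s][a] = list of (next_state, probability); U[s][a] = expected reward U(s, a).
T = [[[(0, 0.9), (1, 0.1)], [(0, 0.6), (1, 0.4)]],
     [[(1, 0.8), (0, 0.2)], [(1, 0.6), (2, 0.4)]],
     [[(2, 0.9), (1, 0.1)], [(2, 0.95), (1, 0.05)]]]
U = [[0.0, -1.0], [2.0, 1.0], [5.0, 6.0]]

def q(s, a, V):
    # One-step lookahead: Q(s,a) = U(s,a) + gamma * E[V(s')]
    return U[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])

V = np.zeros(len(states))
for _ in range(200):  # iterate V(s) = max_a Q(s,a) to convergence
    V = np.array([max(q(s, a, V) for a in range(len(actions)))
                  for s in range(len(states))])

policy = {states[s]: actions[int(np.argmax([q(s, a, V) for a in range(len(actions))]))]
          for s in range(len(states))}
print(policy, V.round(2))
```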

Page 55: Data analytics for marketing decision support


Modeling the CRM process using a Markov Decision Process (MDP)

A customer is in some "state" (his/her attributes) at any point in time

The retailer's action will move the customer into another state

The retailer's goal is to take a sequence of actions that guides the customer's path so as to maximize the customer's lifetime value

Reinforcement Learning produces optimized targeting rules of the form: if the customer is in state "s", then take marketing action "a"

– Customer state "s" is represented by the current customer attribute vector (see the example features on page 64)

– The system estimates LTV(s, a); the best policy is to choose a to maximize LTV(s, a)

[Figure: a typical CRM process as a state diagram, with customer states such as One Timer, Bargain Hunter, Repeater, Potentially Valuable, Loyal Customer, Valuable Customer, and Defector, and campaigns A-E moving customers between states.]

MDPs and Reinforcement Learning provide an advanced framework for modeling customer lifetime value

Page 56: Data analytics for marketing decision support


Observed lifetime value reflects only the lifetime value attained under the current marketing policy, and therefore fails to capture the customer's potential lifetime value

MDP-based lifetime value modeling allows modeling of lifetime value under an optimized marketing policy (= the output of the system!)

[Figure: customer A's path through the state diagram under the current marketing policy vs. under the optimized marketing policy.]

• The estimated (potential) lifetime value will be based on the optimal path

• The output policy will lead the customer through the same path

MDP enables genuine lifetime value modeling, in contrast to existing approaches that use observed lifetime value

Page 57: Data analytics for marketing decision support


And here is how this is possible…

The MDP enables the use of data for many customers in various stages (states) to determine potential lifetime value of a particular customer in a particular state

Reinforcement Learning can estimate the lifetime value (function) without explicitly estimating the MDP itself

– The key lies in the value iteration procedure based on “Bellman’s equation”

[Figure: a chain of customer states (e.g., Repeater → Potentially Valuable → Loyal Customer → Valuable Customer) with targeting rules a-d along the path; each rule is, in effect, trained with data corresponding to all subsequent states.]

LTV of a state = reward now + LTV of the best next state:

Q(s, a) = E[U(s, a)] + γ max_{a'} Q(s', a')

Page 58: Data analytics for marketing decision support


Reinforcement Learning Methods with Function Approximation

Value Iteration (based on the Bellman Equation)

– Provides the basis for classic reinforcement learning methods like Q-learning:

Q_0(s, a) = E[U(s, a)]
Q_{k+1}(s, a) = E[U(s, a)] + γ max_{a'} Q_k(s', a')
π(s) = argmax_a Q(s, a)

Batch Q-Learning (with Function Approximation)

– Solves value iteration as iterative regression problems:

Q_0(s, a) = U(s, a)
Q_{k+1}(s, a) = (1 − α) Q_k(s, a) + α (U(s, a) + γ max_{a'} Q_k(s', a'))

Estimate Q(s, a) using function approximation (regression)
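A sketch of the batch (fitted) Q-learning recursion above, with a random-forest regressor as the function approximator; the logged (s, a, r, s') tuples, feature encoding, and regressor choice are all assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q(S, A, R, S_next, actions, gamma=0.9, n_iters=5):
    model = None
    for _ in range(n_iters):
        if model is None:
            target = R                      # Q_0(s, a) = U(s, a)
        else:
            # Bellman target: U(s, a) + gamma * max_a' Q_k(s', a')
            q_next = np.column_stack([
                model.predict(np.column_stack([S_next, np.full(len(S_next), a)]))
                for a in actions])
            target = R + gamma * q_next.max(axis=1)
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(np.column_stack([S, A]), target)   # regress target on (s, a)
    return model

# Toy usage on random transitions with a 1-d state and binary action.
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 1))
A = rng.integers(0, 2, size=500)
S_next = S + (A[:, None] - 0.5)
R = S.ravel() * A
q_model = fitted_q(S, A, R, S_next, actions=[0, 1])
print(q_model.predict([[0.5, 1], [0.5, 0]]))   # LTV-style values of mail vs. no-mail
```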

Page 59: Data analytics for marketing decision support


The graph below plots profits per campaign obtained in monthly campaigns over 2 years (in an empirical evaluation using benchmark data, i.e., the KDD Cup 98 data)

Lifetime value modeling based on reinforcement learning can achieve greater long-term profits than the traditional approach

[Figure: profits per campaign over the monthly campaigns; the output policy of the MDP approach (CCOM) "invests" in initial campaigns to yield greater long-term profits.]

Page 60: Data analytics for marketing decision support


Tutorial outline

Challenges of marketing analytics

– Integrating the “marketing” and “data mining” approaches

– Customizing data mining approaches to the challenges of marketing decision support

Survey of some useful ML methodologies

– Utility-based classification (Cost-sensitive and active learning)

– Reinforcement learning and Markov decision processes

– Bayesian network modeling

Detailed analysis and case studies:

– Customer lifetime value modeling

– Customer wallet estimation

Page 61: Data analytics for marketing decision support


Lifetime Value Modeling and Cross-Channel Optimized Marketing (CCOM)

[Figure: marketing actions and dollar flows across five channels: direct mail, kiosk, Web, store, and call center.]

Optimizes targeted marketing across multiple channels for lifetime value maximization.

Combines scalable data mining and reinforcement learning methods to realize unique capability.

Page 62: Data analytics for marketing decision support


CCOM Pilot Project with Saks Fifth Avenue

Business Problem addressed: Optimizing direct mailing to maximize lifetime revenue at the store (and other channels)

Provided solution for the “Cross-Channel Challenge”: No explicit linking between marketing actions in one channel and revenue in another

The CCOM mailing policy was shown to achieve a 7-8% increase in expected revenue in the store (in laboratory experiments)!

[Figure: the CCOM-pilot business problem: direct-mail marketing actions in one channel, revenue arising in the store (cf. the CRM process on page 56).]

Page 63: Data analytics for marketing decision support


Some Example Features

Feature (description)                                                  action   reward

Demographic Features
FULL_LINE_STORE_OF_RES.   (a full-line store exists in the area)        0.018    0.004
NON_FL_STORE_OF_RES.      (a non-full-line store exists in the area)    0.012   -0.004

Transaction Features (divisions relevant to the current campaign)
CUR_DIV_PURCHASE_AMT_1M    (purchase amt, last month, current div)      0.065    0.090
CUR_DIV_PURCHASE_AMT_2_3M  (purchase amt, 2-3 months, current div)      0.099    0.080
CUR_DIV_PURCHASE_AMT_4_6M  (purchase amt, 4-6 months, current div)      0.133    0.091
CUR_DIV_PURCHASE_AMT_1Y    (purchase amt, last year, current div)       0.162    0.128
CUR_DIV_PURCHASE_AMT_TOT   (total purchase amt, current division)       0.153    0.147

Promotion History Features (divisions relevant to the current campaign)
CUR_DIV_N_CATS_1M    (catalogs sent last month, current div)            0.294    0.028
CUR_DIV_N_CATS_2_3M  (catalogs sent 2-3 months ago, current div)        0.260    0.025
CUR_DIV_N_CATS_4_6M  (catalogs sent 4-6 months ago, current div)        0.158    0.062
CUR_DIV_N_CATS_TOT   (total catalogs sent to date, current div)         0.254    0.062

Control Variable
ACTION   (to mail or not to mail)                                       1.000    0.008

Target (Response) Variable
REWARD   (expected cumulative profits)                                  0.008    1.000

Page 64: Data analytics for marketing decision support


The Cross-Channel Challenge and Solution

The Challenge: no explicit linking between actions in one channel (mailing) and rewards in another (revenue)

• Very low correlation observed between actions and responses
• Other factors determining "lifetime value" may dominate over the control variable (marketing action) in the estimation of "expected value"
• The obtained models can be independent of the action and give rise to useless rules!

The Cross-Channel Solution: learn the relative advantage of competing actions!

[Figure: values of actions a1 and a2 in states s1 and s2, under the standard method (approximating the absolute values) vs. the proposed method (modeling the relative advantage between the actions in each state).]

Page 65: Data analytics for marketing decision support


The Learning Method

Definition of Advantage

– A(s, a) := (Q(s, a) − max_{a'} Q(s, a')) / Δt

Advantage Updating Procedure [Baird '94]

– Modifications: 1. initialization with the empirical lifetime value; 2. batch learning with optional function approximation

Repeat:
1. Learn
   1.1. A(s, a) := (1 − α) A(s, a) + α (A_max(s) + (R(s, a) + γ^Δt V(s') − V(s)) / Δt)
   1.2. Use regression to estimate A(s, a)
   1.3. V(s) := (1 − β) V(s) + β (V(s) + (A_max-new(s) − A_max-old(s)) / α)
2. Normalize
   A(s, a) := (1 − ω) A(s, a) + ω (A(s, a) − A_max(s))

Page 66: Data analytics for marketing decision support


Evaluation Results

Significant policy advantage observed with a small number of iterations

Obtained a policy with 7-8% policy advantage, i.e., a 7-8% increase in expected revenue (for the 1.6 million customers considered)

The mailing policy was constrained to mail the same number of catalogues in each campaign as last year

CCOM evaluates a sequence of models and outputs the best model

[Figure: policy advantage (percent) vs. learning iterations (1-5) for two typical runs (versions 1 and 2).]

Page 67: Data analytics for marketing decision support


Evaluation Method

Challenge in Evaluation: Need to evaluate new policy using data collected by existing (sampling) policy

Solution: Use bias-corrected estimation of “policy advantage” using data collected by sampling policy

Definition of policy advantage:

– (Discrete-time) advantage: A_π(s, a) := Q_π(s, a) − max_{a'} Q_π(s, a')

– Policy advantage: A_{s~π}(π') := E_π[ E_{a~π'}[ A_π(s, a) ] ]

Estimating the policy advantage with bias-corrected sampling:

– A_{s~π}(π') = E_π[ (π'(a|s) / π(a|s)) A_π(s, a) ]

Page 68: Data analytics for marketing decision support


Combination of reinforcement learning (MDP) with predictive data mining enables automatic generation of trigger-based marketing targeting rules

Optimized with respect to the customer’s potential lifetime value

Stated in simple “if then” style, which supports flexibility and compatibility

Refined to make reference to detailed customer attributes and hence, well-suited to event and trigger-based marketing

This is made possible by

– Representing the states in MDP by customer’s attribute vectors

– Combining reinforcement learning with predictive data mining to estimate lifetime value as function of customer attributes and marketing actions

An example marketing targeting rule output by the CCOM system

Page 69: Data analytics for marketing decision support


Some examples of rules output by CCOM

Interpretation: if a customer has spent in the current division but enough catalogues have been sent, then don't mail
• Avoid saturation effects
• Differentiate between customers who may be near saturation and those who are not

Interpretation: if a customer has spent in the current division and has received moderately many relevant catalogues, then mail
• Invest in a customer until it is known not to be worth it

Interpretation: if a customer has spent significantly in the past and yet has not spent much in the current division (product group), then don't mail

Page 70: Data analytics for marketing decision support


CCOM - Logical Data Model

[Entity-relationship diagram; entities and their attributes are listed below. One entity is marked optional in the original diagram.]

Marketing Event: Event Identifier, Channel Identifier, Event Date, Event Category Description, Fixed Cost
Customer: Customer Identifier, First Name, Last Name, Age, Gender
Transaction: Customer Identifier, Transaction Date, Product Category Identifier, Event Identifier, Channel Identifier, Transaction Revenue, Transaction Profit
Customer Marketing Action: Event Identifier, Customer Identifier, Marketing Action Date, Marketing Action
Period: Period Identifier, Period Duration
Customer Profile History: Customer Identifier, Profile History Date, Period Identifier, Product Category Identifier, Channel Identifier, Aggregated Count of Event, Aggregated Revenue, Aggregated Profit
Channel: Channel Identifier, Channel Description
Product Category: Product Category Identifier, Product Category Description
Customer Loyalty Level History: Customer Identifier, Loyalty Level Start Date, Loyalty Level End Date, Loyalty Level
Event-Product Category: Event Identifier, Product Category Identifier, Weight

CCOM Output Models:
Marketing Policy Model: Model Identifier, Model Type, Model
Lifetime Value Model: Model Identifier, Model Type, Model

CCOM is generically applicable by mapping physical data to this model

*Developed with CBO

Page 71: Data analytics for marketing decision support


Tutorial outline

Challenges of marketing analytics

– Integrating the “marketing” and “data mining” approaches

– Customizing data mining approaches to the challenges of marketing decision support

Survey of some useful ML methodologies

– Bayesian network modeling

– Utility-based classification (Cost-sensitive learning)

– Reinforcement learning and Markov decision processes

Detailed analysis and case studies:

– Customer lifetime value modeling

– Customer wallet estimation

Page 72: Data analytics for marketing decision support


Wallet Estimation Case Study Outline

Introduction

– Business motivation and different wallet definitions

Modeling approaches for conditional quantile estimation

– Local and global models

– Empirical evaluation

A graphical model approach to wallet estimation

– Generic algorithm for class of latent variable modeling problems

MAP (Market Alignment Program)

– Description of application and goals

– The interview process and the feedback loop

– Evaluation of Wallet models performance in MAP

Page 73: Data analytics for marketing decision support


What is Wallet (AKA Opportunity)?

Total amount of money a company can spend on a certain category of products.

[Figure: nested circles: IBM sales inside the IT wallet, inside company revenue.]

IBM sales ≤ IT wallet ≤ Company revenue

Page 74: Data analytics for marketing decision support


Why Are We Interested in Wallet?

Customer targeting

– Focus on acquiring customers with high wallet

– Evaluate customers’ growth potential by combining wallet estimates and sales history

– For existing customers, focus on high wallet, low share-of-wallet customers

Sales force management

– Make resource assignment decisions

• Concentrate resources on untapped wallet

– Evaluate the success of sales personnel and sales channels by the share-of-wallet they attain

(Related IBM applications: OnTarget, MAP)

Page 75: Data analytics for marketing decision support


Wallet Modeling Problem

Given:

– customer firmographics x (from D&B): industry, employee count, company type, etc.

– customer revenue r

– IBM relationship variables z: historical sales by product

– IBM sales s

Goal: model customer wallet w, then use it to “predict” present/future wallets

No direct training data on w or information about its distribution!

Page 76: Data analytics for marketing decision support


Historical Approaches within IBM

Top down: this is the approach used by IBM Market Intelligence in North America (called ITEM)

– Use econometric models to assign total "opportunity" to a segment (e.g., industry × geography)

– Assign to companies in segment proportional to their size (e.g., D&B employee counts)

Bottom up: learn a model for individual companies

– Get "true" wallet values through surveys or appropriate data repositories (these exist, e.g., for credit cards)

Many issues with both approaches (won’t go into detail)

– We would like a predictive approach from raw data

Page 77: Data analytics for marketing decision support


Relevant Work in the Literature

While wallet (or share of wallet) is widely recognized as important, not much work on estimating it:

Du, Kamakura and Mela (2006) developed “list augmentation” approach, using survey data to model spending with competitors

Epsilon Data Management, in a 2001 white paper, proposed a survey-based methodology

Zadrozny, Costa and Kamakura (2005) compared bottom-up and top-down approaches on IBM data. Evaluation is based on a survey.

Page 78: Data analytics for marketing decision support


Traditional Approaches to Model Evaluation

Evaluate models based on surveys

– Cost and reliability issues

Evaluate models based on high-level performance indicators:

– Do the wallet numbers sum up to numbers that “make sense” at segment level (e.g., compared to macro-economic models)?

– Does the distribution of differences between predicted Wallet and actual IBM Sales and/or Company Revenue make sense? In particular, are the percentages bigger/smaller where we expect them to be?

– Problem: no observation-level evaluation

Page 79: Data analytics for marketing decision support


Proposed Hierarchical IT Wallet Definitions

TOTAL: Total customer available IT budget

– Probably not quantity we want (IBM cannot sell it all)

SERVED: Total customer spending on IT products covered by IBM

– Share of wallet is portion of this number spent with IBM?

REALISTIC: IBM sales to “best similar customers”

– This can be concretely defined as a high percentile of P(IBM revenue | customer attributes)

– Fits typical definition of opportunity?

REALISTIC ≤ SERVED ≤ TOTAL

Page 80: Data analytics for marketing decision support


An Approach to Estimating SERVED Wallets

Wallet is unobserved; all other variables are observed

The two families of variables, firmographics and IBM relationship, are conditionally independent given the wallet

We develop inference procedures and demonstrate them

Theoretically attractive, practically questionable

(We will come back to this later)

[Graph: company firmographics → SERVED wallet; SERVED wallet and historical relationship with IBM → IT spend with IBM]

Page 81: Data analytics for marketing decision support


REALISTIC Wallet: Percentile of a Conditional Distribution

Distribution of IBM sales to the customer given customer attributes: s | r, x, z ~ f_{r,x,z}

E.g., the standard linear regression assumption: s | r, x, z ~ N(βᵗ(r, x, z), σ²)

What we are looking for is the pth percentile of this distribution

[Figure: the conditional density of s, with the mean E(s|r,x,z) and the REALISTIC percentile marked.]
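Under the Gaussian assumption above, the pth conditional percentile has a closed form: the conditional mean shifted by z_p standard deviations. A tiny sketch with invented numbers (this is also the source of the "+1.28" adjustment used as a baseline later):

```python
from scipy.stats import norm

mu, sigma, p = 120.0, 40.0, 0.9          # illustrative E[s|r,x,z] and residual sd
realistic = mu + sigma * norm.ppf(p)     # 0.9 quantile of N(mu, sigma^2)
print(realistic)                         # ~171.3, since norm.ppf(0.9) ~ 1.28
```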

Page 82: Data analytics for marketing decision support


Estimating Conditional Distributions and Quantiles

Assume for now that we know which percentile p we are looking for

First observe that modeling the complete conditional distribution P(s|r,x,z) well is sufficient

– If we have a good parametric model and distribution assumptions, we can also use it to estimate quantiles

– E.g.: linear regression under a linear model and homoskedastic i.i.d. Gaussian error assumptions

Practically, however, it may not be a good idea to count on such assumptions

– Especially not a Gaussian model, because of statistical robustness considerations

Page 83: Data analytics for marketing decision support


Modeling REALISTIC Wallet Directly

REALISTIC defines the wallet as the pth percentile of the conditional distribution of spending given customer attributes

– This implies that some fraction (1 − p) of customers are spending their full wallet with IBM

Two obvious ways to get at the pth percentile:

– Estimate the conditional distribution by integrating over a neighborhood of similar customers, and take the pth percentile of spending in the neighborhood

– Create a global model for the pth percentile by building global regression models

Page 84: Data analytics for marketing decision support


Local Models: K-Nearest Neighbors

Design a distance metric, e.g.:
– Same industry
– Similar employees/revenue
– Similar IBM relationship

Neighborhood sizes (k):
– Neighborhood size has a significant effect on prediction quality

Prediction:
– Quantile of the firms in the neighborhood

[Figure: the universe of IBM customers with D&B information plotted by industry, employees, and IBM spend; the neighborhood of a target company i is selected, and the wallet estimate is a high quantile of the frequency distribution of IBM sales within that neighborhood.]
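A minimal Q-kNN sketch under these choices, with synthetic firmographic features and long-tailed spend standing in for the D&B and IBM data; a real distance metric would encode industry and size as above:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def qknn_wallet(X_train, spend_train, X_query, k=50, p=90):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_query)
    # p-th percentile of spending within each query firm's neighborhood.
    return np.percentile(spend_train[idx], p, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))                    # firmographic features
spend = np.exp(X[:, 0] + rng.normal(size=5000))   # long-tailed spend
print(qknn_wallet(X, spend, X[:5]))
```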

Page 85: Data analytics for marketing decision support


Global Estimation: the Quantile Loss Function

Our REALISTIC wallet definition calls for estimating the pth quantile of P(s|x,z).

Can we devise a loss function which correctly estimates the quantile on average? Answer: yes, the quantile loss function for quantile p:

L_p(y, ŷ) = p (y − ŷ)         if y ≥ ŷ
L_p(y, ŷ) = (1 − p) (ŷ − y)   if y < ŷ

This loss function is optimized in expectation when we correctly predict REALISTIC:

argmin_ŷ E[L_p(y, ŷ) | x] = the pth quantile of P(y|x)
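A quick numeric check of this property: the constant that minimizes the average quantile (pinball) loss over a sample is the sample's pth quantile. The lognormal sample and grid search are illustrative only:

```python
import numpy as np

def quantile_loss(y, y_hat, p):
    r = y - y_hat
    return np.mean(np.where(r >= 0, p * r, (p - 1) * r))

rng = np.random.default_rng(0)
y = rng.lognormal(size=100_000)
p = 0.9
grid = np.linspace(0, 10, 2001)
best = grid[np.argmin([quantile_loss(y, c, p) for c in grid])]
print(best, np.quantile(y, p))  # the two agree up to grid resolution
```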

Page 86: Data analytics for marketing decision support


Some Quantile Loss Functions

[Figure: quantile loss vs. residual (observed − predicted) for p = 0.8 and p = 0.5 (absolute loss): piecewise-linear losses with slope −(1 − p) for negative residuals and slope p for positive residuals.]

Page 87: Data analytics for marketing decision support


Quantile Regression

Squared loss regression:
– Estimation of the conditional expected value by minimizing the sum of squares:

min_β Σ_{i=1}^n (s_i − f_β(x_i, z_i))²

Quantile regression:
– Minimize the quantile loss instead:

min_β Σ_{i=1}^n L_p(s_i, f_β(x_i, z_i))

where L_p(y, ŷ) = p (y − ŷ) if y ≥ ŷ, and (1 − p) (ŷ − y) if y < ŷ

Implementation:
– Assume a linear function in some representation, y = βᵗ f(x, z); solve using linear programming
– Linear quantile regression package in R (Koenker, 2001)

Page 88: Data analytics for marketing decision support


Quantile Regression Tree - Local or Global?

Motivation:
– Identify a locally optimal definition of neighborhood
– Inherently nonlinear

Adjustments of M5/CART for quantile prediction:
– Predict the quantile rather than the mean of the leaf
– Empirically, splitting/pruning criteria do not require adjustment

[Figure: a quantile regression tree with splits such as Industry = 'Banking', Sales < 100K, and IBM Rev 2003 > 10K; each leaf holds a frequency distribution of IBM sales for the firms reaching it, and the wallet estimate is a high quantile of that leaf's distribution.]

Page 89: Data analytics for marketing decision support


Aside: Log-Scale Modeling of Monetary Quantities

Due to the exponential, very long-tailed distributions typical of monetary quantities (like sales and wallet), it is typically impossible to model them on the original scale, because e.g.:
– The biggest companies dominate modeling and evaluation
– Any implicit homoskedasticity assumption in using a fixed loss function is invalid

Log scale is often statistically appropriate, for example if % change is likely to be "homoskedastic"

Major issue: models are ultimately judged in dollars, not log-dollars…

Page 90: Data analytics for marketing decision support


Empirical Evaluation: Quantile Loss

Setup

– Four domains with relevant quantile modeling problems: direct mailing, housing prices, income data, IBM sales

– Performance on the test set in terms of the 0.9th quantile loss

– Approaches: linear quantile regression, Q-kNN, quantile trees, bagged quantile trees, quanting (Langford et al. 2006 -- reduces quantile estimation to averaged classification using trees)

Baselines

– Best constant model

– Traditional regression models for expected values, adjusted under a Gaussian assumption (+1.28 standard deviations)

Page 91: Data analytics for marketing decision support


Performance on Quantile Loss

Conclusions

– Standard regression models are not competitive

– If there is a time-lagged variable, LinQuantReg is best

– Otherwise, bagged quantile trees (and quanting) perform best

– Q-kNN is not competitive


Residuals for Quantile Regression

Total positive holdout residuals: 90.05% (18009/20000), close to the nominal 90% expected of a well-calibrated 0.9th-quantile model (a helper for this check is sketched below)
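A hypothetical helper for this calibration check (for a 0.9th-quantile model the rate should sit near 0.9):

import numpy as np

def positive_residual_rate(y_true, y_pred):
    """Fraction of positive holdout residuals; should be near p for a p-quantile model."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) > 0))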


Graphical Model for SERVED(?) Wallet Estimation

[Diagram: Customer’s Firmographics (X) → Customer’s IT Wallet (W) → Customer’s Spending with IBM (S) ← Customer’s Relationship with IBM (Z). View 1 is the X → W side, View 2 the Z → S side.]

Two conditionally independent views!


Generic Problem Setting

Unsupervised learning scenario:

Unobserved target variable

Observations on multiple predictor variables

Domain knowledge suggesting that the predictors form multiple conditionally independent views

Goal: To predict the target variable


Summary of Results on Generic Problem

Analysis of a relevant class of latent variable models:

– The Markov blanket can be split into conditionally independent views

– For exponential linear models, maximum likelihood estimation reduces to a convex optimization problem

Solution approaches for Gaussian likelihoods:

– Reduction to a single linear least-squares regression

– ANOVA for testing the conditional independence assumptions

Empirical evaluation:

– Comparable to supervised learning with a significant amount of training data

– Case study on wallet estimation


Discriminative Maximum Likelihood Inference

Given: a directed graphical model and the parametric form of the conditional distributions of nodes given their parents

Goal: Predict the target W using the parameter estimates Θ* that are most likely given the observed data and the graphical model:

    Θ* = argmax_Θ log P_{D,Θ}(S | X, Z) = argmax_{θ₀,θ₁} log ∫ P_{θ₀}(w | X) P_{θ₁}(S | w, Z) dw

where Θ = (θ₀, θ₁) is the parameter vector for the parametric conditional likelihoods, and D is our data

Solution: Expectation-Maximization (EM) algorithm

– Converges to a local optimum in general

Estimating W: mean or mode of the “posterior” P_{Θ*}(W | X, Z)


General Theoretical Result: Exponential Models

Theorem: When the conditional distributions p(W|X) and p(S|W,Z) correspond to exponential linear models with matching link functions, the incomplete discriminative log-likelihood L_D(Θ) = log P_{D,Θ}(S|X,Z) is a concave function of the parameters Θ

Maximum likelihood estimation reduces to a convex optimization problem

The EM algorithm converges to the globally optimal solution


Gaussian Likelihoods and Linear Regression

Assume both discriminative likelihoods P(W|X) and P(S|W,Z) are linear and Gaussian:

    w_i − βᵀx_i = ε_i^w ~ N(0, σ_w²) i.i.d.

    s_i − w_i − γᵀz_i = ε_i^s ~ N(0, σ_s²) i.i.d.

The previous theorem says that EM gives the ML solution Θ_MLE = (β_MLE, γ_MLE)

But if we add the equations up, we eliminate W:

    s_i − βᵀx_i − γᵀz_i = (ε_i^s + ε_i^w) ~ N(0, σ_s² + σ_w²) i.i.d.

The maximum likelihood solution of this problem is linear regression, giving the solution Θ_LS = (β_LS, γ_LS)

– Are the two solutions the same?


Equivalence and Interpretation

Equivalence Theorem: When U = [X, Z] is a full column rank matrix, the two estimates are identical: Θ_MLE = Θ_LS

Consistency of Θ_LS, and unbiasedness of the resulting W estimates

Can make use of linear regression computation and inference tools

– In particular: ANOVA to test the validity of the assumptions

Some caveats we glossed over

– In particular, the full-rank requirement implies we cannot have an intercept in both Gaussian likelihoods!

A numerical sketch of this reduction follows.
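The sketch below simulates the two Gaussian equations (dimensions and coefficients are illustrative), eliminates the latent W by adding them, and recovers (β, γ) with one least-squares fit on U = [X, Z]:

import numpy as np

rng = np.random.default_rng(0)
n = 2000
beta, gamma = np.array([1.5, -2.0]), np.array([0.7])
X = rng.normal(size=(n, 2))
Z = rng.normal(size=(n, 1))
w = X @ beta + rng.normal(scale=1.0, size=n)        # w_i = beta^T x_i + eps_w
s = w + Z @ gamma + rng.normal(scale=0.5, size=n)   # s_i = w_i + gamma^T z_i + eps_s

U = np.hstack([X, Z])                               # full column rank, no intercepts
theta_ls, *_ = np.linalg.lstsq(U, s, rcond=None)    # Theta_LS = (beta_LS, gamma_LS)
print(theta_ls)                                     # approx [1.5, -2.0, 0.7]
w_hat = X @ theta_ls[:2]                            # estimate of the latent wallet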


ANOVA for Testing Independence Assumptions

ANOVA: Variance-based analysis for determining the goodness of fit for nested linear models

Example of nested models:

– Model A: Linear model with only variables in X, Z and no interactions

– Model B: Allow interactions only within X and Z

– Model C: Allow interactions between variables in X and Z

Key idea: if model C is statistically superior to model B, the conditional independence and/or parametric assumptions are rejected (see the sketch below)
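A sketch of the nested-model test with statsmodels; the column names, simulated response, and formulas are illustrative, with model B allowing interactions only within X = (x1, x2) and within Z = (z1, z2), and model C adding the cross X-Z interactions:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["x1", "x2", "z1", "z2"])
df["s"] = df.x1 - 0.5 * df.x2 + df.z1 + rng.normal(scale=0.5, size=500)

model_B = smf.ols("s ~ x1 + x2 + z1 + z2 + x1:x2 + z1:z2", data=df).fit()
model_C = smf.ols("s ~ x1 + x2 + z1 + z2 + x1:x2 + z1:z2 + (x1 + x2):(z1 + z2)",
                  data=df).fit()
print(anova_lm(model_B, model_C))  # a small p-value would reject the assumptions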


Some Simulation Results



Wallet Case Study Results

Modeling equations (monetary values on log scale):

    log(w_i) = f(x_i) + c_w + ε_i^w,  ε_i^w ~ N(0, σ²)

    log(s_i) − log(w_i) = g(z_i) + c_s + ε_i^s,  ε_i^s ~ N(0, σ²)

(c_w, c_s are intercepts; f, g are parametric forms)

Data: 2000 IBM customers in the finance sector

ANOVA results are consistent with conditional independence.


Market Alignment Project (MAP): Background

MAP – Objectives:

– Optimize the allocation of the sales force

– Focus on customers with growth potential

– Set evaluation baselines for sales personnel

MAP – Components:

– Web interface with customer information

– Analytical component: wallet estimates

– Workshops with sales personnel to review and correct the wallet predictions

– Shift of resources towards customers with lower wallet share


MAP Tool Captures Expert Feedback from the Client Facing Teams

[Diagram: MAP interview process – all integrated and aligned coverages. Data Integration (transaction data, D&B data) feeds Analytics and Validation (wallet models: predicted opportunity); a web interface supports Insight Delivery and Capture by the MAP interview team and Client Facing Unit (CFU) teams, yielding expert-validated opportunity; Post-processing produces resource assignments.]

The objective here is to use expert feedback (i.e., validated revenue opportunity) from last year’s workshops to evaluate our latest opportunity models


MAP Workshops Overview

Calculated 2005 opportunity using the naive Q-kNN approach

2005 MAP workshops

– Displayed opportunity by brand

– Expert can accept or alter the opportunity

Select 3 brands for evaluation: DB2, Rational, Tivoli

Build ~100 models for each brand using different approaches

Compare expert opportunity to model predictions

– Error measures: absolute, squared

– Scale: original, log, root


Initial Q-kNN Model Used

Distance metric

– Identical industry

– Euclidean distance on size (revenue or employees)

Neighborhood size: 20

Prediction

– Median of the non-zero neighbors

– (alternatives: max, percentile)

Post-processing

– Floor the prediction by the max of the last 3 years’ revenue (a sketch follows the diagram below)

[Diagram: the universe of IBM customers with D&B information, plotted by industry, employees, and revenue; the neighborhood of target company i is its nearest same-industry peers in this space.]
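A sketch of this baseline in pandas; the column names (industry, employees, revenue, rev_2003, rev_2004, rev_2005) are illustrative stand-ins for the D&B and transaction fields:

import numpy as np
import pandas as pd

def q_knn_wallet(df, target, k=20):
    """Median of the non-zero revenues among the k same-industry companies
    closest in size, floored by the target's own max revenue over 2003-05."""
    peers = df[(df.industry == target.industry) & (df.index != target.name)]
    nearest = peers.iloc[(peers.employees - target.employees).abs().argsort()[:k]]
    nonzero = nearest.revenue[nearest.revenue > 0]
    estimate = nonzero.median() if len(nonzero) else 0.0
    return max(estimate, target[["rev_2003", "rev_2004", "rev_2005"]].max())

# Example usage (hypothetical frame): q_knn_wallet(df, df.loc[some_company_id])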


Expert Feedback (Log Scale) to Original Model (DB2)

[Scatter plot: expert feedback vs. model opportunity (MODEL_OPPTY), both on log scale. Experts accept the opportunity (45%), change it (40%: increase 17%, decrease 23%), or reduce it to 0 (15%).]


Observations

Many accounts are set to zero for external reasons

– Excluded from evaluation, since no model can predict this

Exponential distribution of opportunities

– Evaluation on the original (non-log) scale suffers from huge outliers

Experts seem to make percentage adjustments

– Consider log-scale evaluation in addition to the original scale, with root scale as an intermediate

– Suspect a strong “anchoring” bias: 45% of opportunities were not touched


Evaluation Measures

Different scales to avoid outlier artifacts

– Original: e = model - expert

– Root: e = root(model) - root(expert)

– Log: e = log(model) - log(expert)

Statistics on the distribution of the errors

– Mean of e²

– Mean of |e|

Total of 6 criteria (computed in the sketch below)
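A sketch computing all six criteria; the eps guard for zero opportunities is our assumption, since zero-valued accounts were excluded above:

import numpy as np

def six_criteria(model, expert, eps=1.0):
    """Mean of e^2 and mean of |e| on the original, root and log scales."""
    model, expert = np.asarray(model, float), np.asarray(expert, float)
    out = {}
    for name, f in [("orig", lambda v: v),
                    ("root", np.sqrt),
                    ("log", lambda v: np.log(v + eps))]:  # eps guards zeros (assumption)
        e = f(model) - f(expert)
        out["mse_" + name] = float(np.mean(e ** 2))
        out["mae_" + name] = float(np.mean(np.abs(e)))
    return out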


Model Comparison Results

We count how often each model scores within the top 10 / top 20 for each of the 6 measures:

Model                    Rational    DB2        Tivoli
Displayed Model (kNN)    6 / 6       4 / 5      6 / 6    (anchoring)
Max 03-05 Revenue        1 / 1       0 / 3      1 / 4
Linear Quantile 0.8      5 / 6       2 / 4      3 / 5    (best)
Regression Tree          1 / 3       2 / 4      1 / 2
Q-kNN 50 + flooring      2 / 3       6 / 6      4 / 6
Decomposition Center     0 / 0       3 / 5      0 / 4
Quantile Tree 0.8        0 / 1       2 / 4      1 / 4


MAP Experiments: Conclusions

Q-kNN performs very well after flooring but is typically inferior prior to flooring

80th-percentile linear quantile regression performs consistently well (flooring has a minor effect)

Experts are strongly influenced by displayed opportunity (and displayed revenue of previous years)

Models without last year’s revenue don’t perform well

Use Linear Quantile Regression with q=0.8 in MAP 06


MAP Business Impact

MAP launched in 2005

– In 2006, 420 workshops were held worldwide, with teams responsible for most of IBM’s revenue

The most important use is segmentation of the customer base

– Shift resources into “invest” segments with low wallet share

Extensive anecdotal evidence of the process’s success

– E.g., higher growth in “invest” accounts after resource shifts

MAP recognized as 2006 IBM Research Accomplishment

– Awarded based on “proven” business impact


Summary

The wallet estimation problem is practically important and under-researched

Our contributions:

– Propose Wallet definitions: SERVED and REALISTIC

– Offer corresponding modeling approaches:

• Quantile estimation methods

• Graphical latent variable model

– Evaluation on simulated, public and internal data

– Implementation within MAP project

We are interested in extending both the theory and the practice to domains beyond IBM


References

Marketing Science

– R. Rust, K. Lemon and V. Zeithaml, “Return on Marketing: Using Customer Equity to Focus Marketing Strategy”, J. of Marketing, 2004.

– P. Kotler, Marketing Management, Millennium Ed., Prentice-Hall, 2000.

Cost-sensitive Learning

– P. Domingos, MetaCost: A general method for making classifiers cost-sensitive, The Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

– B. Zadrozny, J. Langford and N. Abe, Cost-sensitive learning by cost-proportionate example weighting, IEEE International Conference on Data Mining (ICDM), 2003.

– N. Abe, B. Zadrozny and J. Langford, An Iterative Method for Multi-class Cost-sensitive Learning, The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2004.

MDP and Reinforcement Learning

– R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.

– L. P. Kaelbling, M. L. Littman and A. W. Moore, Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, 1996.


Bayesian Networks and Causal Networks

– K. Murphy, A brief introduction to Bayesian Networks and Graphical Models, http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html

– D. Heckerman, A tutorial on learning with Bayesian Networks, Microsoft Research MSR-TR-95-06, March 1995.

– J. Pearl, Causality: Models, Reasoning, and Inference, Cambridge University Press, 2000.

– P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction, and Search, 2nd Edition (MIT Press), 2000.

Case Study: Lifetime Value Modeling

– N. Abe, N. Verma, C. Apte and R. Schroko, Cross channel optimized marketing by reinforcement learning, The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2004.

– E. Pednault, N. Abe, B. Zadrozny, H. Wang, W. Fan and C. Apte, Sequential cost-sensitive decision making by reinforcement learning, The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2002.


Case Study: Customer Wallet Estimation

– S. Rosset, C. Perlich, B. Zadrozny, S. Merugu, S. Weiss and R. Lawrence, Customer Wallet Estimation. 1st NYU workshop on CRM and Data Mining, 2005.

– S. Merugu, S. Rosset and C. Perlich, A New Multi-View Regression Method with an Application to Customer Wallet Estimation. The Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2006.

– R. Koenker, Quantile Regression, Econometric Society Monograph Series, Cambridge University Press, 2005.


Thank you!

[email protected]
[email protected]