gadatamining cna


Upload: arpita1790

Post on 14-May-2017


TRANSCRIPT

Page 1: GADataMining CNA

Genetic Algorithms for

Data Mining

Sid Bhattacharyya

Page 2: GADataMining CNA

Overview
• Genetic Algorithms: a gentle introduction
  – What are GAs?
  – How do they work, and why?
  – Critical issues
• Using genetic algorithms (effectively)
• Use in data mining

Page 3: GADataMining CNA

Natural Genetics to AI

• Computational models inspired by biological evolution
  – survival of the fittest
  – reproduction through cross-breeding

Page 4: GADataMining CNA

Genetic Algorithms
• Population-based search (parallel)
  – simultaneous search from multiple points in search space
  – population members: potential solutions
• Fitness function (search objective)
  – numerical “figure of merit”/utility measure of an individual
  – selection
• “Mating” and reproduction of individuals
  – crossover, mutation
• Evolution from one generation to the next
  – iterative search, convergence

Page 5: GADataMining CNA

Advantage GAs
• General purpose, robust search technique
  – application to varied problem types
• Data mining
  – fitness function: flexible expression of modeling criteria, tradeoffs amongst multiple objectives
  – models optimized to specific business objectives
  – diverse model representation: linear, non-linear interaction terms, rules, sequences, etc.

Page 6: GADataMining CNA

GA Application Examples
• Function optimizers
  – difficult, discontinuous, multi-modal, noisy functions
• Combinatorial optimization
  – layout of VLSI circuits, factory scheduling, traveling salesman problem
• Design and Control
  – bridge structures, neural networks, communication networks design; control of chemical plants, pipelines
• Machine learning
  – classification rules, economic modeling, scheduling strategies

Portfolio design, optimized trading models, direct marketing models, sequencing of TV advertisements, adaptive agents, data mining, etc.

Page 7: GADataMining CNA

GAs: Basic Principles

• Representation of individuals
  – String of parameters (genes): chromosome
    e.g. F(p,q,r,s,t): p q r s t
  – Bit-string representation (?):
    1 0 0 1 1 0 1 0 1 1 0 1 1 0 0
  – genotype and phenotype

Page 8: GADataMining CNA

GAs: Basic Principles
• Survival of the fittest (Fitness function)
  – numerical “figure of merit”/utility measure of an individual
  – tradeoff amongst multiple evaluation criteria
  – efficient evaluation

Page 9: GADataMining CNA

GAs: Basic Principles

• Reproduction to create offspring
  – Selection
  – Crossover
  – Mutation

Page 10: GADataMining CNA

GAs: Basic Principles

• Convergence
  – progression towards uniformity in population
  – premature convergence? (local optima)

Page 11: GADataMining CNA

GA: Basic Operation

[Figure: one generational step, in three columns]
Generation t: Solution1 (f1), Solution2 (f2), Solution3 (f3), Solution4 (f4), ..., SolutionN (fN)
  → Selection →
Selected parents: Solution1, Solution2, Solution2, Solution4, ..., SolutionX
  → Recombination (Crossover, Mutation) →
Generation t+1: Offspring1(1,4), Offspring2(1,4), Offspring3(2,7), Offspring4(2,7), ..., OffspringN(x,y)

Page 12: GADataMining CNA

GAs: Parallel Search

[Figure: Fitness plotted against x, with search points marked X and a hill climber shown for comparison]

Page 13: GADataMining CNA

Typical GA Run

[Figure: Fitness plotted against Generations, with curves for Best and Average fitness]

Page 14: GADataMining CNA

Operators: Selection

• Fitness-proportionate selection (f_i / f̄: individual fitness over average population fitness)
• number of reproductive trials for individuals

Page 15: GADataMining CNA

Selection
• Roulette-wheel selection (stochastic sampling with replacement)
  – wheel spaced in proportion to fitness values
  – N (pop size) spins of the wheel
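
A minimal sketch of roulette-wheel selection, assuming non-negative fitness values with a positive sum. The function name and the `rng` parameter are illustrative choices, not part of the slides.

```python
import bisect
import random
from itertools import accumulate

def roulette_wheel(fitnesses, n_spins, rng=random):
    """Return indices chosen by n_spins independent spins of a fitness-weighted wheel."""
    cumulative = list(accumulate(fitnesses))  # slot boundaries on the wheel
    total = cumulative[-1]
    picks = []
    for _ in range(n_spins):
        r = rng.random() * total
        # first slot whose upper boundary lies strictly above r
        idx = bisect.bisect_right(cumulative, r)
        picks.append(min(idx, len(cumulative) - 1))
    return picks
```

Each spin is independent, so an individual's actual number of copies can differ noticeably from its expected share, especially in small populations.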

Page 16: GADataMining CNA

Selection
• Stochastic universal sampling
  – N equally spaced pins on wheel
  – single turn of the wheel
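
Stochastic universal sampling can be sketched as follows, again assuming non-negative fitness values with a positive sum. The names are illustrative.

```python
import random
from itertools import accumulate

def stochastic_universal_sampling(fitnesses, n, rng=random):
    """Select n indices using n equally spaced pointers and one random offset."""
    total = sum(fitnesses)
    step = total / n                 # distance between adjacent pins
    start = rng.random() * step      # the single "turn of the wheel"
    cumulative = list(accumulate(fitnesses))
    picks, i = [], 0
    for k in range(n):
        pointer = start + k * step
        # advance to the slot that contains this pointer
        while i < len(cumulative) - 1 and cumulative[i] <= pointer:
            i += 1
        picks.append(i)
    return picks
```

With one random draw instead of N, each individual receives a number of copies within one of its expected value, which reduces the sampling noise of repeated spins.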

Page 17: GADataMining CNA

Selection
• Premature convergence
• Fitness scaling
  f' = f - (2*avg. - max.)
• Ranked fitness
• Elitism
• Steady-state selection
• Demetic grouping

Page 18: GADataMining CNA

Operators: Crossover

Parent 1:    11010 | 101100101
Parent 2:    xxyxx | yxyyxxyxy
                   ^ crossover site
Offspring 1: 11010 | yxyyxxyxy
Offspring 2: xxyxx | 101100101

(Single-pt. crossover)

• combining good building blocks
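
The single-point example above can be reproduced with a short helper. The function name and the `site` argument are illustrative.

```python
def single_point_crossover(parent1, parent2, site):
    """Swap the tails of two equal-length strings after position `site`."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have equal length")
    child1 = parent1[:site] + parent2[site:]
    child2 = parent2[:site] + parent1[site:]
    return child1, child2

# Parents from the slide, with the crossover site after the fifth gene
c1, c2 = single_point_crossover("11010101100101", "xxyxxyxyyxxyxy", site=5)
```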

Page 19: GADataMining CNA

Crossover

Parent 1:    axpsqvqbtpihd
Parent 2:    qzxxaycgbtphw
(crossover sites at multiple positions)
Offspring 1: azpsavcbtpphd
Offspring 2: qxxxqyqgbtihw

(Uniform crossover)
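
In uniform crossover each gene position is inherited from one parent or the other independently. The sketch below takes an optional mask; the mask string used in the example is an assumption, chosen as one mask consistent with the offspring shown on the slide.

```python
import random

def uniform_crossover(p1, p2, mask=None, rng=random):
    """mask[i] True: child1 takes gene i from p1 and child2 from p2; False: the reverse."""
    if mask is None:
        mask = [rng.random() < 0.5 for _ in p1]  # fair coin per gene
    c1 = "".join(a if m else b for a, b, m in zip(p1, p2, mask))
    c2 = "".join(b if m else a for a, b, m in zip(p1, p2, mask))
    return c1, c2

# Hypothetical mask that reproduces the slide's offspring
mask = [ch == "1" for ch in "1011010111011"]
o1, o2 = uniform_crossover("axpsqvqbtpihd", "qzxxaycgbtphw", mask)
```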

Page 20: GADataMining CNA

Crossover

[Figure: Fitness plotted against x, with parents and offspring marked X]

Page 21: GADataMining CNA

Operators: Mutation

• alters each gene with small probability

Before: x 1 y x 0 y 0 y y 0 x y x y
After:  x 1 y x 0 y 1 y y 0 x x x y
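
A per-gene mutation operator might look like this. The `alphabet` parameter and the default probability are assumptions for illustration; the slide does not specify either.

```python
import random

def mutate(chromosome, alphabet, p_mut=0.05, rng=random):
    """Replace each gene, with probability p_mut, by a different symbol from alphabet."""
    out = []
    for gene in chromosome:
        if rng.random() < p_mut:
            alternatives = [a for a in alphabet if a != gene]
            out.append(rng.choice(alternatives))
        else:
            out.append(gene)
    return out
```

For a binary alphabet this reduces to the familiar bit flip.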

Page 22: GADataMining CNA

Recombination operators

• Mutation & premature convergence
• Mutation vs. Crossover
  – operator probabilities
  – which is more important?
• Optimal parameter settings (!)

Page 23: GADataMining CNA

Non-Binary Representations
• Integer, real-number, order-based, rules, ...
• Binary or real-valued?
  – real representations give faster, more consistent, more accurate results
• High-level representation
  – intuitive, can utilize specialized crossover and mutation
  – effective search over complex spaces
  – design of representation and operators: forma theory

Page 24: GADataMining CNA

Real-valued representation

Parent1:    3.45  0.56  6.78  0.976  2.5
Parent2:    0.98  1.06  4.20  0.34   1.8

Offspring1: 3.22  0.56  6.78  0.65   2.12
Offspring2: 1.43  1.06  4.20  0.41   1.93

(Arithmetic crossover)
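
One common form of arithmetic crossover blends every gene with a single weight `alpha`. This is a sketch of that whole-vector variant, which is an assumption: the offspring on the slide blend only some genes and copy others, so they are not produced by this exact formula.

```python
def arithmetic_crossover(p1, p2, alpha):
    """Children are convex combinations of the parents, gene by gene."""
    c1 = [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]
    c2 = [(1 - alpha) * a + alpha * b for a, b in zip(p1, p2)]
    return c1, c2

parent1 = [3.45, 0.56, 6.78, 0.976, 2.5]
parent2 = [0.98, 1.06, 4.20, 0.34, 1.8]
child1, child2 = arithmetic_crossover(parent1, parent2, alpha=0.5)
```

With `alpha=0.5` both children coincide at the midpoint of the parents; values nearer 0 or 1 keep each child closer to one parent.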

Page 25: GADataMining CNA

High-level representation

Parent1:    {(1.2 ≤ x1 ≤ 3.4) ∧ (5.8 ≤ x2 ≤ 6.0) ∧ (0.2 ≤ x7 ≤ 0.61)}
Parent2:    {(2.3 ≤ x6 ≤ 4.1) ∧ (3.6 ≤ x2 ≤ 5.1) ∧ (5.1 ≤ x4 ≤ 5.61) ∧ (0.3 ≤ x3 ≤ 1.1) ∧ (2.2 ≤ x9 ≤ 2.7)}

Offspring1: {(1.2 ≤ x1 ≤ 3.4) ∧ (2.2 ≤ x9 ≤ 2.7) ∧ (5.1 ≤ x4 ≤ 5.61)}
Offspring2: {(2.3 ≤ x6 ≤ 4.1) ∧ [(3.6 ≤ x2 ≤ 5.1) ∨ (5.8 ≤ x2 ≤ 6.0)] ∧ (0.3 ≤ x3 ≤ 1.1) ∧ (0.2 ≤ x7 ≤ 0.61)}

Page 26: GADataMining CNA

High-level representation

{(0.3 ≤ x3 ≤ 1.1) ∧ (2.2 ≤ x9 ≤ 2.7)}
{(0.3 ≤ x3 ≤ 1.1) ∧ (2.2 ≤ x9 ≤ 2.7) ∧ (5.1 ≤ x4 ≤ 6.2)}

• Generalize/Specialize

{(0.3 ≤ x3 ≤ 1.1) ∧ (2.2 ≤ x9 ≤ 2.7)}
{(0.45 ≤ x3 ≤ 0.9) ∧ (1.9 ≤ x9 ≤ 2.9)}

Page 27: GADataMining CNA

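Tree-structured representation (GP)

[Figure: two expression trees]
Tree 1: root "/" with children "*" and 5; "*" has children x and log(y). Expression: (x * log(y)) / 5
Tree 2: root "if" with condition AND(y < 7, x > 2), then-branch 0, else-branch "+" with children (2 * x) and y. Expression: If (y<7) and (x>2) then 0, else 2x+y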
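
The two trees can be written as nested tuples and evaluated recursively. The tuple encoding and the `evaluate` function are illustrative assumptions, not the representation used by any particular GP system.

```python
import math
import operator

BINARY = {
    "+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv,
    "<": operator.lt, ">": operator.gt, "and": lambda a, b: a and b,
}

def evaluate(node, env):
    """Evaluate a tree: strings are variables, numbers are constants, tuples are (op, *args)."""
    if isinstance(node, str):
        return env[node]
    if isinstance(node, (int, float)):
        return node
    op, *args = node
    if op == "log":
        return math.log(evaluate(args[0], env))
    if op == "if":
        cond, then_branch, else_branch = args
        return evaluate(then_branch if evaluate(cond, env) else else_branch, env)
    left, right = (evaluate(a, env) for a in args)
    return BINARY[op](left, right)

tree1 = ("/", ("*", "x", ("log", "y")), 5)
tree2 = ("if", ("and", ("<", "y", 7), (">", "x", 2)), 0, ("+", ("*", 2, "x"), "y"))
```

Internal nodes come from the function set and leaves from the terminal set, so crossover can swap subtrees between two such tuples while keeping them well-formed.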

Page 28: GADataMining CNA

Genetic search: Issues

• Coding scheme, fitness function critical
  – General mechanism so robust that, within reasonable margins, parameter settings are not critical.
  – exploiting problem-specific knowledge
  – the “art” in GA design!

Page 29: GADataMining CNA

Genetic search: Issues
• Stochastic search
  – multiple runs with different random streams
• Exploration vs. exploitation of search
• Does not guarantee optimality! But ….
• Structured population models
• Parallelizable for large data

Page 30: GADataMining CNA

GAs and Optimization
• Search space: representation
• Global search without gradient information
  – functions with multiple local optima
  – non-differentiable functions
• Robust, assumption-free, and very general
• Hybrid approaches: GAs with conventional optimization techniques

Page 31: GADataMining CNA

Using GAs ?

• When to use a GA?
• GA and traditional techniques
• How long does it take?
• Will it perform better?

Page 32: GADataMining CNA

Using GAs

• population size
• mutation, crossover rates
• how many generations
• multiple runs

Page 33: GADataMining CNA

Is it a “black-box”?

• Data characteristics
• Fitness function
• GA parameters

Page 34: GADataMining CNA

GA Application Examples
• Function optimizers
  – difficult, discontinuous, multimodal, noisy functions
• Combinatorial optimization
  – layout of VLSI circuits, factory scheduling
• Design and Control
  – bridge structures, neural networks, communication networks design; control of chemical plants, pipelines
• Machine learning
  – classification rules, economic modeling, scheduling strategies

Portfolio design, optimized trading models, direct marketing models, sequencing of TV advertisements, adaptive agents, data mining, etc.

Page 35: GADataMining CNA

GAs and Data Mining

• Discovery
• Prediction
• Hypothesis testing and refinement

Page 36: GADataMining CNA

Data Mining
• Pattern templates
  ([attribute in {v1,v2}] and [attribute=value]) or
  ([attribute in {v1,v2,v3}] and [attribute>value]) or …
• when S, if C then P
  when region=ne
  if inc > 41K and child > 2
  then x-sales > 100
• when S, C and P are positively correlated
• the mean of A when S and C, is significantly different from the mean of A when S

[Figure: diagram of subset S with regions C and P]

Page 37: GADataMining CNA

Data mining

• How good are the patterns
  – accuracy = (# cases in C and P) / (# cases in C)
  – coverage = (# cases in C and P) / (# cases in P)
  – support = (# cases in C) / (# cases in S)
• Understandability
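
The three pattern measures can be computed directly from counts. The sketch below assumes all counts are taken within the subset S; the toy records and the field names (`region`, `inc`, `child`, `xsales`) are hypothetical, modeled on the "when region=ne" rule from the pattern-template slide.

```python
def rule_quality(records, in_s, cond_c, pred_p):
    """Accuracy, coverage and support of 'when S, if C then P' on a list of records."""
    s = [r for r in records if in_s(r)]
    c = [r for r in s if cond_c(r)]
    p = [r for r in s if pred_p(r)]
    cp = [r for r in c if pred_p(r)]
    return {
        "accuracy": len(cp) / len(c) if c else 0.0,  # cases in C and P / cases in C
        "coverage": len(cp) / len(p) if p else 0.0,  # cases in C and P / cases in P
        "support": len(c) / len(s) if s else 0.0,    # cases in C / cases in S
    }

# Hypothetical toy data (income in $K)
records = [
    {"region": "ne", "inc": 50, "child": 3, "xsales": 150},
    {"region": "ne", "inc": 45, "child": 3, "xsales": 80},
    {"region": "ne", "inc": 30, "child": 1, "xsales": 120},
    {"region": "ne", "inc": 30, "child": 0, "xsales": 50},
    {"region": "sw", "inc": 60, "child": 4, "xsales": 200},
]
quality = rule_quality(
    records,
    in_s=lambda r: r["region"] == "ne",
    cond_c=lambda r: r["inc"] > 41 and r["child"] > 2,
    pred_p=lambda r: r["xsales"] > 100,
)
```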

Page 38: GADataMining CNA

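GA for Data Mining
• Fitness evaluation

Contingency table within S:

  S        P      ¬P     total
  C        n11    n12    r1 = n11 + n12
  ¬C       n21    n22    r2 = n21 + n22
  total    c1     c2     n

Expected values: \( e_{ij} = \frac{r_i c_j}{n} \)

Chi-square: \( \chi^2 = \sum_i \sum_j \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \)
  – higher values imply C and P are related

Cramer's V: \( V = \sqrt{\chi^2 / |S|} \)

Correlation
• linear correlation: product-moment corr. coefficient
• monotonically correlated: Spearman’s rank corr. coeff.
• Correlation coefficient x support

Interesting rule: [formula not legible in the transcript; the surviving terms are I with subscripts PCS, SP, SC and S]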
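
For a 2x2 table of C against P within S, the chi-square statistic and Cramer's V follow from the cell counts alone. This sketch uses the standard definitions and assumes every row and column total is positive; the function name is illustrative.

```python
import math

def chi_square_2x2(n11, n12, n21, n22):
    """Return (chi-square, Cramer's V) for a 2x2 contingency table."""
    observed = [[n11, n12], [n21, n22]]
    rows = [n11 + n12, n21 + n22]   # r1, r2
    cols = [n11 + n21, n12 + n22]   # c1, c2
    n = sum(rows)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n   # e_ij = r_i * c_j / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For a 2x2 table, V = sqrt(chi2 / n)
    return chi2, math.sqrt(chi2 / n)
```

V lies between 0 (C and P independent) and 1 (perfect association), which makes it easier to compare across subsets S of different size than the raw chi-square value.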

Page 39: GADataMining CNA

DM application
• Symbolic models of consumer choice
  – assumption-free
  – behavioral insights for targeting promotions
  – advantage over decision tree algorithms?
    • DTs are stepwise optimal, but not globally so
    • high noise-sensitivity of DTs
  – advantages over neural networks

{(35K ≤ inc ≤ 40K) ∧ (age < 43) or (inc > 63K) ∧ (age > 55)} then Buy

Page 40: GADataMining CNA

Performance evaluation
• Accuracy/Error rate
  – will higher accuracy give better performance for the target task?

“The use of error rate often suggests insufficiently careful thought about the real objectives of the research” – David J. Hand, Construction and Assessment of Classification Rules.

                 Actual P    Actual N
  Predicted P    True P      False P
  Predicted N    False N     True N

• sensitivity, specificity
• misclassification costs
• Of course, with a 99:1 split in the data, a default dummy model gives 99% accuracy.
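
The 99:1 remark is easy to check numerically. The sketch below uses the standard definitions of accuracy, sensitivity and specificity; the function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# 99:1 split; the dummy model predicts N for every case
dummy = classification_metrics(tp=0, fp=0, fn=1, tn=99)
```

The dummy model reaches 99% accuracy while its sensitivity is 0: it never finds a single positive case.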

Page 41: GADataMining CNA

Model Representation

• Non-linear tree-structured models (GP)
  – Non-linear interaction terms
  – Function set: internal nodes
    • {+,-,*,/,log}
  – Terminal set: leaf nodes
    • {constants, variables}

[Figure: expression tree with root "/" and children "*" and 5; "*" has children x1 and log(x3). Expression: (x1 * log(x3)) / 5]

Page 42: GADataMining CNA

DM Performance: Decile Analysis

Decile   Number of   Number of   Response   Cumulative   Cumulative          Cumulative
         Customers   Responses   Rate (%)   Responses    Response Rate (%)   Response Lift
top      2500        2179        87.2       2179         87.2                447
2        2500        1753        70.1       3932         78.6                403
3        2500        396         15.8       4328         57.7                296
4        2500        111         4.4        4439         44.4                228
5        2500        110         4.4        4549         36.4                187
6        2500        85          3.4        4634         30.9                158
7        2500        67          2.7        4701         26.9                138
8        2500        69          2.8        4770         23.9                122
9        2500        49          2.0        4819         21.4                110
bottom   2500        55          2.2        4874         19.5                100
Total    25,000      4874        19.5

\( \text{Cumulative Lift (decile)} = \frac{\text{cum. avg. performance at decile}}{\text{overall avg. performance}} \times 100 \)
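
The cumulative lift column can be recomputed from the response counts in the table. This is a sketch assuming equal-sized deciles; the function name is illustrative.

```python
def decile_lifts(responses, customers_per_decile):
    """Cumulative response lift per decile: 100 * cumulative rate / overall rate."""
    overall_rate = sum(responses) / (customers_per_decile * len(responses))
    lifts, cumulative = [], 0
    for k, r in enumerate(responses, start=1):
        cumulative += r
        cumulative_rate = cumulative / (customers_per_decile * k)
        lifts.append(round(100 * cumulative_rate / overall_rate))
    return lifts

# Response counts from the decile analysis table, top decile first
responses = [2179, 1753, 396, 111, 110, 85, 67, 69, 49, 55]
lifts = decile_lifts(responses, customers_per_decile=2500)
```

A lift of 447 in the top decile means that mailing only that decile yields about 4.5 times the response rate of mailing the whole file.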

Page 43: GADataMining CNA

Decile Maximization (DMAX)
• Objective
  Find model f(x) (predictor variables x) such that performance in upper deciles (specified depth-of-file) is maximized
• Explicitly manages resource constraint
  – mailings to particular depths-of-file
• Performance at different mailing depths
  – models optimized for different mailing depths

[Table sketch: Number of Responders/Profit by decile, with the top, 2nd and 3rd deciles marked "max" and deciles 4 through bottom unmarked]

Page 44: GADataMining CNA

DMAX: Illustrative Example

[Figure: scatter plot of ten customers on the X1 and X2 axes, each labeled with its profit ($1 to $10), with two fitted lines labeled OLS ($28) and DMAX 40% ($32)]

OLS:      .14 X1 + .06 X2
DMAX 40%: .19 X1 + .07 X2

Profit   X1   X2
$10      45    5
$9       35   21
$8       31   38
$7       30   30
$6        6   10
$5       45   37
$4       30   10
$3       23   30
$2       16   13
$1       12   30

Page 45: GADataMining CNA

GA DMAX
• Representation: w1 w2 w3 .. wk
• Integrated variable selection
• Fitness evaluation
  – classification accuracy
  – model reliability
  – maximize specified decile performance
    • response, profit, etc.
• Hybrid algorithm
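
A decile-based fitness function for a weight-vector chromosome can be sketched as below: score every record with the linear model, sort by score, and total the outcome in the top fraction of the file. The function and its arguments are illustrative assumptions; the data are the ten records from the DMAX illustrative example, and the weights are that slide's DMAX 40% coefficients.

```python
def dmax_fitness(weights, X, y, depth):
    """Sum of outcomes y over the top `depth` fraction of records ranked by w . x."""
    scores = [sum(w * x for w, x in zip(weights, row)) for row in X]
    order = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    n_top = max(1, round(depth * len(X)))
    return sum(y[i] for i in order[:n_top])

# (X1, X2) and profit for the ten records of the illustrative example
X = [(45, 5), (35, 21), (31, 38), (30, 30), (6, 10),
     (45, 37), (30, 10), (23, 30), (16, 13), (12, 30)]
profit = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

top40_profit = dmax_fitness((0.19, 0.07), X, profit, depth=0.4)
```

With these weights the top four records carry a total profit of $32, the value shown for DMAX 40% on the example slide. A GA would use such a function as the fitness of each candidate weight vector.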

Page 46: GADataMining CNA

Comparative Performance: Case I

• Response modeling
  – maximize response in top 3 deciles
  – 4.6% response to mailing

DMAX (30%): - 0.01X1 - 2.51X2 - 0.008X3 - 0.08X4

LOGIT: - 0.40 - 0.01X2 - 0.007X3 - 3.25X4

Neural Network: 3 layers, 2 hidden nodes, 12 coefficients

Page 47: GADataMining CNA

Case I: Genetic Algorithm DMAX (30%)

Decile   Number of   Number of   Decile          Cum             Cum
         Customers   Responses   Response Rate   Response Rate   Response Lift
top      4,617       865         18.7%           18.7%           411
2        4,617       382         8.3%            13.5%           296
3        4,617       290         6.3%            11.1%           244
4        4,617       128         2.8%            9.0%            198
5        4,617       97          2.1%            7.6%            167
6        4,617       81          1.8%            6.7%            146
7        4,617       79          1.7%            5.9%            130
8        4,617       72          1.6%            5.4%            118
9        4,617       67          1.5%            5.0%            109
bottom   4,617       43          0.9%            4.6%            100
TOTAL    46,170      2,104       4.6%

Page 48: GADataMining CNA

Case I:Cum Response Lift Comparison

Decile   Genetic Algorithm   Logistic     Neural
         DMAX (30%)          Regression   Network
top      411                 384          385
2        296                 284          277
3        244                 227          221
4        198                 194          186
5        167                 166          164
6        146                 146          146
7        130                 131          131
8        118                 119          118
9        109                 108          108
bottom   100                 100          100

Page 49: GADataMining CNA

Case II: 2% Response Rate
Cum Response Lift Comparison

Decile   GA DMAX   GA DMAX   GA DMAX   GA DMAX   Logistic
         (10%)     (20%)     (30%)     (40%)     Regression
1        220       186       191       192       194
2        174       195       166       166       165
3        157       173       179       150       148
4        148       158       158       161       154*
5        139       145       146       146       146
6        131       135       138       138       138
7        122       124       127       127       127
8        114       116       117       117       117
9        108       108       109       109       109
bottom   100       100       100       100       100

Page 50: GADataMining CNA

Case II: 2% Response Rate
Smoothness: Logistic Regression

Decile   Number of   Number of   Decile          Cum             Cum
         Customers   Responses   Response Rate   Response Rate   Response Lift
top      7,203       283         3.9%            3.9%            194
2        7,220       200         2.8%            3.3%            165
3        7,225       165         2.3%            3.0%            148
4        7,215       255*        3.5%            3.1%            154*
5        7,227       167         2.3%            3.0%            146
6        7,220       140         1.9%            2.8%            138
7        7,209       89          1.2%            2.6%            127
8        7,228       68          0.9%            2.4%            117
9        7,205       65          0.9%            2.2%            109
bottom   7,232       32          0.4%            2.0%            100
TOTAL    72,184      1,464       2.0%

Page 51: GADataMining CNA

Case II: 2% Response Rate
Smoothness: GA DMAX (10%)

Decile   Number of   Number of   Decile          Cum             Cum
         Customers   Responses   Response Rate   Response Rate   Response Lift
top      7,203       322         4.5%            4.5%            220
2        7,220       188         2.6%            3.5%            174
3        7,225       178         2.5%            3.2%            157
4        7,215       178         2.5%            3.0%            148
5        7,227       151         2.1%            2.8%            139
6        7,220       133         1.8%            2.7%            131
7        7,209       103         1.4%            2.5%            122
8        7,228       84          1.2%            2.3%            114
9        7,205       81          1.1%            2.2%            108
bottom   7,232       46          0.6%            2.0%            100
TOTAL    72,184      1,464       2.0%

Page 52: GADataMining CNA

Case II: 2% Response Rate
Smoothness: GA DMAX (20%)

Decile   Number of   Number of   Decile          Cum             Cum
         Customers   Responses   Response Rate   Response Rate   Response Lift
top      7,203       271         3.8%            3.8%            186
2        7,220       299*        4.1%            4.0%            195*
3        7,225       191         2.6%            3.5%            173
4        7,215       162         2.2%            3.2%            158
5        7,227       140         1.9%            2.9%            145
6        7,220       119         1.8%            2.7%            135
7        7,209       90          1.2%            2.5%            124
8        7,228       85          1.2%            2.3%            116
9        7,205       69          1.0%            2.2%            108
bottom   7,232       38          0.5%            2.0%            100
TOTAL    72,184      1,464       2.0%

Page 53: GADataMining CNA

Comparative Performance: Case III

Profit modeling
  – maximize profit in top 2 deciles
  – mailing (profit / share of mailing)
    » Non-responder: -$0.29 / 92.55%
    » Unpaid responder: -$5.65 / 7.10%
    » Paid responder: +$275 / 0.35%

Average profit for mailing: +$0.32

DMAX (20%): - .36X1 - .23X2 + .005X3 + .24X4

LOGIT(PR): - .01X1 - .03X2 + .322X3 + .25X4

Page 54: GADataMining CNA

Case IV: Profit Model
Genetic Algorithm DMAX (20%)

Decile   Number of   Percent PAID   Percent UNPAID   Decile           Cum              Cum
         Customers   Responders     Responders       Average Profit   Average Profit   Profit Lift
top      8,171       0.82%          10.1%            $1.43            $1.43            444
2        8,171       0.62%          8.7%             $0.96            $1.20            371
3        8,171       0.37%          8.2%             $0.28            $0.89            277
4        8,171       0.34%          8.4%             $0.20            $0.72            223
5        8,171       0.29%          5.9%             $0.20            $0.62            191
6        8,171       0.32%          7.4%             $0.19            $0.54            169
7        8,171       0.23%          4.0%             $0.13            $0.49            151
8        8,171       0.18%          4.8%             -$0.04           $0.42            130
9        8,171       0.24%          8.3%             -$0.06           $0.37            114
bottom   8,171       0.17%          4.9%             -$0.08           $0.32            100
TOTAL    81,710      0.35%          7.1%

Page 55: GADataMining CNA

Case IV: Profit Model
Cum Profit Lift Comparison

Decile   Genetic Algorithm   Logistic
         DMAX (20%)          Regression
top      444                 385
2        371                 294
3        277                 235
4        223                 190
5        191                 184
6        169                 163
7        151                 146
8        130                 123
9        114                 111
bottom   100                 100

Page 56: GADataMining CNA

Modeling on Multiple Objectives
• Model [y1,..,yk] = f(x)
  – simultaneously optimize on multiple objectives
• Some common DM modeling desirables
  – response and high purchase revenues
  – likely churners with high usage of services
  – high tenure and usage
  – purchase and non-return
  – cross-selling, etc.

[or CPR (Combined Profit and Response) Models]

Page 57: GADataMining CNA

Multiple objectives
• Traditional approaches
  – multiple single-objective models, and combine
  – weighted average of objectives
• conflicting objectives
  – different levels of tradeoffs
• frontier of non-dominated solutions
  – choice of final model based on diverse decision-maker objectives, can also be subjective

Page 58: GADataMining CNA

Pareto Frontier

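• Non-dominated solutions
  – multiple objectives π_i; f^a(x) is better than f^b(x) if
    \( \forall i: \pi_i(f^a(\mathbf{x})) \ge \pi_i(f^b(\mathbf{x})) \) and \( \exists j: \pi_j(f^a(\mathbf{x})) > \pi_j(f^b(\mathbf{x})) \)
• Single GA run obtains
  – tradeoff frontier of non-dominated solutions f^k(x)

[Figure: models plotted on objectives π1 and π2, distinguishing non-dominated models from dominated models]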
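
The dominance test and the non-dominated set can be written in a few lines, assuming every objective is to be maximized. The function names are illustrative; the example pairs are the Decile 1 (Churn-Lift, $-Lift) values from the performance summary table later in the deck.

```python
def dominates(a, b):
    """a dominates b: at least as good on every objective, strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(points):
    """Keep the points that no other point dominates (the Pareto frontier)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (Churn-Lift, $-Lift) at Decile 1
models = {
    "GA-best": (304.9, 261.7),
    "GP-best": (343.7, 256.5),
    "Logistic": (447.1, 111.8),
    "OLS": (116.2, 360.5),
    "OLS * Logistic": (79, 357),
}
frontier = non_dominated(list(models.values()))
```

Here only the OLS * Logistic combination is dominated (by OLS, which is better on both lifts); the other four represent different tradeoffs between the two objectives.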

Page 59: GADataMining CNA

Multi-objective GA

• Pareto-Based Selection (Louis and Rawlins, ‘93)
  – randomly select a pair of solutions from population
  – generate two new “offspring”
  – determine the Pareto-optimal set from parents and offspring, and choose two solutions for new population
• Elitism
  – retain best solution intact in next population
    • fosters local search around best solution
  – retain non-dominated set of solutions intact in next generation

Page 60: GADataMining CNA

Fitness evaluation• DMAX approach

– fitness at specified depth-of-file d

Page 61: GADataMining CNA

Experimental Study: Data

• Cellular-phone provider seeking to identify potential high-value churners
  – two dependent variables
    • binary Churn variable
    • continuous variable measuring revenue ($)
  – predictors: minutes-of-use (peak and off-peak), average charges, payment information, etc.
    • obtained after EDA, normalized to mean 0 and s.d. 1
  – sample of 50,000: 25,000 for training, 25,000 for test set

Page 62: GADataMining CNA

Multiple Objectives: Performance

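• Churn lift: \( \frac{C_d / N_d}{C / N} \)
  – model capturing more churners in top deciles is better
• $-Lift: \( \frac{R_d / N_d}{R / N} \)
  – model giving high revenue customers in upper deciles is better
• overall modeling objective
  – maximize expected revenue saved through identification of high-value churners
  – Churn-Lift * $-Lift

(Subscript d denotes quantities within the top d deciles; C, R and N without subscript are the totals over the whole file.)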

Page 63: GADataMining CNA

Experimental Study
Non-dominated models: Decile 1 (Training)

[Figure: $-Lift (vertical) against Churn-Lift (horizontal) for Decile 1, training data; series: GP, GA, Logistic, OLS]

5 independent GA runs, aggregate the sets of non-dominated solutions

Page 64: GADataMining CNA

Experimental Study
Non-dominated models: Decile 1 (Test)

[Figure: $-Lift (vertical) against Churn-Lift (horizontal) for Decile 1, test data; series: GP, GA, Logistic, OLS]

Page 65: GADataMining CNA

Experimental Study
Non-dominated models: Decile 2 (Test)

[Figure: $-Lift (vertical) against Churn-Lift (horizontal) for Decile 2, test data; series: GP, GA, Logistic, OLS]

Page 66: GADataMining CNA

Experimental Study
Non-dominated models: Decile 3 (Test)

[Figure: $-Lift (vertical) against Churn-Lift (horizontal) for Decile 3, test data; series: GP, GA, Logistic, OLS]

Page 67: GADataMining CNA

Experimental Study
Non-dominated models: Decile 7 (Test)

[Figure: $-Lift (vertical) against Churn-Lift (horizontal) for Decile 7, test data; series: GP, GA, Logistic, OLS]

Page 68: GADataMining CNA

Experimental Study

Performance Summary

Model                 Performance          Decile 1       Decile 2       Decile 3       Decile 7
GA-best               Churn-Lift, $-Lift   304.9, 261.7   265.4, 207.4   272.3, 155.0   138.8, 126.9
                      Product of Lifts     797.8          550.4          422.2          176.1
GP-best               Churn-Lift, $-Lift   343.7, 256.5   343.5, 182.1   275.1, 178.3   139.4, 131.2
                      Product of Lifts     881.5          625.5          490.4          182.9
Logistic Regression   Churn-Lift, $-Lift   447.1, 111.8   403.4, 72.6    295.9, 57.4    137.8, 66.7
                      Product of Lifts     499.8          292.7          169.96         91.9
OLS Regression        Churn-Lift, $-Lift   116.2, 360.5   108.1, 271.7   99.7, 223.2    91.8, 136.2
                      Product of Lifts     418.8          293.71         222.5          125.1
OLS * Logistic        Churn-Lift, $-Lift   79, 357        76, 263        74, 217        78, 136
                      Product of Lifts     282            201            160            106

Page 69: GADataMining CNA

General Optimization of Lifts• Fitness function

– Seeks a general maximization of lifts at all deciles

Page 70: GADataMining CNA

Specific vs. General Lift Opt

Model                 Performance          Decile 1       Decile 2       Decile 3       Decile 7
GA-best Lift-Opt      $-Lift, Churn-Lift   304.9, 261.7   265.4, 207.4   272.3, 155.0   138.8, 126.9
                      Product of Lifts     797.8          550.4          422.2          176.1
GA-best General-Opt   $-Lift, Churn-Lift   303.2, 261     288.3, 188.8   276.7, 151.3   138.1, 104.5
                      Product of Lifts     791.4          544.3          418.6          144.3
GP-best Lift-Opt      Churn-Lift, $-Lift   343.7, 256.5   343.5, 182.1   275.1, 178.3   139.4, 131.2
                      Product of Lifts     881.5          625.5          490.4          182.9
GP-best General-Opt   Churn-Lift, $-Lift   332, 252.5     265, 223.1     233.9, 186.5   132.3, 133.1
                      Product of Lifts     838.3          591.2          436.2          176.1

Table: Best Prod-Lifts in Deciles

Page 71: GADataMining CNA

Specific vs. General Lift Opt.

                      Decile 1              Decile 2              Decile 3              Decile 7
Performance           $-Lift   Churn-Lift   $-Lift   Churn-Lift   $-Lift   Churn-Lift   $-Lift   Churn-Lift
GA-best Lift-Opt      361.4    464.7        271.6    401.3        223.9    309.8        136.6    139.5
GA-best General-Opt   361.7    421          273.3    398.1        223.9    304.1        136.6    138.4
GA-best Lift-Opt      372.7    475.2        276.5    417.9        226.1    310.3        137.2    139.8
GA-best General Opt   372.1    421.3        276.8    378.3        226.6    296.7        137.1    139.8

Table: Best $-Lift and Churn-Lifts in Deciles

Page 72: GADataMining CNA

Case Study – “EC challenge”
EDA, Variable-selection
• Problem
  – 15,178 obs., 79 variables, “response” dependent
  – Seeking maximum lift in the top decile
  – Logistic regression model
    • 15 variables, after EDA, transformation (many of them combinations of multiple vars.). This is the hard part!
    • Lift of 126 in the top decile
• EC approach
  – Include all variables
  – Explore simple “terms”: non-linear GP models
    • small populations, looking for robust terms
  – Final model(s) using obtained terms

Page 73: GADataMining CNA

Case Study – “EC challenge”
• Various 2-5 var. terms show some predictability
  – Lifts ranging in 122-127
• Models on these terms
  – Non-linear, Linear model: lifts in 126-132
• Examples
  – 3 tan(HC211) + EC31
    Trg: 122.5 Test: 122.5
  – (OCC81 - log10(ORDTERM1/IC191))*STATE2*HHAS21
    Trg: 124.9 Test: 126.4
  – STATE2 * HHAS21
    Trg: 121.3 Test: 121.3
  – (OCC81 - log10(B)) * B * (A + B + (ORDTERM1 * (A + B)))
    where A = (STATE2 - SECGENDE) and B = STATE2*HHAS21
    Trg: 131.5 Test: 126.9
  – B + tan(2B + HHAS21) + EC31 + (ORDTERM1)*(B + tan[B + HHAS21 + ((HHAS21*HV31)/2.1)])
    Trg: 131.1 Test: 127.8
  – AB^3 (1 + OCC81) + AB(OCC81) + 2DEB(OCC81)^2
    Trg: 134.4 Test: 131.6
  – 4A + B + C + 2D + E + 2*OCC81 (10 vars. total)
    Trg: 132.5 Test: 131.7