KDD Cup 2009: Fast Scoring on a Large Database
Presentation of the Results at the KDD Cup Workshop, June 28, 2009
The Organizing Team


Page 1:

KDD Cup 2009

Fast Scoring on a Large Database
Presentation of the Results at the KDD Cup Workshop

June 28, 2009

The Organizing Team

Page 2:

KDD Cup 2009 Organizing Team

Project team at Orange Labs R&D:
• Vincent Lemaire
• Marc Boullé
• Fabrice Clérot
• Raphaël Féraud
• Aurélie Le Cam
• Pascal Gouzien

Beta testing and proceedings editor:
• Gideon Dror

Web site design:
• Olivier Guyon (MisterP.net, France)

Coordination (KDD cup co-chairs):
• Isabelle Guyon
• David Vogel

Page 3:

Thanks to our sponsors…

Orange

ACM SIGKDD

Pascal

Unipen

Google

Health Discovery Corp

Clopinet

Data Mining Solutions

MPS

Page 4:

KDD Cup Participation By Year

[Figure: bar chart of the number of participating teams per KDD Cup, 1997–2009]

Year   # Teams
1997   45
1998   57
1999   24
2000   31
2001   136
2002   18
2003   57
2004   102
2005   37
2006   68
2007   95
2008   128
2009   453

Record KDD Cup Participation

Page 5:

Participation Statistics

1299 registered teams

7865 entries

46 countries:

Argentina, Australia, Austria, Belgium, Brazil, Bulgaria, Canada, Chile, China, Fiji, Finland, France, Germany, Greece, Hong Kong, Hungary, India, Iran, Ireland, Israel, Italy, Japan, Jordan, Latvia, Malaysia, Mexico, Netherlands, New Zealand, Pakistan, Portugal, Romania, Russian Federation, Singapore, Slovak Republic, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Taiwan, Turkey, Uganda, United Kingdom, United States, Uruguay

Page 6:

A worldwide operator

One of the main telecommunication operators in the world

Providing services to more than 170 million customers across five continents

Including 120 million under the Orange brand

Page 7:

KDD Cup 2009 organized by Orange
Customer Relationship Management (CRM)

Three marketing tasks: predict the propensity of customers
– to switch provider: Churn
– to buy new products or services: Appetency
– to buy upgrades or new options proposed to them: Up-selling

Objective: improve the return on investment (ROI) of marketing campaigns
– Increase the efficiency of the campaign for a given campaign cost
– Decrease the campaign cost for a given marketing objective

Better prediction leads to better ROI

Page 8:

Train and deploy requirements

– About one hundred models per month

– Fast data preparation and modeling

– Fast deployment

Model requirements
– Robust
– Accurate
– Understandable

Business requirement
– Return on investment for the whole process

Input data
– Relational databases
– Numerical or categorical
– Noisy
– Missing values
– Heavily unbalanced distribution

Train data
– Hundreds of thousands of instances
– Tens of thousands of variables

Deployment

– Tens of millions of instances

Data, constraints and requirements

Page 9:

In-house system
From raw data to scoring models

[Figure: Conceptual Data Model (model "MCD PAC_v4", diagram "Tiers Services", author claudebe, 14/06/2005). A French-language entity-relationship diagram of the Orange customer data, covering entities such as customers (Tiers), households (Foyer), addresses (Adresse), offers (Offre), products and services (Produit & Service), billing accounts (Compte Facturation), invoices (Facture), and usage records (Compte Rendu Usage), together with their relationships and cardinalities.]

Source data: Customer, Services, Products, Call details

Data warehouse
– Relational database

Data mart
– Star schema

Feature construction
– PAC technology
– Generates tens of thousands of variables

Data preparation and modeling
– Khiops technology

Example constructed variables: Id customer, zip code, Nb calls/month, Nb calls/hour, Nb calls/month,weekday,hours,service, …

[Diagram flow: data feeding → PAC → Khiops → scoring model]
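PAC itself is proprietary, but the kind of aggregate it generates is easy to picture. Below is a minimal, hypothetical Python sketch (not Orange's code) of building counts such as "Nb calls/month" from call-detail records; all field and function names are invented for illustration.

```python
# Hypothetical sketch of PAC-style aggregate feature construction from
# call-detail records (illustrative only, not Orange's PAC technology).
from collections import Counter, defaultdict
from typing import Iterable

def build_call_features(calls: Iterable[dict]) -> dict:
    """calls: records like {"customer": "c1", "month": 3, "weekday": 2,
    "hour": 17, "service": "voice"}. Returns one feature dict per customer."""
    features = defaultdict(Counter)
    for call in calls:
        cust = call["customer"]
        features[cust]["nb_calls"] += 1
        # One aggregate per combination of dimensions; PAC does this at scale,
        # which is how tens of thousands of variables arise.
        features[cust][f"nb_calls/month={call['month']}"] += 1
        features[cust][f"nb_calls/hour={call['hour']}"] += 1
        key = (call["month"], call["weekday"], call["hour"], call["service"])
        features[cust][f"nb_calls/{key}"] += 1
    return {cust: dict(counts) for cust, counts in features.items()}

calls = [
    {"customer": "c1", "month": 3, "weekday": 2, "hour": 17, "service": "voice"},
    {"customer": "c1", "month": 3, "weekday": 5, "hour": 9, "service": "sms"},
]
print(build_call_features(calls)["c1"]["nb_calls"])  # 2
```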

Page 10:

Design of the challenge

Orange business objective
– Benchmark the in-house system against state-of-the-art techniques

Data
– Data store
  – Not an option
– Data warehouse
  – Confidentiality and scalability issues
  – Relational data requires domain knowledge and specialized skills
– Tabular format
  – Standard format for the data mining community
  – Domain knowledge incorporated using feature construction (PAC)
  – Easy anonymization

Tasks
– Three representative marketing tasks

Requirements
– Fast data preparation and modeling (fully automatic)
– Accurate
– Fast deployment
– Robust
– Understandable

Page 11:

Data sets extraction and preparation

Input data
– 10 relational tables
– A few hundred fields
– One million customers

Instance selection
– Resampling given the three marketing tasks
– Keep 100 000 instances, with less unbalanced target distributions

Variable construction
– Using PAC technology
– 20 000 constructed variables to get a tabular representation
– Keep 15 000 variables (discard constant variables)
– Small track: subset of 230 variables related to classical domain knowledge

Anonymization
– Discard variable names, discard identifiers
– Randomize the order of variables
– Rescale each numerical variable by a random factor
– Recode each categorical variable using random category names

Data samples
– 50 000 train and test instances sampled randomly
– 5 000 validation instances sampled randomly from the test set
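A minimal Python sketch of those anonymization steps (illustrative only; the organizers' actual pipeline is not public in code form, and all names here are hypothetical):

```python
# Hypothetical sketch: drop variable names, rescale numerical variables by a
# random factor, recode categories with random names, shuffle column order.
import random

def anonymize(columns):
    """columns: dict name -> list of values (numeric or str)."""
    rng = random.Random(42)
    out = []
    for values in columns.values():          # variable names are discarded
        if all(isinstance(v, (int, float)) for v in values):
            factor = rng.uniform(0.5, 2.0)   # random rescaling factor
            out.append([v * factor for v in values])
        else:
            codes = {}                       # random category names
            out.append([codes.setdefault(v, f"V{rng.randrange(10**6)}")
                        for v in values])
    rng.shuffle(out)                         # randomize the variable order
    return out

data = {"age": [33, 41], "region": ["north", "north"], "spend": [12.5, 3.0]}
print(anonymize(data))
```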

Page 12:

Scientific and technical challenge

Scientific objective
– Fast data preparation and modeling: within five days
– Large scale: 50 000 train and test instances, 15 000 variables
– Heterogeneous data
  – Numerical with missing values
  – Categorical with hundreds of values
  – Heavily unbalanced distribution

KDD social meeting objective
– Attract as many participants as possible
  – Additional small track and slow track
  – Online feedback on the validation dataset
  – Toy problem (only one informative input variable)
– Leverage challenge protocol overhead
  – One month to explore descriptive data and test the submission protocol
– Attractive conditions
  – No intellectual property conditions
  – Money prizes
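For illustration, a dataset shaped like that toy problem, many variables of which only the first carries signal, is easy to synthesize (a hypothetical sketch, not the organizers' actual generator):

```python
# Hypothetical sketch of a toy problem: 100 input variables, only the first
# one informative (not the organizers' generator).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 100
X = rng.normal(size=(n, d))
# Label depends on column 0 only, plus noise; columns 1..99 are pure noise.
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
print(y.mean())  # roughly balanced
```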

Page 13:

Business impact of the challenge

Bring Orange datasets to the data mining community

– Benefit for the community
  – Access to challenging data
– Benefit for Orange
  – Benchmark of numerous competing techniques
  – Drive the research efforts towards Orange needs

Evaluate the Orange in-house system

– High number of participants and high quality of the results
– Orange in-house results:

– Improved by a significant margin when leveraging all business requirements

– Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness and understandability)

– Need to study the best challenge methods to get more insights

Page 14:

KDD Cup 2009: Result Analysis

Reference lines in the figures that follow:
– Best Result (over the period considered in the figure)
– In-House System (downloadable: www.khiops.com)
– Baseline (Naïve Bayes)
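For concreteness, a Naïve Bayes baseline evaluated by test AUC, in the spirit of the reference line above, can be reproduced in a few lines of scikit-learn (a sketch on synthetic data, not the organizers' setup):

```python
# Sketch of a Naive Bayes baseline scored by test AUC (synthetic data).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(1000, 20)), rng.normal(size=(500, 20))
y_train = (X_train[:, 0] + rng.normal(size=1000) > 0).astype(int)
y_test = (X_test[:, 0] + rng.normal(size=500) > 0).astype(int)

model = GaussianNB().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # class-1 probabilities as scores
print("test AUC:", roc_auc_score(y_test, scores))
```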

Page 15:

Overall – Test AUC – Fast

[Figure: test AUC vs. time; annotations: "Good Result Very Quickly", "Best Results (on each dataset)", "Submissions"]

Page 16:

Overall – Test AUC – Fast

[Figure: test AUC vs. time; annotations: "Good Result Very Quickly", "Best Results (on each dataset)", "Submissions"]

In-House (Orange) System:
• No parameters
• On one standard laptop (single processor)
• Treating the three tasks as three separate problems

Page 17:

Overall – Test AUC – Fast

Good results very fast; small improvement after the first day (83.85 → 84.93)

Page 18:

Overall – Test AUC – Slow

Very small improvement after the 5th day (84.93 → 85.2). Improvement due to unscrambling?

Page 19:

Overall – Test AUC – Submissions

Of the submissions with test AUC > 0.5:
• 23.24% scored below the baseline
• 15.25% scored above the in-house system
• 84.75% scored below the in-house system

Page 20:

Overall – Test AUC
'Correlation' Test / Valid

Page 21:

Overall – Test AUC
'Correlation' Test / Train

Random Values Submitted

Boosting Method or Train Target Submitted → Overfitting?

Page 22:

Overall – Test AUC

[Figure panels: Test AUC after 12 hours, 24 hours, 5 days, and 36 days]

Page 23:

Overall – Test AUC

[Figure panels: Test AUC after 12 hours and after 36 days]

Possible uses of the extra time:
• time to adjust model parameters?
• time to train ensemble methods?
• time to find more processors?
• time to test more methods
• time to unscramble?
• …

Difference between the best result at the end of the first day and the best result at the end of the 36 days: 1.35%

Page 24:

Test AUC = f (time)

Easier? Harder?

[Figure panels: Churn / Appetency / Up-selling – Test AUC – days [0:36]]

Page 25:

Test AUC = f (time)

Easier? Harder?

Difference between the best result at the end of the first day and the best result at the end of the 36 days:
– Churn: 1.84%
– Appetency: 1.38%
– Up-selling: 0.11%

[Figure panels: Churn / Appetency / Up-selling – Test AUC – days [0:36]]

Page 26:

Correlation: Test AUC / Valid AUC (5 days)

Easier? Harder?

[Figure panels: Churn / Appetency / Up-selling – Test/Valid – days [0:5]]

Page 27:

Correlation: Train AUC / Valid AUC (36 days)

Difficult to conclude anything…

[Figure panels: Churn / Appetency / Up-selling – Test/Train – days [0:36]]

Page 28:

Histogram: Test AUC / Valid AUC (days [0:5] vs. ]5:36])

Does the knowledge (parameters?) found during the first 5 days help afterwards?

[Figure panels: Churn / Appetency / Up-selling – Test AUC – days [0:36]]

Page 29:

Histogram: Test AUC / Valid AUC (days [0:5] vs. ]5:36])

Does the knowledge (parameters?) found during the first 5 days help afterwards? YES!

[Figure panels: Churn / Appetency / Up-selling – Test AUC – days [0:36] and days ]5:36]]

Page 30:

Fact Sheets: Preprocessing & Feature Selection

PREPROCESSING (overall usage = 95%) [bar chart, percent of participants]:
• Principal Component Analysis
• Other preprocessing
• Grouping modalities
• Normalizations
• Discretization
• Replacement of the missing values

FEATURE SELECTION (overall usage = 85%) [bar chart, percent of participants]:
• Wrapper with search
• Embedded method
• Other FS
• Filter method
• Feature ranking
• Forward / backward wrapper
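As an illustration of two of the most widely reported preprocessing steps, here is a minimal Python sketch (assuming numerical input with NaNs for missing values; not any participant's code) of mean imputation and equal-frequency discretization:

```python
# Illustrative sketch of common preprocessing: mean imputation of missing
# values, then equal-frequency discretization into quantile bins.
import numpy as np

def replace_missing(x: np.ndarray) -> np.ndarray:
    """Replace NaNs by the mean of the observed values."""
    x = x.copy()
    x[np.isnan(x)] = np.nanmean(x)
    return x

def equal_freq_discretize(x: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Equal-frequency discretization: map each value to a quantile bin index."""
    quantiles = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, quantiles)

x = np.array([1.0, np.nan, 3.0, 7.0, np.nan, 2.0])
print(equal_freq_discretize(replace_missing(x), n_bins=3))
```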

Page 31:

Fact Sheets: Classifier

CLASSIFIER (overall usage = 93%) [bar chart, percent of participants]:
• Bayesian Neural Network
• Bayesian Network
• Nearest neighbors
• Naïve Bayes
• Neural Network
• Other classifiers
• Non-linear kernel
• Linear classifier
• Decision tree…

- About 30% logistic loss, >15% exp loss, >15% sq loss, ~10% hinge loss.

- Less than 50% regularization (20% 2-norm, 10% 1-norm).

- Only 13% unlabeled data.
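For reference, the surrogate losses cited above, written as functions of the margin m = y·f(x) with y in {−1, +1} (definitions only, not any team's code):

```python
# Standard surrogate losses for binary classification as functions of the
# margin m = y * f(x), y in {-1, +1}.
import math

def logistic_loss(m): return math.log(1.0 + math.exp(-m))  # ~30% of teams
def exp_loss(m):      return math.exp(-m)                  # boosting-style
def squared_loss(m):  return (1.0 - m) ** 2                # equals (y - f)^2
def hinge_loss(m):    return max(0.0, 1.0 - m)             # SVM-style

for m in (-1.0, 0.0, 2.0):
    print(m, logistic_loss(m), exp_loss(m), squared_loss(m), hinge_loss(m))
```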

Page 32:

Fact Sheets: Model Selection

MODEL SELECTION (overall usage = 90%) [bar chart, percent of participants]:
• Bayesian
• Bi-level
• Penalty-based
• Virtual leave-one-out
• Other cross-validation
• Other MS
• Bootstrap estimate
• Out-of-bag estimate
• K-fold or leave-one-out
• 10% test

- About 75% ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).

- About 10% used unscrambling.
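K-fold cross-validation dominated the model selection strategies. A minimal sketch of CV-based model selection with AUC as the criterion, using scikit-learn on synthetic data (hypothetical, not a participant's pipeline):

```python
# Sketch of K-fold cross-validation for model selection, scored by AUC.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] > 0).astype(int)

# Pick the regularization strength with the best mean 5-fold AUC.
best_C, best_auc = None, -np.inf
for C in (0.01, 0.1, 1.0, 10.0):
    auc = cross_val_score(LogisticRegression(C=C, max_iter=1000),
                          X, y, cv=5, scoring="roc_auc").mean()
    if auc > best_auc:
        best_C, best_auc = C, auc
print(best_C, best_auc)
```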

Page 33:

Fact Sheets: Implementation

[Four pie charts, one per aspect]
• Memory: <= 2 GB, <= 8 GB, > 8 GB, >= 32 GB
• Parallelism: none, multi-processor, run in parallel
• Operating System: Windows, Linux, Unix, Mac OS
• Software Platform: C, C++, Java, Matlab, Other (R, SAS)

Page 34:

Winning methods

Fast track:

- IBM Research, USA +: Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)

- ID Analytics, Inc., USA +: Filter + wrapper FS. TreeNet by Salford Systems, an additive boosting decision-tree technology; bagging also used.

- David Slate & Peter Frey, USA: Grouping of modalities/discretization, filter FS, ensemble of decision trees.

Slow track:

- University of Melbourne: CV-based FS targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.

- Financial Engineering Group, Inc., Japan: Grouping of modalities, filter FS using AIC, gradient tree-classifier boosting.

- National Taiwan University +: Average of 3 classifiers: (1) Solve the joint multiclass problem with an l1-regularized maximum entropy model. (2) AdaBoost with a tree-based weak learner. (3) Selective Naïve Bayes.

(+: small dataset unscrambling)
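The slow-track winning recipe, boosted classification trees with shrinkage under a Bernoulli (logistic) loss, is available off the shelf today. The sketch below uses scikit-learn's GradientBoostingClassifier on synthetic data as a stand-in, not the winners' actual code; learning_rate plays the role of shrinkage:

```python
# Generic boosted-trees-with-shrinkage sketch (same family as the winning
# method, not the winners' code), evaluated by held-out AUC.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] * X[:, 1] + rng.normal(size=2000) > 0).astype(int)

# learning_rate is the shrinkage factor; the loss is the Bernoulli deviance.
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                   max_depth=3).fit(X[:1500], y[:1500])
scores = model.predict_proba(X[1500:])[:, 1]
print("held-out AUC:", roc_auc_score(y[1500:], scores))
```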

Page 35:

Conclusion

Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, which offered:

– A problem of real industrial interest with challenging scientific and technical aspects

– Prizes.

Lessons learned:

– Do not underestimate the participants: five days were given for the fast challenge, but a few hours sufficed for some participants.

– Ensemble methods are effective.
– Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed types of variables, and lots of missing values.