Adaptive Machine Learning for Credit Card Fraud Detection


Introduction Sampling Concept Drift Alert-Feedback Interaction Conclusions

Adaptive Machine Learning for Credit Card Fraud Detection

Andrea Dal Pozzolo

4/12/2015

PhD public defence

Supervisor: Prof. Gianluca Bontempi

1/ 68


THE Doctiris PROJECT

- 4-year PhD project funded by InnovIRIS (Brussels Region)
- Industrial project done in collaboration with Worldline S.A. (Brussels, Belgium)
- Requirement: 50% of the time at Worldline (WL) and 50% at MLG-ULB
- Goal: improve the data-driven part of the Fraud Detection System (FDS) by means of machine learning algorithms.


INTRODUCTION

- This thesis is about Fraud Detection (FD) in credit card transactions.
- FD is the process of identifying fraudulent transactions after the payment is authorized.
- This process is typically automated by means of an FDS.

Technical sections will be denoted by a dedicated symbol.


THE FRAUD DETECTION PROCESS

Figure: The fraud detection process. Terminal checks (correct PIN? sufficient balance? blocked card?) can reject a transaction; accepted transactions receive a fraud score from a predictive model, and investigators review the resulting alerts and provide feedback.


WHAT’S MACHINE LEARNING?

- The design of algorithms that discover patterns in a collection of data instances in an automated manner.
- The goal is to use the discovered patterns to make predictions on new data.

Figure: A machine learning algorithm builds a predictive model from training samples; the model is then applied to testing samples.

We let computers learn these patterns.


CHALLENGES FOR A DATA-DRIVEN FDS

1. Unbalanced distribution (few frauds).
2. Concept Drift (CD) due to fraud evolution and changes in customer behaviour.
3. Few recent supervised transactions (the true class is known only for transactions reviewed by investigators).
4. For confidentiality reasons, exchange of datasets and information is often difficult.


PCA PROJECTION

Figure: Transaction distribution over the first two Principal Components on different days.


RESAMPLING METHODS

Figure: Resampling methods (undersampling, oversampling, SMOTE) for unbalanced classification. The positive and negative symbols denote fraudulent and genuine transactions; in red, the new observations created by oversampling methods.
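As a minimal illustration of the simplest of these methods, random undersampling, the sketch below keeps every fraud and drops each genuine transaction independently with probability 1 − β. The function name and the '+'/'−' label encoding are illustrative, not the thesis implementation.

```python
import random

def undersample(transactions, labels, beta, seed=0):
    """Random undersampling of the majority (genuine, '-') class.

    beta is the probability of keeping a genuine transaction,
    i.e. the selection probability P(s = 1 | -) used in the slides.
    Frauds ('+') are always kept.
    """
    rng = random.Random(seed)
    kept = [(x, y) for x, y in zip(transactions, labels)
            if y == '+' or rng.random() < beta]  # draw only for '-'
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return xs, ys
```

With β = 0.1 a 100:1 unbalanced set becomes roughly 10:1, which is the "more balanced problem" referred to later in the deck.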


CONCEPT DRIFT

In data streams the concept to learn can change due to non-stationary distributions.

Figure: Illustrative example of different types of Concept Drift between time t and t + 1: class-prior change, within-class change, and class-boundary change.


THESIS CONTRIBUTIONS

1. Learning with unbalanced distributions:
   - impact of undersampling on the accuracy of the final classifier
   - racing algorithm to adaptively select the best unbalanced strategy
2. Learning from evolving and unbalanced data streams:
   - multiple strategies for unbalanced data streams
   - effective solution to avoid propagation of old transactions
3. Prototype of a real-world FDS:
   - realistic framework that uses all supervised information and provides accurate alerts
4. Software and credit card fraud detection dataset:
   - release of a software package and a real-world dataset


PUBLICATIONS

- A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi. Credit Card Fraud Detection with Alert-Feedback Interaction. Submitted to IEEE TNNLS, 2015.
- A. Dal Pozzolo, O. Caelen, and G. Bontempi. When is undersampling effective in unbalanced classification tasks? In ECML-PKDD, 2015.
- A. Dal Pozzolo, O. Caelen, R. Johnson, and G. Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In IEEE CIDM, 2015.
- A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi. Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information. In IEEE IJCNN, 2015.
- A. Dal Pozzolo, R. Johnson, O. Caelen, S. Waterschoot, N. Chawla, and G. Bontempi. Using HDDT to avoid instances propagation in unbalanced and evolving data streams. In IEEE IJCNN, 2014.
- A. Dal Pozzolo, O. Caelen, Y. Le Borgne, S. Waterschoot, and G. Bontempi. Learned lessons in credit card fraud detection from a practitioner perspective. Elsevier ESWA, 2014.
- A. Dal Pozzolo, O. Caelen, S. Waterschoot, and G. Bontempi. Racing for unbalanced methods selection. In IDEAL, 2013.


THE FRAUD DETECTION PROBLEM

- We formalize FD as a classification task f : R^n → {+, −}.
- X ∈ R^n is the input and Y ∈ {+, −} the output domain.
- + is the fraud (minority) class and − the genuine (majority) class.
- Given a classifier K and a training set T_N, we are interested in estimating, for a new sample (x, y), the posterior probability P(y = +|x).


THE FRAUD DETECTION DATA

- WL processes millions of transactions per day.
- When a transaction arrives it contains about 20 raw features (e.g. CARD_ID, AMOUNT, DATETIME, SHOP_ID, CURRENCY, etc.).
- Aggregated features are computed to include the cardholder's historical behaviour (e.g. AVG_AMOUNT_WEEK, TIME_FROM_LAST_TRX, NB_TRX_SAME_SHOP, etc.).
- The final feature vector contains about 50 variables.
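A minimal sketch of how such card-level aggregates can be derived from a transaction stream. The field names (`avg_amount_recent`, `time_from_last_trx`) and the five-transaction window are illustrative stand-ins, not WL's actual feature definitions.

```python
from collections import defaultdict, deque

def add_aggregates(transactions, window=5):
    """For each (card_id, amount, timestamp) tuple, append simple
    per-card aggregates: mean amount over the last `window`
    transactions of that card, and seconds since its previous
    transaction. Illustrative feature names only."""
    history = defaultdict(deque)  # card_id -> recent amounts
    last_ts = {}                  # card_id -> last timestamp seen
    out = []
    for card, amount, ts in transactions:
        h = history[card]
        avg = sum(h) / len(h) if h else 0.0
        gap = ts - last_ts.get(card, ts)  # 0 for a card's first trx
        out.append({"card": card, "amount": amount,
                    "avg_amount_recent": avg,
                    "time_from_last_trx": gap})
        h.append(amount)
        if len(h) > window:
            h.popleft()
        last_ts[card] = ts
    return out
```

Aggregates are computed from transactions strictly preceding the current one, so the feature vector contains no information about the transaction's own label.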


ACCURACY MEASURE OF A FDS

- In a detection problem it is more important to provide an accurate ranking than a correct classification.
- In a realistic FDS, investigators check only a limited number of transactions.
- Hence, P_k is the most relevant measure, with k being the number of transactions checked.
- P_k is the proportion of frauds in the k most risky transactions.

   k   Predicted Score   True Class
   1   0.99              +
   2   0.98              +
   3   0.98              -
   4   0.96              -
   5   0.95              +
   6   0.92              -
   7   0.91              +
   8   0.84              -
   9   0.82              -
  10   0.79              -
  11   0.76              -
  12   0.75              +
  13   0.70              -
  14   0.67              -
  15   0.64              +
  16   0.61              -
  17   0.58              -
  18   0.55              +
  19   0.52              -
  20   0.49              -
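P_k as defined above can be sketched in a few lines; applied to the first rows of the table on this slide, 3 of the 5 highest-scored transactions are frauds, so P_5 = 0.6.

```python
def precision_at_k(scores, labels, k):
    """P_k: fraction of frauds ('+') among the k highest-scored
    transactions, i.e. the alert-precision measure of the slides."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    top = ranked[:k]
    return sum(1 for _, y in top if y == '+') / k
```

Note that P_k only looks at the top of the ranking, which is why it can disagree with global measures such as AUC later in the deck.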


Theoretical contribution

CHAPTER 4: UNDERSTANDING SAMPLING METHODS

Publications:
- A. Dal Pozzolo, O. Caelen, and G. Bontempi. When is undersampling effective in unbalanced classification tasks? In ECML-PKDD, 2015.
- A. Dal Pozzolo, O. Caelen, R. Johnson, and G. Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In IEEE CIDM, 2015.
- A. Dal Pozzolo, O. Caelen, S. Waterschoot, and G. Bontempi. Racing for unbalanced methods selection. In IDEAL, 2013.


OBJECTIVE OF CHAPTER 4

In Chapter 4 we aim to:
- analyse the shift in the posterior distribution due to undersampling and study its impact on the final ranking;
- show under which conditions undersampling is expected to improve classification accuracy;
- propose a method to calibrate posterior probabilities and to select the classification threshold;
- show how to efficiently select the best unbalanced strategy for a given dataset.


EFFECT OF UNDERSAMPLING

- Suppose that a classifier K is trained on a set T_N which is unbalanced.
- Let s be a random variable associated with each sample (x, y) ∈ T_N: s = 1 if the point is sampled and s = 0 otherwise.
- Assume that s is independent of the input x given the class y (class-dependent selection):

  P(s|y, x) = P(s|y) ⇔ P(x|y, s) = P(x|y)

Figure: Undersampling removes majority-class examples at random, turning the unbalanced dataset into a balanced one. In red, the samples removed from the unbalanced dataset (s = 0).


POSTERIOR PROBABILITIES

Let p_s = P(+|x, s = 1) and p = P(+|x). We can write p_s as a function of p:

  p_s = p / (p + β(1 − p))    (1)

where β = P(s = 1|−). Using (1) we can obtain an expression of p as a function of p_s:

  p = β p_s / (β p_s − p_s + 1)    (2)

Figure: p and p_s at different β. Low values of β lead to a more balanced problem.
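Equations (1) and (2) are inverses of each other, which is what makes calibration possible: train on the undersampled set, then map the biased posterior back. A direct transcription (function names are mine):

```python
def bias_posterior(p, beta):
    """Eq. (1): posterior p_s observed after undersampling the
    majority class with selection probability beta = P(s=1|-)."""
    return p / (p + beta * (1.0 - p))

def calibrate_posterior(ps, beta):
    """Eq. (2): recover the unbiased posterior p from p_s."""
    return beta * ps / (beta * ps - ps + 1.0)
```

For a rare-fraud posterior p = 0.01 and β = 0.1, equation (1) inflates the estimate to roughly 0.09, and equation (2) maps it back exactly.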


SHIFT AND CLASS SEPARABILITY

Figure: Class distribution and posterior probability as a function of β for two univariate binary classification tasks with normal class-conditional densities X− ∼ N(0, σ) and X+ ∼ N(µ, σ) (on the left µ = 3, on the right µ = 15; in both examples σ = 3). Panel (a): p_s as a function of β; panel (b): class distributions. Note that p corresponds to β = 1 and p_s to β < 1.


CONDITION FOR A BETTER RANKING

Using ν_s > ν and by making a hypothesis of normality of the errors, K has a better ranking with undersampling when

  dp_s/dp > sqrt(ν_s / ν)    (3)

where dp_s/dp is the derivative of p_s w.r.t. p:

  dp_s/dp = β / (p + β(1 − p))^2
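Condition (3) is easy to evaluate numerically once the derivative is written out; the sketch below also checks the closed form against a finite difference of equation (1). `var_ratio` stands for ν_s/ν, which in practice is hard to estimate (as the discussion below notes), so this is purely illustrative.

```python
import math

def ps(p, beta):
    """Eq. (1): biased posterior after undersampling."""
    return p / (p + beta * (1.0 - p))

def dps_dp(p, beta):
    """Derivative of p_s w.r.t. p, as on this slide."""
    return beta / (p + beta * (1.0 - p)) ** 2

def undersampling_helps(p, beta, var_ratio):
    """Condition (3): ranking improves when dp_s/dp > sqrt(nu_s/nu).
    var_ratio = nu_s / nu is assumed known here (it is not, in practice)."""
    return dps_dp(p, beta) > math.sqrt(var_ratio)
```

For small p (rare frauds) and small β the derivative is large, so (3) tends to hold; near p = 0.5 with β = 1 it reduces to 1 and undersampling brings no ranking benefit.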


UNIVARIATE SYNTHETIC DATASET

Figure: Non-separable case. Panel (a): class-conditional distributions (thin lines) and the posterior distribution of the minority class (thicker line). Panel (b): dp_s/dp (solid lines) and sqrt(ν_s/ν) (dotted lines), i.e. the left-hand and right-hand terms of inequality (3), for β = 0.1 and β = 0.4.


REAL DATASETS

Figure: Difference between the Kendall rank correlations of p_s and p with p, namely K_s and K, for points for which condition (3) is satisfied and for points for which it is not. K_s and K are calculated as the mean of the correlations over all βs.


REAL DATASETS II

Figure: Ratio between the number of samples satisfying condition (3) and all the instances available in each dataset, averaged over all βs.


DISCUSSION

- When (3) is satisfied, the posterior probability obtained after sampling returns a more accurate ordering.
- Several factors influence (3) (e.g. β, the variance of the classifier, class separability).
- Practical use of (3) is not straightforward since it requires knowledge of p and ν_s/ν (not easy to estimate).
- The sampling rate should be tuned; using a balanced distribution is not always the best option.


SELECTING THE BEST STRATEGY

- With no prior information about the data distribution, it is difficult to decide which unbalanced strategy to use.
- No-free-lunch theorem [23]: no single strategy is consistently superior to all others in all conditions (i.e. algorithm, dataset and performance metric).
- Testing all unbalanced techniques is not an option because of the associated computational cost.
- We proposed to use the Racing approach [17] to perform strategy selection.


RACING FOR STRATEGY SELECTION

- Racing consists in testing a set of alternatives in parallel and using a statistical test to remove an alternative as soon as it is significantly worse than the others.
- We adopted the F-Race version [4] to search efficiently for the best strategy for unbalanced data.
- F-Race combines the Friedman test with Hoeffding Races [17].


RACING FOR UNBALANCED TECHNIQUE SELECTION

Automatically select the most adequate technique for a given dataset:

1. Test a set of alternative balancing strategies in parallel on a subset of the dataset.
2. Progressively remove the alternatives that are significantly worse.
3. Iterate the testing and removal steps until only one candidate is left or no more data is available.

            Candidate 1  Candidate 2  Candidate 3
  subset 1  0.50         0.47         0.48
  subset 2  0.51         0.48         0.30
  subset 3  0.51         0.47
  subset 4  0.60         0.45
  subset 5  0.55

Figure: Race snapshots over subsets of the original dataset as time proceeds; eliminated candidates are no longer evaluated on later subsets.


F-RACE METHOD

- Use 10-fold Cross Validation (CV) to provide the data during the race.
- Every time new data is added to the race, the Friedman test is used to remove significantly bad candidates.
- We compared CV and F-Race using different classifiers.

            Candidate 1  Candidate 2  Candidate 3
  subset 1  0.50         0.47         0.48
  subset 2  0.51         0.48         0.30
  subset 3  0.51         0.47
  subset 4  0.49         0.46
  subset 5  0.48         0.46
  subset 6  0.60         0.45
  subset 7  0.59

Figure: Race snapshot over subsets of the original dataset as time proceeds.


F-RACE VS CROSS VALIDATION

  Dataset       Exploration  Method  Ntest  %Gain  Mean   Sd
  ecoli         Race         Under   46     49     0.836  0.04
                CV           SMOTE   90     -      0.754  0.112
  letter-a      Race         Under   34     62     0.952  0.008
                CV           SMOTE   90     -      0.949  0.01
  letter-vowel  Race         Under   34     62     0.884  0.011
                CV           Under   90     -      0.887  0.009
  letter        Race         SMOTE   37     59     0.951  0.009
                CV           Under   90     -      0.951  0.01
  oil           Race         Under   41     54     0.629  0.074
                CV           SMOTE   90     -      0.597  0.076
  page          Race         SMOTE   45     50     0.919  0.01
                CV           SMOTE   90     -      0.92   0.008
  pendigits     Race         Under   39     57     0.978  0.011
                CV           Under   90     -      0.981  0.006
  PhosS         Race         Under   19     79     0.598  0.01
                CV           Under   90     -      0.608  0.016
  satimage      Race         Under   34     62     0.843  0.008
                CV           Under   90     -      0.841  0.011
  segment       Race         SMOTE   90     0      0.978  0.01
                CV           SMOTE   90     -      0.978  0.01
  estate        Race         Under   27     70     0.553  0.023
                CV           Under   90     -      0.563  0.021
  covtype       Race         Under   42     53     0.924  0.007
                CV           SMOTE   90     -      0.921  0.008
  cam           Race         Under   34     62     0.68   0.007
                CV           Under   90     -      0.674  0.015
  compustat     Race         Under   37     59     0.738  0.021
                CV           Under   90     -      0.745  0.017
  creditcard    Race         Under   43     52     0.927  0.008
                CV           SMOTE   90     -      0.924  0.006

Table: Results in terms of G-mean for the RF classifier.


DISCUSSION

- The best strategy is extremely dependent on the nature of the data, the algorithm adopted and the performance measure.
- F-Race is able to automatise the selection of the best unbalanced strategy for a given unbalanced problem without exploring the whole dataset.
- For the fraud dataset, the unbalanced strategy chosen had a big impact on the accuracy of the results.


CONCLUSIONS OF CHAPTER 4

- Undersampling is not always the best strategy for all unbalanced datasets.
- This result warns against a naive use of undersampling in unbalanced tasks.
- With racing [9] we can rapidly select the best strategy for a given unbalanced task.
- However, we observed that undersampling and SMOTE are often the best strategies to adopt for FD.


Methodological contribution

CHAPTER 5: LEARNING WITH CONCEPT DRIFT

Publications:
- A. Dal Pozzolo, O. Caelen, Y. Le Borgne, S. Waterschoot, and G. Bontempi. Learned lessons in credit card fraud detection from a practitioner perspective. Elsevier ESWA, 2014.
- A. Dal Pozzolo, R. Johnson, O. Caelen, S. Waterschoot, N. Chawla, and G. Bontempi. Using HDDT to avoid instances propagation in unbalanced and evolving data streams. In IEEE IJCNN, 2014.


OBJECTIVE OF CHAPTER 5

The objectives of this chapter are:
- investigating the benefit of updating the FDS;
- assessing the impact of Concept Drift (CD);
- comparing alternative learning strategies for CD;
- studying an effective way to deal with unbalanced distributions in a streaming environment.


LEARNING APPROACHES

We compare three standard learning approaches on transactions from February 2012 to May 2013.

  Approach              Strengths                         Weaknesses
  Static                Speed                             No CD adaptation
  Update                No instance propagation,          Needs several batches
                        CD adaptation                     for the minority class
  Propagate and Forget  Accumulates minority instances    Propagation leads to
                        faster, CD adaptation             larger training time

  # Days  # Features  # Transactions  Period
  422     45          2,202,228       1 Feb 12 - 20 May 13
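The three training-set constructions compared above can be sketched as follows. Function names, the batch encoding (lists of `(transaction, label)` tuples) and parameters `n`/`m` are mine; this is an illustration of the idea, not the thesis code.

```python
def static_train(batches, t, n):
    """Static: train once on the first n batches, never update."""
    return [x for b in batches[:n] for x in b]

def update_train(batches, t, m):
    """Update: retrain on a sliding window of the last m batches."""
    return [x for b in batches[max(0, t - m + 1): t + 1] for x in b]

def propagate_forget_train(batches, t, m, is_fraud):
    """Propagate and Forget: sliding window of recent batches, plus
    all older fraudulent transactions propagated forward (old
    genuine transactions are forgotten)."""
    window = [x for b in batches[max(0, t - m + 1): t + 1] for x in b]
    old_frauds = [x for b in batches[:max(0, t - m + 1)]
                  for x in b if is_fraud(x)]
    return old_frauds + window
```

Propagating only the rare fraud instances is what lets the third approach accumulate minority examples without the training set growing with the full stream.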


THE STATIC APPROACH

Figure: The Static approach: a single model M_t, trained on past batches, is used to score all subsequent batches B_{t+1}, ..., B_{t+4} (fraudulent and genuine transactions shown in the batches).


THE UPDATE APPROACH

Figure: The Update approach: at each step a new model is trained on a window of recent batches, and the models M_{t-3}, ..., M_t are combined into the ensemble E_t.


THE PROPAGATE AND FORGET APPROACH

Figure: The Propagate and Forget approach: old genuine transactions are forgotten with the sliding window, while past fraudulent transactions are propagated into the training sets of the models M_{t-3}, ..., M_t, combined into the ensemble E_t.


BEST OVERALL STRATEGY

The Propagate and Forget approach is the best, while the Static approach is the worst.

Figure: Comparison of all strategies (classifier / balancing method / update frequency / window size / approach) using the sum of ranks over all batches when using P_k as the accuracy measure.


DISCUSSION

- Without CD we would expect the three approaches to perform similarly.
- The Static approach is consistently the worst.
- Ensembles are often a better alternative than single models.
- Higher accuracy can be achieved by using SMOTE or undersampling the data.
- Instance propagation can be an alternative way to rebalance the data stream.


CONCLUSION OF CHAPTER 5

- Updating the models daily and combining them into an ensemble is an effective strategy to adapt to CD.
- Retaining positive samples and undersampling the data stream improve the accuracy of the FDS.
- However, it is not always possible to revisit or store old instances of a data stream.
- In that case we recommend using the Hellinger Distance Decision Tree [7] (HDDT) to avoid instance propagation.


Proposed FDS

CHAPTER 6: LEARNING WITH ALERT-FEEDBACK INTERACTION

Publications:
- A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi. Credit Card Fraud Detection with Alert-Feedback Interaction. Submitted to IEEE TNNLS, 2015.
- A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi. Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information. In IEEE IJCNN, 2015.


OBJECTIVE OF CHAPTER 6

- Propose a framework meeting realistic working conditions.
- Model the interaction between the FDS generating alerts and the investigators.
- Integrate investigators' feedback into the FDS.


REALISTIC WORKING CONDITIONS

- Alert-Feedback Interaction: the supervised samples depend on the alerts reported to investigators.
- Only a small set of alerts can be checked by human investigators.
- The labels of the majority of transactions are available only several days later (after customers have reported unauthorized transactions).
- Investigators trust an FDS only if it provides accurate alerts, i.e. alert precision (P_k) is the accuracy measure that matters.


SUPERVISED SAMPLES

- The k transactions with the largest P_{K_{t-1}}(+|x) define the alerts A_t reported to the investigators.
- Investigators provide feedback about the alerts in A_t, defining a set of k supervised couples (x, y):

  F_t = {(x, y), x ∈ A_t}    (4)

- F_t are the only immediate supervised samples.
- D_{t−δ} are transactions that have not been checked by investigators, but whose labels are assumed to be correct after δ days have elapsed.


SUPERVISED SAMPLES II

- F_t is a small set of risky transactions according to the FDS.
- D_{t−δ} contains all the transactions that occurred in a day (≈ 99% genuine transactions).

Figure: The supervised samples available at day t include: i) the feedbacks F_{t-6}, ..., F_t of the first δ days, and ii) the delayed couples D_{t-7}, D_{t-8}, ... that occurred before the δ-th day.


FEEDBACKS VS DELAYED SAMPLES

Learning from the feedbacks F_t is a different problem from learning from the delayed samples in D_{t−δ}:

- F_t provides recent, up-to-date information, while D_{t−δ} might already be obsolete once it arrives.
- The percentage of frauds in F_t and D_{t−δ} is different.
- Samples in F_t are selected as the most risky by K_{t−1}.
- A classifier trained on F_t learns how to label the transactions that are most likely to be fraudulent.

Feedbacks and delayed samples have to be treated separately.


LEARNING STRATEGY

A conventional solution for CD adaptation is W_t [20]. To learn separately from feedbacks and delayed transactions we propose A^W_t (the aggregation of F_t and W^D_t).

Figure: Learning strategies for feedbacks and delayed transactions occurring in the two days (M = 2) before the feedbacks (δ = 7).


CLASSIFIER AGGREGATIONS

W^D_t has to be aggregated with F_t to exploit the information provided by feedbacks. We combine these classifiers by averaging their posterior probabilities:

  P_{A^W_t}(+|x) = α_t P_{F_t}(+|x) + (1 − α_t) P_{W^D_t}(+|x)    (5)

A^W_t with α_t ≥ 0.5 gives feedbacks a larger influence on the probability estimates w.r.t. W_t.
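Equation (5) is a plain convex combination of the two classifiers' posteriors, transcribed below (the function name is mine):

```python
def aggregate_posterior(p_feedback, p_delayed, alpha=0.5):
    """Eq. (5): A^W_t combines the feedback classifier F_t and the
    delayed-sample classifier W^D_t through a weighted average of
    their posterior fraud probabilities; alpha >= 0.5 favours the
    feedback classifier."""
    return alpha * p_feedback + (1.0 - alpha) * p_delayed
```

With α_t = 0.5 this is a simple average; the adaptive-weighting slides later explore setting α_t from each classifier's recent accuracy.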


DATASETS

We used three datasets of e-commerce credit card transactions (fraud << 1%):

Table: Datasets

  Id    Start day   End day     # Days  # Transactions  # Features
  2013  2013-09-05  2014-01-18  136     21,830,330      51
  2014  2014-08-05  2014-12-31  148     27,113,187      51
  2015  2015-01-05  2015-05-31  146     27,651,197      51

Figure: Number of frauds per day in the 2013, 2014 and 2015 datasets.


SEPARATING FEEDBACKS FROM DELAYED SAMPLES

Aggregating two distinct classifiers (F_t with W^D_t) using A^W_t is better than pooling feedbacks and delayed samples together with W_t.

  dataset  classifier  mean  sd
  2013     F           0.62  0.25
  2013     WD          0.54  0.22
  2013     W           0.57  0.23
  2013     AW          0.70  0.21
  2014     F           0.59  0.29
  2014     WD          0.58  0.26
  2014     W           0.60  0.26
  2014     AW          0.69  0.24
  2015     F           0.67  0.25
  2015     WD          0.66  0.21
  2015     W           0.68  0.21
  2015     AW          0.75  0.20

Table: Average P_k with δ = 7, M = 16 and α_t = 0.5.


EXPERIMENTS WITH ARTIFICIAL CD

Figure: Average P_k per day of the sliding-window strategies A^W and W on the three datasets with artificial CD (CD1, CD2, CD3), using a moving average of 15 days. The vertical bar denotes the date of the CD.


CLASSIFIERS IGNORING AFI

Experiments with a classifier R_t trained on all recent transactions between t and t − δ. R_t makes the unrealistic assumption that all transactions are labeled by investigators.

Figure: R_t makes the unrealistic assumption that all transactions between t and t − δ have been checked by investigators.


CLASSIFIERS IGNORING AFI II

AUC is a global ranking measure, while P_k measures the ranking quality only on the top k.

(a) dataset 2013
  metric  classifier  mean  sd
  AUC     F           0.83  0.06
  AUC     WD          0.94  0.03
  AUC     R           0.96  0.01
  AUC     AW          0.94  0.03
  Pk      F           0.67  0.24
  Pk      WD          0.54  0.22
  Pk      R           0.60  0.22
  Pk      AW          0.72  0.20

(b) dataset 2014
  metric  classifier  mean  sd
  AUC     F           0.81  0.08
  AUC     WD          0.94  0.03
  AUC     R           0.96  0.02
  AUC     AW          0.93  0.03
  Pk      F           0.67  0.27
  Pk      WD          0.56  0.27
  Pk      R           0.63  0.25
  Pk      AW          0.71  0.25

(c) dataset 2015
  metric  classifier  mean  sd
  AUC     F           0.81  0.07
  AUC     WD          0.95  0.02
  AUC     R           0.97  0.01
  AUC     AW          0.94  0.02
  Pk      F           0.71  0.21
  Pk      WD          0.62  0.21
  Pk      R           0.68  0.20
  Pk      AW          0.76  0.19

Table: Average AUC and P_k for the sliding approach (δ = 15, M = 16 and α_t = 0.5).


ADAPTIVE WEIGHTING

Experiments with different weighting strategies for α_t in A^W_t.

(a) dataset 2013
  α adaptation      mean  sd
  probDif           0.70  0.22
  acc               0.72  0.21
  accMiss           0.72  0.21
  precision         0.72  0.21
  recall            0.71  0.21
  auc               0.72  0.21
  none (α_t = 0.5)  0.72  0.20

(b) dataset 2014
  α adaptation      mean  sd
  probDif           0.68  0.26
  acc               0.71  0.24
  accMiss           0.70  0.24
  precision         0.71  0.24
  recall            0.70  0.25
  auc               0.71  0.24
  none (α_t = 0.5)  0.71  0.24

(c) dataset 2015
  α adaptation      mean  sd
  probDif           0.73  0.21
  acc               0.76  0.19
  accMiss           0.75  0.19
  precision         0.75  0.19
  recall            0.74  0.20
  auc               0.75  0.19
  none (α_t = 0.5)  0.76  0.19

Table: Average P_k of A^W_t with adaptive α_t (δ = 15).


ADAPTIVE WEIGHTING II

Figure: P_{F_t}(+|x) and P_{W^D_t}(+|x) for alerts on different days.


CONCLUSION OF CHAPTER 6

- The goal of the FDS is to return precise alerts (high P_k).
- In realistic settings, feedbacks and delayed samples are the only supervised information available.
- Feedbacks and delayed samples address two different classification tasks.
- It is not convenient to pool the two types of supervised samples together.


CONCLUSIONS AND FUTURE PERSPECTIVES


SUMMARY OF CONTRIBUTIONS

In this thesis we:
- formalize realistic working conditions of an FDS;
- study how undersampling can improve the ranking of a classifier K;
- release an R package called unbalanced [8] with racing;
- investigate strategies for data streams with both CD and unbalanced distributions;
- prototype a real-world FDS that, for the first time, learns from investigators' feedback and allows us to produce more precise alerts.


LEARNED LESSONS

- Rebalancing a training set with undersampling is not guaranteed to improve performance.
- F-Race [4] can be effective for rapidly selecting the unbalanced technique to use.
- The data stream defined by credit card transactions is severely affected by CD; updating the FDS is often a good idea.
- For a real-world FDS it is mandatory to produce precise alerts.
- Feedbacks and delayed samples should be handled separately when training an FDS.
- Aggregating two distinct classifiers is an effective strategy and enables quicker adaptation to CD.


TAKE-HOME MESSAGES

- Class imbalance and CD seriously affect the performance of a classifier.
- In a real-world scenario, there is a strong Alert-Feedback Interaction that has to be explicitly considered.
- The best learning algorithm for an FDS changes with the accuracy measure.
- A standard classifier trained on all recent transactions is not the best choice when we want to produce accurate alerts.


FUTURE WORK: TOWARDS BIG DATA SOLUTIONS

- This work will be continued in the BruFence project: Big Data Mining for Fraud Detection and Security.
- Three research groups (ULB-QualSec, ULB-MLG and UCL-MLG) and three companies (Worldline, Steria and NViso).
- 2015 - 2018, funded by InnovIRIS (Brussels Region).


THE BruFence PROJECT

- Goal: implement the FDS of Chapter 6 using Big Data technologies (e.g. Spark, NoSQL databases).
- Real-time framework, scalable w.r.t. the amount of data and the resources available.
- Determine to what extent external network data can improve prediction accuracy.
- Compute aggregated features in real time and using a larger set of historical transactions.

Figure: Apache Big Data software.


ADDED VALUE FOR WL

- RF has emerged as the best algorithm (now used by WL).
- New ways to create aggregated features.
- Undersampling can be used to improve the accuracy of the FDS, but probabilities need to be calibrated.
- Our experimental analysis of CD gave WL guidelines on how to improve the FDS.
- Including feedbacks in the learning process allows WL to obtain more accurate alerts.


CONCLUDING REMARKS

I Doctiris was a unique opportunity to work on real-world fraud detection data.

I Real working conditions define new challenges (often ignored in the literature).

I Companies are more interested in practical results, while academics are more interested in theoretical ones.

I Industry and academia should look at each other and exchange ideas in order to have a much larger impact on society.


Website: www.ulb.ac.be/di/map/adalpozz
Email: [email protected]

Thank you for your attention


BIBLIOGRAPHY

[1] A. Asuncion and D. Newman. UCI machine learning repository, 2007.

[2] C. Alippi, G. Boracchi, and M. Roveri. A just-in-time adaptive classification system based on the intersection of confidence intervals rule. Neural Networks, 24(8):791-800, 2011.

[3] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. The Journal of Machine Learning Research, 11:1601-1604, 2010.

[4] M. Birattari, T. Stützle, L. Paquete, and K. Varrentrapp. A racing algorithm for configuring metaheuristics. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 11-18, 2002.

[5] D. A. Cieslak and N. V. Chawla. Detecting fractures in classifier performance. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 123-132. IEEE, 2007.

[6] D. A. Cieslak and N. V. Chawla. Learning decision trees for unbalanced data. In Machine Learning and Knowledge Discovery in Databases, pages 241-256. Springer, 2008.

[7] D. A. Cieslak, T. R. Hoens, N. V. Chawla, and W. P. Kegelmeyer. Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery, 24(1):136-158, 2012.

[8] A. Dal Pozzolo, O. Caelen, and G. Bontempi. unbalanced: Racing for Unbalanced Methods Selection, 2015. R package version 2.0.

[9] A. Dal Pozzolo, O. Caelen, S. Waterschoot, and G. Bontempi. Racing for unbalanced methods selection. In Proceedings of the 14th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL), 2013.

[10] C. Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pages 973-978, 2001.

[11] R. Elwell and R. Polikar. Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks, 22(10):1517-1531, 2011.

[12] J. Gao, B. Ding, W. Fan, J. Han, and P. S. Yu. Classifying data streams with skewed class distributions and concept drifts. IEEE Internet Computing, 12(6):37-49, 2008.

[13] T. R. Hoens, R. Polikar, and N. V. Chawla. Learning from streaming data with concept drift and imbalance: an overview. Progress in Artificial Intelligence, 1(1):89-101, 2012.

[14] M. G. Kelly, D. J. Hand, and N. M. Adams. The impact of changing populations on classifier performance. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 367-371. ACM, 1999.

[15] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18-22, 2002.

[16] R. N. Lichtenwalter and N. V. Chawla. Adaptive methods for classification in arbitrarily imbalanced and drifting data streams. In New Frontiers in Applied Data Mining, pages 53-75. Springer, 2010.

[17] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. Robotics Institute, page 263, 1993.

[18] C. R. Rao. A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió: Quaderns d'Estadística, Sistemes, Informàtica i Investigació Operativa, 19(1):23-63, 1995.

[19] M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21-41, 2002.

[20] D. K. Tasoulis, N. M. Adams, and D. J. Hand. Unsupervised clustering in streaming data. In ICDM Workshops, pages 638-642, 2006.

[21] V. N. Vapnik. Statistical Learning Theory, volume 1. Wiley, New York, 1998.

[22] S. Wang, K. Tang, and X. Yao. Diversity exploration and negative correlation learning on imbalanced data sets. In International Joint Conference on Neural Networks (IJCNN 2009), pages 3259-3266. IEEE, 2009.

[23] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341-1390, 1996.


OPEN ISSUES

I Agreeing on a good performance measure (cost-based vs. detection-based).
I Modelling the Alert-Feedback Interaction.
I Using both the supervised and unsupervised information available.


AVERAGE PRECISION

Let N+ be the number of positive (fraud) cases in the original dataset and TP_k the number of true positives in the first k ranked transactions (TP_k ≤ k). Precision and recall at k are defined as P_k = TP_k / k and R_k = TP_k / N+. We can then define AP as:

AP = Σ_{k=1}^{N} P_k (R_k − R_{k−1})    (6)

where N is the total number of observations in the dataset.
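As a sketch (not the evaluation code used in the thesis), the definition above can be computed directly from a list of labels sorted by decreasing fraud score; the function name and input layout are illustrative:

```python
def average_precision(ranked_labels):
    """Compute AP from labels sorted by decreasing fraud score.

    ranked_labels[k] is 1 if the (k+1)-th ranked transaction is a
    fraud, 0 otherwise. Implements AP = sum_k P_k * (R_k - R_{k-1}).
    """
    n_pos = sum(ranked_labels)          # N+ : number of frauds
    tp, ap, prev_recall = 0, 0.0, 0.0
    for k, label in enumerate(ranked_labels, start=1):
        tp += label                     # TP_k
        precision_k = tp / k            # P_k = TP_k / k
        recall_k = tp / n_pos           # R_k = TP_k / N+
        ap += precision_k * (recall_k - prev_recall)
        prev_recall = recall_k
    return ap

# A perfect ranking puts all frauds first and attains AP = 1.
print(average_precision([1, 1, 0, 0]))  # 1.0
print(average_precision([0, 1, 0, 1]))  # 0.5
```

Because the recall increments (R_k − R_{k−1}) are non-zero only at the frauds, AP rewards rankings that place frauds near the top, which is exactly what matters when only the first k alerts are inspected.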


RANKING ERROR

We have a wrong ranking when the estimates satisfy p̂1 > p̂2, where p̂i = pi + εi. Its probability is:

P(p̂2 < p̂1) = P(p2 + ε2 < p1 + ε1) = P(ε1 − ε2 > ∆p)

where ε1 − ε2 ∼ N(0, 2ν). By making a hypothesis of normality we have:

P(ε1 − ε2 > ∆p) = 1 − Φ( ∆p / √(2ν) )    (7)

where Φ is the cumulative distribution function of the standard normal distribution. The probability of a ranking error with undersampling is:

P(p̂s,2 < p̂s,1) = P(η1 − η2 > ∆ps)

and

P(η1 − η2 > ∆ps) = 1 − Φ( ∆ps / √(2νs) )    (8)
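Equations (7) and (8) are easy to evaluate numerically; the sketch below gets Φ from the standard-library error function (the helper names are mine):

```python
import math

def phi(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ranking_error(delta_p, nu):
    """P(ranking error) = 1 - Phi(delta_p / sqrt(2 * nu)), as in (7)/(8)."""
    return 1.0 - phi(delta_p / math.sqrt(2.0 * nu))

# With delta_p = 0 the ranking is a coin flip (probability 0.5), and the
# error probability grows with the variance nu of the estimation noise.
print(ranking_error(0.0, 0.01))   # 0.5
print(ranking_error(0.1, 0.01))
print(ranking_error(0.1, 0.04))
```

This makes the trade-off concrete: undersampling inflates the variance (νs > ν), so it only reduces ranking errors if it widens the gap ∆ps enough to compensate.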


CLASSIFICATION THRESHOLD

Let r+ and r− be the risks of predicting an instance as positive and negative:

r+ = (1 − p) · l1,0 + p · l1,1
r− = (1 − p) · l0,0 + p · l0,1

where li,j is the cost of predicting class i when the true class is j and p = P(y = +|x). A sample is predicted as positive if r+ ≤ r− [21].

Alternatively, predict as positive when p > τ with:

τ = (l1,0 − l0,0) / (l1,0 − l0,0 + l0,1 − l1,1)    (9)
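A quick numeric check that thresholding at τ from (9) agrees with comparing the two risks; the cost values are made up for illustration:

```python
def risk_positive(p, l10, l11):
    """r+ = (1 - p) * l_{1,0} + p * l_{1,1}."""
    return (1.0 - p) * l10 + p * l11

def risk_negative(p, l00, l01):
    """r- = (1 - p) * l_{0,0} + p * l_{0,1}."""
    return (1.0 - p) * l00 + p * l01

def bayes_threshold(l10, l00, l01, l11):
    """tau from eq. (9): predict positive when p > tau."""
    return (l10 - l00) / (l10 - l00 + l01 - l11)

# Illustrative costs: a missed fraud (FN) costs 5, a false alert (FP)
# costs 1, correct decisions cost nothing.
l10, l00, l01, l11 = 1.0, 0.0, 5.0, 0.0
tau = bayes_threshold(l10, l00, l01, l11)   # 1/6
for p in (0.05, 0.30, 0.90):
    risk_rule = risk_positive(p, l10, l11) <= risk_negative(p, l00, l01)
    threshold_rule = p > tau
    assert risk_rule == threshold_rule
```

Note how expensive false negatives push τ well below 0.5: with these costs a transaction is flagged as fraud as soon as p exceeds 1/6.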


CONDITION FOR A BETTER RANKING

K has a better ranking with undersampling w.r.t. a classifier learned with the unbalanced distribution when

P(p̂2 < p̂1) > P(p̂s,2 < p̂s,1)

P(ε1 − ε2 > ∆p) > P(η1 − η2 > ∆ps)    (10)

or, equivalently from (7) and (8), when

1 − Φ( ∆p / √(2ν) ) > 1 − Φ( ∆ps / √(2νs) )  ⇔  Φ( ∆p / √(2ν) ) < Φ( ∆ps / √(2νs) )

Since Φ is monotone non-decreasing and νs > ν, it follows that undersampling is useful (better ranking) when

dps / dp > √(νs / ν)


CORRECTING POSTERIOR PROBABILITIES

We propose to use p′, the bias-corrected probability obtained from ps:

p′ = β ps / (β ps − ps + 1)    (11)

Eq. (11) is a special case of the framework proposed by Saerens et al. [19] and Elkan [10] for correcting the posterior probability in the case of testing and training sets sharing the same priors.
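A minimal sketch of the correction in (11); here β is simply the undersampling parameter of the sampling chapter, taken as a number in (0, 1]:

```python
def correct_probability(ps, beta):
    """Eq. (11): bias-corrected probability p' from the posterior ps
    estimated on the undersampled training set."""
    return beta * ps / (beta * ps - ps + 1.0)

# beta = 1 means no undersampling, so the probability is unchanged;
# smaller beta shrinks ps back towards the unbalanced posterior.
print(correct_probability(0.5, 1.0))   # 0.5
print(correct_probability(0.5, 0.1))   # ~0.0909

# The map is monotone in ps, so the ranking induced by p' is the same
# as the one induced by ps -- calibration improves, ranking is kept.
assert correct_probability(0.3, 0.1) < correct_probability(0.6, 0.1)
```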


CORRECTING THE CLASSIFICATION THRESHOLD

When the costs of a FN (l0,1) and a FP (l1,0) are unknown, we can use the priors. Let l1,0 = π+ and l0,1 = π−; from (9) we get:

τ = l1,0 / (l1,0 + l0,1) = π+ / (π+ + π−) = π+    (12)

since π+ + π− = 1. Then we should use π+ as threshold with p:

p −→ τ = π+

Similarly,

ps −→ τs = πs+

From Elkan [10]:

( τ′ / (1 − τ′) ) · ( (1 − τs) / τs ) = β    (13)

Therefore, we obtain:

p′ −→ τ′ = π+


CREDIT CARD DATASET

Real-world credit card dataset with transactions from Sep 2013; frauds account for 0.172% of all transactions.

Figure : AUC as a function of β for LB, RF and SVM on the Credit-card dataset, comparing the probabilities p, p′ and ps.

Figure : Brier Score (BS) as a function of β for LB, RF and SVM on the Credit-card dataset, comparing the probabilities p, p′ and ps.


DISCUSSION

I As a result of undersampling, ps is shifted away from p.
I Using (11), we can remove the drift in ps and obtain p′, which has better calibration.
I p′ provides the same ranking quality as ps.
I Using p′ with τ′ we are able to improve calibration without losing predictive accuracy.
I Unfortunately, there is no way yet to select the best β automatically.


HELLINGER DISTANCE

I Quantifies the similarity between two probability distributions [18].

I Recently proposed as a splitting criterion in decision trees for unbalanced problems [6, 7].

I In data streams, it has been used to detect classifier performance degradation due to concept drift [5, 16].


HELLINGER DISTANCE II

Consider two probability measures P1, P2 of discrete distributions in Φ. The Hellinger distance is defined as:

HD(P1, P2) = √( Σ_{φ∈Φ} ( √P1(φ) − √P2(φ) )² )    (14)

The Hellinger distance has several properties:

I HD(P1, P2) = HD(P2, P1) (symmetric)
I HD(P1, P2) ≥ 0 (non-negative)
I HD(P1, P2) ∈ [0, √2]

HD is close to zero when the distributions are similar and close to √2 for distinct distributions.
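Eq. (14) fits in a few lines of Python; representing the distributions as dicts is my choice for the sketch, not the thesis code:

```python
import math

def hellinger(p1, p2):
    """Hellinger distance between two discrete distributions, eq. (14).

    p1, p2: dicts mapping each outcome in Phi to its probability.
    """
    support = set(p1) | set(p2)
    return math.sqrt(sum(
        (math.sqrt(p1.get(v, 0.0)) - math.sqrt(p2.get(v, 0.0))) ** 2
        for v in support))

same = {"a": 0.5, "b": 0.5}
disjoint = {"c": 1.0}
print(hellinger(same, same))      # 0.0: identical distributions
print(hellinger(same, disjoint))  # sqrt(2): disjoint supports
```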


HELLINGER DISTANCE DECISION TREES

I HDDT [6] uses the Hellinger distance as splitting criterion in decision trees for unbalanced problems.

I All numerical features are partitioned into q bins, so that only categorical variables remain.

I The Hellinger distance between the positive and the negative class is computed for each feature, and the feature with the maximum distance is used to split the tree.

HD(f+, f−) = √( Σ_{j=1}^{q} ( √(|f+_j| / |f+|) − √(|f−_j| / |f−|) )² )    (15)

where j indexes the jth bin of feature f and |f+_j| is the number of positive instances of feature f in the jth bin. Note that the class priors do not appear explicitly in (15).
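A sketch of the split score in (15) for a single binned feature; the bin counts are illustrative:

```python
import math

def hellinger_split(pos_bins, neg_bins):
    """Eq. (15): Hellinger distance between the positive and negative
    per-bin frequencies of one feature; higher means a better split."""
    n_pos, n_neg = sum(pos_bins), sum(neg_bins)
    return math.sqrt(sum(
        (math.sqrt(fp / n_pos) - math.sqrt(fn / n_neg)) ** 2
        for fp, fn in zip(pos_bins, neg_bins)))

# A feature whose bins separate the classes scores high. A feature with
# identical class-conditional frequencies scores 0 even under a heavy
# class imbalance (5+5 positives vs 50+50 negatives), which illustrates
# that the class priors do not enter (15).
separating = hellinger_split([9, 1], [1, 9])
uninformative = hellinger_split([5, 5], [50, 50])
print(separating, uninformative)
```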


AVOIDING INSTANCE PROPAGATION

I HDDT allows us to remove instance propagation between batches, with the following benefits:
I improved predictive accuracy
I improved speed
I a single pass through the data.

I An ensemble of HDDTs is used to adapt to CD and increase the accuracy of single classifiers.

I We test our framework on several streaming datasets with unbalanced classes and concept drift.


DATASETS

We used different types of datasets:
I UCI datasets [1] (unbalanced, no CD).
I Synthetic datasets from MOA [3] (artificial CD).
I A credit card dataset (online payment transactions between the 5th and the 25th of Sept. 2013).

Name         Source  Instances  Features  Imbalance Ratio
Adult        UCI        48,842        14            3.2:1
can          UCI       443,872         9           52.1:1
compustat    UCI        13,657        20           27.5:1
covtype      UCI        38,500        10           13.0:1
football     UCI         4,288        13            1.7:1
ozone-8h     UCI         2,534        72           14.8:1
wrds         UCI        99,200        41            1.0:1
text         UCI        11,162     11465           14.7:1
DriftedLED   MOA     1,000,000        25            9.0:1
DriftedRBF   MOA     1,000,000        11           1.02:1
DriftedWave  MOA     1,000,000        41           2.02:1
Creditcard   FRAUD   3,143,423        36          658.8:1


RESULTS CREDIT CARD DATASET

Figure 18 shows the sum of the ranks for each strategy over all the batches. For each batch, we assign the highest rank to the most accurate strategy and then sum the ranks over all batches.

Figure : Batch average results in terms of AUROC (higher is better) using different sampling strategies (BD, BL, SE, UNDER) and batch-ensemble weighting methods with C4.5 and HDDT over the Credit card dataset.

Figure : Sum of the ranks over all chunks for the Credit card dataset in terms of AUROC. In gray are the strategies that are not significantly worse than the best, i.e. the one with the highest sum of ranks.


HELLINGER DISTANCE BETWEEN BATCHES

Let Bt be the batch at time t used for training Kt and Bt+1 the testing batch. We compute the distance between Bt and Bt+1 for a given feature f using the Hellinger distance, as in [16]:

HD(B^f_t, B^f_{t+1}) = √( Σ_{v∈f} ( √(|B^{f=v}_t| / |Bt|) − √(|B^{f=v}_{t+1}| / |Bt+1|) )² )    (16)

where |B^{f=v}_t| is the number of instances whose feature f takes value v in the batch at time t, while |Bt| is the total number of instances in the same batch.


HELLINGER DISTANCE AND INFORMATION GAIN

We can make the assumption that feature relevance remains stable over time and define a new distance function that combines IG and HD as:

HDIG(Bt, Bt+1, f) = HD(B^f_t, B^f_{t+1}) · (1 + IG(Bt, f))    (17)

The final distance between two batches is then computed by taking the average over all the features:

AHDIG(Bt, Bt+1) = Σ_{f∈Bt} HDIG(Bt, Bt+1, f) / |f ∈ Bt|    (18)

Then Kt receives the weight wt = AHDIG(Bt, Bt+1)^(−M), where M is the ensemble size.
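A sketch of (16)-(18) on toy batches. The information-gain values are passed in precomputed (hypothetical numbers here), since IG comes from the training step; the batch representation as lists of dicts is also mine:

```python
import math

def hd_batches(batch_t, batch_t1, feature):
    """Eq. (16): Hellinger distance between two batches for one feature.
    Each batch is a list of dicts mapping feature names to values."""
    def freq(batch, value):
        return sum(1 for row in batch if row[feature] == value) / len(batch)
    values = {row[feature] for row in batch_t} | {row[feature] for row in batch_t1}
    return math.sqrt(sum(
        (math.sqrt(freq(batch_t, v)) - math.sqrt(freq(batch_t1, v))) ** 2
        for v in values))

def ahdig(batch_t, batch_t1, info_gain):
    """Eqs. (17)-(18): average over the features of HD * (1 + IG)."""
    return sum(hd_batches(batch_t, batch_t1, f) * (1.0 + ig)
               for f, ig in info_gain.items()) / len(info_gain)

stable = [{"country": "BE"}, {"country": "BE"}, {"country": "FR"}]
drifted = [{"country": "US"}, {"country": "US"}, {"country": "FR"}]
ig = {"country": 0.2}               # hypothetical IG(B_t, f)
print(ahdig(stable, stable, ig))    # 0.0: no drift between identical batches
print(ahdig(stable, drifted, ig))   # > 0: the feature distribution moved
```

Weighting each classifier by a decreasing function of this distance down-weights ensemble members trained on batches that no longer resemble the incoming data.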


ADAPTIVE WEIGHTING II

Let F^AW_t be the feedbacks requested by AWt and F^F_t the feedbacks requested by Ft. Let Y+_t (resp. Y−_t) be the fraudulent (resp. genuine) transactions at day t (Bt). We define the following sets:

I CORR_Ft = F^AW_t ∩ F^F_t ∩ Y+_t
I ERR_Ft = F^AW_t ∩ F^F_t ∩ Y−_t
I MISS_Ft = F^AW_t ∩ {Bt \ F^F_t} ∩ Y+_t


ADAPTIVE WEIGHTING II

In the example below we have |CORR_Ft| = 4, |ERR_Ft| = 2 and |MISS_Ft| = 3.

Figure : Feedbacks requested by AWt (F^AW_t) are a subset of all the transactions of day t (Bt). F^F_t (F^WD_t) denotes the feedbacks requested by Ft (WDt). The symbol + is used for frauds and − for genuine transactions.


ADAPTIVE WEIGHTING III

We can now compute some accuracy measures of Ft on the feedbacks requested by AWt:

I acc_Ft = 1 − |ERR_Ft| / k
I accMiss_Ft = 1 − (|ERR_Ft| + |MISS_Ft|) / k
I precision_Ft = |CORR_Ft| / k
I recall_Ft = |CORR_Ft| / |F^AW_t ∩ Y+_t|
I auc_Ft = probability that fraudulent feedbacks rank higher than genuine feedbacks in F^AW_t according to P_Ft.
I probDif_Ft = difference between the mean of P_Ft calculated on the frauds and the mean on the genuine transactions in F^AW_t.

If we choose auc_Ft as metric for Ft, and similarly auc_WDt for WDt, then αt+1 is:

αt+1 = auc_Ft / (auc_Ft + auc_WDt)    (19)
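The rank-based AUC on the requested feedbacks and the update (19) can be sketched as follows (the pairwise-comparison AUC estimate is standard; variable names are mine):

```python
def auc(scores, labels):
    """Probability that a fraudulent feedback (label 1) receives a
    higher score than a genuine one (label 0); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def alpha_update(auc_f, auc_wd):
    """Eq. (19): weight given to the feedback classifier F_t."""
    return auc_f / (auc_f + auc_wd)

# If F_t separates the feedbacks perfectly while WD_t is at chance
# level, the feedback classifier gets two thirds of the weight.
auc_f = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])   # 1.0
print(alpha_update(auc_f, 0.5))
```

The update is self-normalizing: equally accurate classifiers share the weight equally, and α shifts towards whichever classifier is currently ranking the feedbacks better.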


SELECTION BIAS AND AFI

I At day t, AFI provides Ft, which is not representative of Bt.
I ∀(x, y) ∈ Bt, let s ∈ {0, 1}, where s = 1 if (x, y) ∈ Ft and s = 0 otherwise (Ft ⊂ Bt).
I With Ft we get P_Kt(+|x, s = 1) rather than P_Kt(+|x).
I A standard solution to SSB consists in training a weight-sensitive algorithm, where (x, y) has weight w:

w = P(s = 1) / P(s = 1 | x, y)    (20)

I Ft are selected according to P(+|x); hence P(s = 1|x) and P(+|x) are highly correlated.
I We expect P(s = 1|x, y) to be larger for feedbacks, leading to smaller weights in (20).


INCREASING δ AND SSB CORRECTION

Table : Average Pk when δ = 15.

(a) sliding
dataset  classifier  mean  sd
2013     F           0.67  0.24
2013     WD          0.54  0.22
2013     AW          0.72  0.20
2014     F           0.67  0.27
2014     WD          0.56  0.27
2014     AW          0.71  0.25
2015     F           0.71  0.21
2015     WD          0.62  0.21
2015     AW          0.76  0.19

(b) ensemble
dataset  classifier  mean  sd
2013     F           0.67  0.23
2013     ED          0.48  0.23
2013     AE          0.72  0.21
2014     F           0.67  0.28
2014     ED          0.49  0.27
2014     AE          0.70  0.25
2015     F           0.71  0.22
2015     ED          0.56  0.18
2015     AE          0.75  0.20

Table : Average Pk of Ft with methods for SSB correction (δ = 15).

dataset  SSB correction  mean  sd
2013     SJA             0.66  0.24
2013     weighting       0.64  0.25
2014     SJA             0.66  0.27
2014     weighting       0.65  0.28
2015     SJA             0.71  0.22
2015     weighting       0.68  0.23


TRAINING SET SIZE AND TIME

Figure : Training set size (a) and training time in seconds (b) to train a RF for the feedback classifier (Ft) and the delayed classifiers (WDt and EDt) in the 2013 dataset.


FEATURE IMPORTANCE

Figure : Average feature importance, measured by the mean decrease in accuracy calculated with the Gini index, in the RF models of WDt in the 2013 dataset (panel: RF model of day 20130906). The randomForest package [15] available in R uses the Gini index as splitting criterion.
