
    Credit Card Transaction Fraud Detection

    Niall Adams

Department of Mathematics, Imperial College London

    January 2009


    Obligatory contents slide

My objective is to report some of our recent work on fraud detection, mostly sponsored under the EPSRC ThinkCrime initiative.

1. Transaction fraud problem, process, challenge

2. Supervised and unsupervised approaches (and data manipulation)

3. Combining methods

4. Streaming approach to handle change

Collaborators: David Hand, Dave Weston, Chris Whitrow, Piotr Juszczak, Dimitris Tasoulis, Christoforos Anagnostopoulos.

Collaborating banks: Abbey National, Alliance and Leicester, Capital One, Lloyds TSB.


    Another plastic card fraud anecdote...

    Fraud is a piece of cake?

Two very hungry German couriers ate a fruit cake destined for a German newspaper and in its place mailed a box of credit card data. The data, including names, addresses and card transactions, ended up at the Frankfurter Rundschau daily. The mix-up triggered an alarm, and police advised credit card customers with Landesbank Berlin to check their accounts for inconsistencies. Fruitcake must be different in Germany for people to want to use it as something other than a paperweight. (from Slashdot)


    Transaction fraud

Plastic-card fraud is a serious problem: losses to fraud in the UK in the first half of 2008 amounted to 307 million pounds. ...and once upon a time, this sounded like a lot of money... The problem is getting worse (source: www.cardwatch.org.uk).

There seems to be a perception that there is a fundamental background level of fraud. Losses to fraud are absorbed by customers, merchants and lenders. Thus the industry is very concerned with determining chargeback responsibility.


    Types of fraud

Transaction fraud is often grouped into a number of categories, including:

Counterfeit. Creating duplicate cards. Skimming refers to reading a card's magnetic strip to create a duplicate card.

Mail non-receipt fraud. Cards are intercepted in the post and used without permission. Extra effort may be required by the fraudster to get other information, perhaps by phishing.

Card not present fraud. Includes phone, internet and mail order fraud.

Chip&PIN simply shifted the problem.


    The nature of fraud attacks is changing

Online banking fraud is dramatically increasing: £21.4m in the first half of 2008, up 185% (source: APACS). Mostly phishing incidents, but increasingly money mule adverts. (Source: www.cardwatch.org.uk)

Fraudsters change tactics to adapt to bank security measures (e.g. HSBC now checking all transactions): an arms race. Fraud population behaviour is changing, but not the legitimate population's?


    Transaction stream

Plastic card transaction processing uses a very complicated IS infrastructure (e.g. Visa Europe processes 6000 transactions a second) to connect banks and merchants. Processing requirements include:

speed

fraud filtering while minimizing false positives


    Schematic processing path


    Challenges I

In this talk we will explore methods that could stand in for the fraud filter, or operate immediately after it. We will have to use a modified performance metric, and note that our data is subject to selection bias due to the fraud filter.

Temporal aspects:

each account consists of an irregularly spaced sequence of (complicated) transaction records

need for rapid processing

shifting fraud tactics

fraud identification delayed


    Challenges II

    Population and system factors

Imbalanced data sets (P(fraud) very small)


    Approaches

Most existing fraud filters are relatively simple supervised predictive models, based on very carefully selected variables. For example, FALCON is (essentially) a logistic regression on a large set of variables. Fundamentally, we can consider two approaches:

supervised learning - using the transaction fraud labels. A population approach

unsupervised learning - has a customer departed from normal behavior? An account-level approach

We will consider these approaches, and hybrids. Taking a tool-based approach means that different approaches need different features.


    Superficially:

Supervised:

use known fraud labels, so possibly resistant to unusual non-fraud transactions

implemented on a window, so using older fraud transactions, and not immediately responsive to each account

decision threshold specification straightforward (in principle)

Unsupervised:

respond to every transaction on an account

capacity to respond to new types of fraud, since not modeling known frauds

risk of higher false positive rate

setting some parameters less straightforward (in principle)


    Data

A typical transaction record has more than 70 fields, including:

    transaction value

    transaction time and date

transaction category (payment, refund, ATM, mobile top-up, etc.)

    ATM/POS indicator

Merchant category code - a large set, ranging from specific airlines to massage parlours

    card reader response codes

The fundamental problem is to select which data to extract. Moreover, different (supervised/unsupervised) tools will handle transactions differently.


    Performance assessment

Superficially, fraud detection looks like a two-class classification problem (fraud versus non-fraud) for which a suitable measure is AUC (area under the ROC curve).

However, AUC integrates over allocation costs. Moreover, there is a temporal aspect related to timeliness of detection.

Suppose that the cost of investigating a case is 1 unit. Both TP and FP incur this. Estimates from a collaborating bank suggest a missed fraud (FN) costs 100 such units. We construct a measure, TC, that accounts for the number of fraud and non-fraud transactions on an account, deploying this cost information.

Subtle arguments in Hand et al. (2008) show that this exact summary can be derived from an operating characteristic curve modified to account for the temporal ordering of transactions.
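To make the cost structure concrete, here is a minimal sketch of a cost-weighted loss with these unit costs. This is an illustration only, not the TC measure itself: TC additionally normalizes over the fraud and non-fraud transaction counts per account and respects temporal ordering (Hand et al., 2008). All names are mine.

```python
import numpy as np

def cost_weighted_loss(y_true, y_flag, c_investigate=1.0, c_missed=100.0):
    """Average cost per transaction: each flagged case (TP or FP) costs one
    investigation unit; each missed fraud (FN) costs c_missed units."""
    y_true = np.asarray(y_true, dtype=bool)   # 1 = fraud
    y_flag = np.asarray(y_flag, dtype=bool)   # 1 = flagged for investigation
    investigated = y_flag.sum()               # TP + FP
    missed = np.sum(y_true & ~y_flag)         # FN
    return (c_investigate * investigated + c_missed * missed) / y_true.size
```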


    Supervised methods

Perhaps the most natural approach - transactions are ultimately labeled as fraud or non-fraud - is two-class classification. There are many possible methods for this, ranging from logistic regression to support vector machines. The question is how to pre-process (e.g. alignment issues) the transaction database for presentation to the supervised learner.

We explored the approach of transaction aggregation: transforming transaction-level data to account-level data. Let x_i^{(j)} denote the fixed-length vector extracted from the jth transaction of account i. Then

$y_i = \Phi\left(x_i^{(1)}, \ldots, x_i^{(n)}\right)$

This is the activity record for account i, based on n sequential transactions. Φ is the transformation, which we restrict to be insensitive to the order of its arguments.
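As a concrete (hypothetical) instance of an order-insensitive Φ, the sketch below maps a list of raw transactions to a fixed-length activity record of counts, sums and averages. The field names are illustrative, not the variables actually used in this work.

```python
from statistics import mean

def activity_record(transactions):
    """Order-insensitive Phi: map the n transactions of one account
    (dicts with illustrative fields) to a fixed-length vector."""
    pos = [t for t in transactions if t["is_pos"]]
    return [
        len(transactions),                               # transaction count
        len(pos),                                        # number of POS transactions
        sum(t["value"] for t in pos),                    # value of POS transactions
        mean(t["value"] for t in transactions),          # average value overall
        sum(1 for t in transactions if t["magstripe"]),  # magnetic-strip reads
    ]

record = activity_record([
    {"value": 25.0, "is_pos": True, "magstripe": False},
    {"value": 60.0, "is_pos": False, "magstripe": True},
])
```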


We selected the variables for x using expert advice and extensive exploratory data analysis, exploring the relationship between variables and the fraud label. Variables included:

number of POS transactions

value of POS transactions

transactions identified by magnetic strip

simplified merchant category codes

The function Φ was tailored to compute various counts and averages (again, using extensive exploratory analysis, which seems hard to escape).


To illustrate, using 5 transactions:

Various tricks are used to handle time-of-day data. Note these types of approaches are reasonably standard in the industry. We end up with a maximum of 67 variables.


If any transaction in an activity record is labeled as fraud, then we deem all transactions in the record to be fraud. We fix the number of days in the activity record across the population, thereby inducing variable numbers of transactions per account. We experimented with the following classifiers to explore the impact of this length, considering activity records of 7 days, 3 days, 1 day, and 1 transaction (a sketch of the random forest configuration follows the list):

Logistic regression

Naive Bayes (all variables binned)

QDA (with some covariance regularization)

SVM with Gaussian RBF kernels, kernel width and regularization parameter set by experimentation

Random forests, using 200 bootstrap samples, and 10 variables considered at each split

CART, K-NN (both with some further tinkering)
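For instance, the random forest configuration quoted above could be expressed as follows in scikit-learn; the library choice and the stand-in data are my assumptions, only the two settings come from the slide.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 67))      # stand-in activity records (67 variables)
y = np.zeros(1000, dtype=bool)
y[:10] = True                        # stand-in fraud labels (rare class)

# 200 bootstrap samples (trees), 10 variables tried at each split,
# matching the settings quoted in the list above.
rf = RandomForestClassifier(n_estimators=200, max_features=10)
rf.fit(X, y)
scores = rf.predict_proba(X)[:, 1]   # estimated P(fraud) per activity record
```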


To recap, we occupy a feature space of activity records, of length 1, 3 or 7 days, built using consecutive windows. Each object in this space has a fraud label, and we use a variety of classifiers, of various expressive power, to make predictions.

These methods are deployed on real data samples from commercial collaborators, consisting of tens or hundreds of millions of transactions.

We try to use the data fairly, so quote out-of-sample predictions respecting the temporal ordering of the data.


    Bank A

TC = 0.09 corresponds to guessing. Standard error approx. 0.001 (bootstrap). In general, longer records are better. Best performance from the random forest.


    Bank B

Mostly, longer activity records are better. Random forests are again the best method. Note the different performance on different banks - different customer bases. Mixing different-length records would be of interest, and remains future work.


    Unsupervised Methods: Account level anomaly detection

We attempt to detect departure from normal behavior: construct a density estimate over suitably defined transaction features, then flag new transactions with low estimated density.

Three accounts, differences in time between consecutive transactions:

[Figure: three estimated pdfs of inter-transaction time (seconds, scale ×10^5), for accounts with 30, 63 and 53 transactions.]

All these accounts are (?) legitimate. An account-level approach avoids the need to handle this source of heterogeneity.


We consider a two-stage approach:

1. Estimation stage - accumulate enough transactions to construct a model of normal behaviour. We use a fixed number, but this is a free parameter.

2. Operational stage - use the model of behaviour to flag transactions as normal or abnormal. Treat abnormal as fraud.

Generic issues to handle: choice of model, choice of threshold, method of handling the temporal nature of the data.


For account i, we have the transaction sequence

$X_i = \{\, x_t \mid x_t \in \mathbb{R}^N,\; t = 1, 2, \ldots \,\}$

Here, we have chosen to represent the transaction record as a collection of continuous variables. Of course, other options are possible. Some trickery is required to handle categorical variables like merchant category codes.


For a specific account, suppose we have legitimate transaction data X; then our detector for a new transaction x is

$h(x \mid X, \theta) = I\big(p(x \mid X, \theta) > \psi\big) = \begin{cases} 1 & x \text{ is classified as legitimate} \\ 0 & x \text{ is classified as fraud} \end{cases}$

Here p(·) is a density estimate (the model), and θ refers to the control parameters of the model. ψ is the alert threshold. It is difficult to set without context, but one possibility is to relate it to the maximum proportion of flagged cases that we can afford to investigate.
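A minimal sketch of this detector, taking p(·) to be a Gaussian kernel density estimate and setting ψ from the proportion of flagged cases we can afford to investigate. The helper name and the quantile rule are illustrative assumptions, not the talk's implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_detector(X_legit, flag_fraction=0.01):
    """Fit p(. | X) by Gaussian KDE on an account's legitimate transactions
    (rows of X_legit), and set psi so that roughly flag_fraction of the
    training points would be flagged."""
    kde = gaussian_kde(X_legit.T)                  # p(x | X, theta)
    psi = np.quantile(kde(X_legit.T), flag_fraction)
    # h(x) = 1: legitimate, 0: flagged as potential fraud
    return lambda x: int(kde(np.atleast_2d(x).T)[0] > psi)

rng = np.random.default_rng(1)
h = fit_detector(rng.normal(size=(200, 2)))        # stand-in features
print(h(np.array([0.0, 0.2])), h(np.array([8.0, 8.0])))  # typically: 1 0
```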


[Figure: two panels of transactions, money [p] against time of day [s].]


    Models

    We explored many possibilities, including

    Kernel density estimate (Parzen)

    Naive Parzen (NParzen)

    Mixture of Gaussians (MoG)

    Gaussian (Gauss)

    nearest neighbour (1-NN)

    etc, etc...

Control parameters are difficult to set; various procedures, or arbitrarily fixed. ATM and merchant type are represented as distances and modelled with a linear programming data description: essentially, find the distance of each point from a representative plane, and transform it to have the character of a probability.



    Features

    We represent the jth transaction as

    amount

    amount difference

    time

    time difference (crude method of incorporating some temporalstructure)

    Merchant type *

    ATM-id *

* categorical variables. Of course, the selection of variables could be optimised, but this might be impractical in the streaming context.


    Some results

Same data sets as before, but different features, so we avoid direct comparison with the supervised results. Performance order, two banks, two measures (TC and AUC):

D1
performance curve: SVDD, MST, 1-NN, Gauss, NParzen, SOM, MoG, MPM, Parzen
ROC: SVDD, MST, NParzen, SOM, 1-NN, MPM, Gauss, MoG, Parzen

D2
performance curve: SVDD, MST, 1-NN, NParzen, SOM, Gauss, MoG, Parzen, MPM
ROC: SVDD, MST, NParzen, SOM, 1-NN, Gauss, MoG, Parzen, MPM

Supervised classifiers built on this data exhibit similar performance.


It is of interest to examine performance into the future. Build a supervised classifier on the same data as the unsupervised one, then examine performance into the future (fixed costs, false positive rates):

[Figure: false positive rate by month (months 2-6) for Parzen, 1-NN and SVM detectors, comparing one-class and two-class classifiers.]

Evidence that the account-level approach degrades more gracefully over time.


    Unsupervised Methods: Peer group analysis

Peer group analysis (PGA) is a new method, attempting to use more than just a single account's data for anomaly detection.

Premise: some accounts exhibit similar behaviour (i.e. follow similar trajectories through some feature space). Use the anomaly concept, but incorporating the behaviour of similar accounts. A two-stage process: (1) learn the identity of similar accounts (temporal clustering); (2) anomaly detection over the similar accounts. (A rough sketch follows below.)

Lots of implementation issues! One instantiation is not competitive with the previous approaches, but it does identify objects that are not simply population outliers.
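A rough sketch of the two stages under strong simplifying assumptions: Euclidean distance between account summary vectors for stage 1, and a standardized distance from the peer group's current centre for stage 2. The real method involves temporal clustering and many implementation choices not shown here.

```python
import numpy as np

def peer_group(summaries, i, k=10):
    """Stage 1: indices of the k accounts whose summary vectors lie
    closest (Euclidean) to account i's summary."""
    d = np.linalg.norm(summaries - summaries[i], axis=1)
    return np.argsort(d)[1:k + 1]          # skip account i itself

def pga_score(x_now, peers_now):
    """Stage 2: distance of the account's current feature vector from
    its peer group's current centre, in per-dimension standard deviations."""
    mu = peers_now.mean(axis=0)
    sd = peers_now.std(axis=0) + 1e-9      # guard against zero variance
    return float(np.linalg.norm((x_now - mu) / sd))
```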


Combination

We cannot practically run different detectors in parallel. Combination is essential, and perhaps yields improved performance?

Again, different approaches to combination are possible. Perhaps the most elegant is to incorporate the unsupervised scores into the supervised method, but this is technically and practically difficult. Instead, we take the output of each detector and consider how to combine them. For each transaction we have a score from each of a random forest, an SVM-based anomaly detector, and an instantiation of PGA. We normalize all scores to have the character of P(fraud).


With each transaction represented by three scores (three variables), one from each detection sub-system, we can consider different sorts of combiner (sketched below):

ad hoc: max

supervised: logistic regression, naive Bayes, K-NN
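A sketch of the combiners: stack the three normalized sub-system scores as a three-column feature matrix, take the row maximum for the ad hoc combiner, and fit, e.g., a logistic regression for a supervised one. scikit-learn and the stand-in data are my assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
y = np.zeros(n, dtype=bool)
y[:100] = True                                    # stand-in fraud labels
# Three normalized scores per transaction: random forest, SVM anomaly, PGA.
scores = rng.random((n, 3))

ad_hoc = scores.max(axis=1)                       # the ad hoc "max" combiner

combiner = LogisticRegression().fit(scores, y)    # a supervised combiner
combined = combiner.predict_proba(scores)[:, 1]   # character of P(fraud)
```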


We build all sub-systems on the first part of the data, and predict on the second. Note: no model updating.

Method            Loss %   AUC %
Random forest      8.63     68.5
SVM anomaly        9.08     54.8
PGA                9.08     32.8
logistic           8.14     67.9
NB                 7.23     87.9
133-NN (2 var*)    7.22     88.2
123-NN (3 var)     7.04     88.5

(standard errors approximately 0.1 on both columns)

* not using PGA; K selected by a CV study. Strikingly, PGA, which has no standalone merit, may add a little to the performance of a suitably constructed combiner.


Combination strategies can certainly provide improved performance. We are still working out why: one point is that PGA works on histories with frequent transactions, while account-level detection is better for infrequent transactions.

So, the tools can be put together, but we have ignored the issue of change over time. All these methods have been built on static windows of data. This is consistent with the industry norm: build the detector, monitor its performance, and rebuild when performance is deemed to have degraded too far.

Clearly, a static window gives some capacity to handle changing populations (old data not relevant). But there may be a better way to do it...

Temporal adaptation - current work


Consider the problem of computing the mean vector and covariance matrix of a sequence of n multivariate vectors. Standard results say this computation can be implemented as a recursion:

$m_t = m_{t-1} + x_t, \quad \mu_t = m_t/t, \quad m_0 = 0 \quad (1)$

$S_t = S_{t-1} + (x_t - \mu_t)(x_t - \mu_t)^T, \quad \Sigma_t = S_t/t, \quad S_0 = 0 \quad (2)$

After n steps, this gives the equivalent offline result. If we are monitoring vectors coming from a non-stationary system, simple averaging of this type is biased. If we knew the precise dynamics of the system, we would have a chance of constructing an optimal filter. However, we do not.


One approach to tracking the mean value would be to run with a window. Alternatively, we can use ideas from adaptive filter theory, and incorporate a forgetting factor λ ∈ (0, 1] in the previous recursion:

$n_t = \lambda n_{t-1} + 1, \quad n_0 = 0 \quad (3)$

$m_t = \lambda m_{t-1} + x_t, \quad \mu_t = m_t/n_t \quad (4)$

$S_t = \lambda S_{t-1} + (x_t - \mu_t)(x_t - \mu_t)^T, \quad \Sigma_t = S_t/n_t \quad (5)$

λ down-weights old information more smoothly than a window. n_t is the effective sample size, or memory. λ = 1 gives the offline solutions, with n_t = n after n steps. For fixed λ < 1 the memory size tends to 1/(1 − λ) from below. (A sketch implementation follows.)
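Recursions (3)-(5) transcribe almost directly into code; a minimal sketch, with numerical safeguards omitted:

```python
import numpy as np

class ForgettingTracker:
    """Recursions (3)-(5): mean and covariance of a vector stream with
    forgetting factor lam in (0, 1]; lam = 1 recovers the offline result."""
    def __init__(self, dim, lam=0.99):
        self.lam = lam
        self.n = 0.0                      # n_0 = 0
        self.m = np.zeros(dim)            # m_0 = 0
        self.S = np.zeros((dim, dim))     # S_0 = 0
        self.mu = np.zeros(dim)

    def update(self, x):
        self.n = self.lam * self.n + 1.0                 # (3)
        self.m = self.lam * self.m + x                   # (4)
        self.mu = self.m / self.n
        r = x - self.mu
        self.S = self.lam * self.S + np.outer(r, r)      # (5)
        return self.mu, self.S / self.n                  # mu_t, Sigma_t
```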

Setting λ


There are two choices for λ: a fixed value, or variable forgetting, λ_t. Fixed forgetting: set by trial and error. Variable forgetting: results from Haykin (1997) (adaptive filter theory) say to tune λ_t according to a local gradient descent rule on the squared residual:

$\lambda_t = \lambda_{t-1} - \alpha \, \nabla_{\lambda}\, \epsilon_t^2, \qquad \epsilon_t \text{ the residual error at time } t, \; \alpha \text{ small} \quad (6)$

Amazingly, using results from numerical linear algebra, this framework can still yield efficient updating rules. Performance is very sensitive to α. Very careful implementation is required.

We are exploring extending this idea, to construct a framework for sequential likelihood estimation with forgetting.
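To show the flavour of rule (6) without the matrix machinery, here is a scalar-mean sketch: the gradient of the one-step squared residual with respect to λ is maintained by differentiating the forgetting recursions themselves. This is my derivation for the scalar case, not Haykin's exact algorithm.

```python
import numpy as np

def adaptive_mean(xs, lam=0.99, alpha=1e-4, bounds=(0.9, 0.9999)):
    """Track a scalar mean with self-tuning lam: gradient descent on the
    one-step squared residual (x_t - mu_{t-1})^2."""
    n = m = mu = 0.0
    dn = dm = dmu = 0.0                  # derivatives of n, m, mu w.r.t. lam
    trace = []
    for x in xs:
        if n > 0:                        # descend on the residual gradient
            eps = x - mu
            lam = float(np.clip(lam + 2.0 * alpha * eps * dmu, *bounds))
        dn = n + lam * dn                # d/d(lam) of n_t = lam n_{t-1} + 1
        dm = m + lam * dm                # d/d(lam) of m_t = lam m_{t-1} + x_t
        n = lam * n + 1.0
        m = lam * m + x
        dmu = (dm * n - m * dn) / n**2   # quotient rule for mu_t = m_t/n_t
        mu = m / n
        trace.append((mu, lam))
    return trace
```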

    Illustration


    Tracking mean and covariance in 2d


Change detection properties, two fixed values of λ, 5 dimensions, abrupt change:

[Figure: value of gradient against time; reaction of the gradient to an abrupt change at t = 1000, for λ fixed at 0.99 and at 0.95.]

    Streaming classifier


Since we have a method for adaptively and incrementally estimating mean vectors and covariance matrices, we can now consider an adaptive version of Gaussian-based classification (since these methods only require means and covariances).

Recall

$P(c \mid x) = \frac{f(x \mid c)\, P(c)}{f(x)}$

Change can happen in various ways, but population drift usually refers to the class prior P(c) and/or the class-conditional densities f(x | c).

Linear/quadratic discriminant analysis (LDA/QDA) is motivated by reasoning that f(x | c) ~ N(μ_c, Σ_c). LDA: assume the covariance matrix is common across classes. QDA: different covariances.


Now, instead of using static estimates of μ and Σ, use the adaptive estimates. We can use the same adaptive forgetting factor to handle changing prior probabilities as well. This leads to a number of ways of constructing a stream classifier.

This requires some amount of hackery, because the theory requires regularly-spaced data in time. To handle this, we simply update every time an observation arrives.

Also, to test the idea, we provide the fraud flag immediately after classification (unrealistic, but we have some ideas for the real problem).


Using 18 variables, based on the numerical elements of the transaction record, and a means of coding merchant category codes, we obtain the following performance over 5 years (AUC measured once a month). (Note, a little regularisation is required.)

Performance of the static versions (adaptive windows) is very poor. This model can be implemented very efficiently.


    What does the adaptive forgetting factor show?

While rebuilding the detector will always be needed, this approach might help mitigate some losses due to population drift.


In the context of the whole problem, we have the opportunity to place the time-adaptation in different places. This suggests the following intriguing idea for fraud detection between system rebuilds:

    Conclusions


Transaction fraud detection is an important, but hard, real-world problem. A significant amount of engineering is required to produce effective solutions.

Different modeling approaches and tools can have merit - and it appears they can be effectively combined. This suggests that the different tools are capturing the fraud signal in non-overlapping ways.

We have the problem of handling changing populations (arms race, economic drift, etc.). Preliminary results suggest that temporally adaptive methods may have some utility in this context.

    Future Work


Explore continuous updating and adaptation of the subsystems and combiner.

Extend the adaptive classifier to finite mixture models (more flexible), approximate logistic regression and RBF networks.

More realistically handle the delayed fraud label.
