Outlier and Fraud Detection Using Hadoop


DESCRIPTION

Using Hadoop for outlier and fraud detection with proximity-based techniques

TRANSCRIPT

Page 1: Outlier and fraud detection using Hadoop


Outlier and Fraud Detection

Big Data Science Meetup, July 2012

Fremont, CA

Page 2: Outlier and fraud detection using Hadoop


About Me

Pranab Ghosh

• 25+ years in the industry
• Worked with various technologies and platforms
• Worked for startups, Fortune 500 companies, and everything in between
• Big data consultant for the last few years
• Currently a consultant at Apple
• Active blogger. Blog site: http://pkghosh.wordpress.com/
• Owner of several open source projects. Project site: https://github.com/pranab
• Passionate about data and finding patterns in data

Page 3: Outlier and fraud detection using Hadoop

My Open Source Hadoop Projects

• Recommendation engine (sifarish) based on content-based and social recommendation algorithms

• Fraud analytics (beymani) using proximity and distribution model based algorithms. Today’s talk is related to this project.

• Web clickstream analytics (visitante) for descriptive and predictive analytics


Page 4: Outlier and fraud detection using Hadoop

Outlier Detection

• Data that do not conform to normal and expected patterns are outliers
• Wide range of applications in various domains, including finance, security, and intrusion detection in cyber security
• The criteria for what constitutes an outlier depend on the problem domain
• Typically involves large amounts of data, which may be unstructured, creating an opportunity to use big data technologies


Page 5: Outlier and fraud detection using Hadoop

Data Type

• Instance data, where the outlier detection algorithm operates on an individual instance of data, e.g., a particular credit card transaction involving a large amount of money spent on an unusual product

• Sequence data with temporal or spatial relationships. The goal of outlier detection is to find unusual sequences, e.g., in intrusion detection and cyber security.

• Our focus is on outlier detection for instance data using Hadoop. We will be using credit card transaction data as an example.


Page 6: Outlier and fraud detection using Hadoop

Challenges

• Defining the normal regions in a data set is the main challenge. The boundary between normal and outlier may not be crisply defined.

• The definition of normal behavior may evolve over time. What is normal today may be considered anomalous in the future, and vice versa.

• In many cases, malicious adversaries adapt to make their operations look normal and try to stay undetected


Page 7: Outlier and fraud detection using Hadoop

Instance Based Analysis

• Supervised classification techniques using labeled training data with normal and outlier data, e.g., Bayesian filtering, neural networks, support vector machines. Not very reliable because of the lack of labeled outlier data

• Multivariate probability distribution based. Data points with low probability are likely to be outliers

• Proximity based approaches. Distances between data points are calculated in a multi-dimensional feature space

• Relative density based. Density is the inverse of the average distance to neighbors


Page 8: Outlier and fraud detection using Hadoop

Instance Based Analysis (contd)

• Shared nearest neighbor based. We consider the number of shared neighbors between neighboring data points.

• Clustering based. Data points with poor cluster membership are likely outliers.

• Information theory based. The inclusion of outliers increases the entropy of the data set. We identify data points whose removal causes a large drop in the entropy of the data set

• … and many more techniques.



Page 9: Outlier and fraud detection using Hadoop

Sequence Based Analysis

• Having a list of known sequences corresponding to malicious behavior and detecting those in the data. Does not work well for new and unknown threats

• Markov chain, which considers observable states and the probability of transitions between states

• Hidden Markov Model, where the system has both hidden and observable states


Page 10: Outlier and fraud detection using Hadoop

Model Based vs Memory Based

• As you may have observed, with some of the methods we build a model from the training data and apply the model to detect outliers

• With the other methods, we don’t build a model but use the existing data directly to detect outliers

• The technique we will discuss today is based on the latter approach, i.e., it is memory based


Page 11: Outlier and fraud detection using Hadoop

Average Distance to k Neighbors

• We find the distance between each pair of points. This has computational complexity O(n x n)
• For each point we find the k nearest neighbors, where k is a user-configured number
• For each point, we find the average distance to its k nearest neighbors
• We identify data points with a high average distance to their neighbors; outliers will have a high average distance
• We can select the data points above some threshold average distance, or choose the top n by average distance (see the sketch below)
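To make the scoring concrete, here is a minimal, non-Hadoop sketch of the average-distance-to-k-neighbors score. The class and method names, and the use of a precomputed distance matrix, are illustrative assumptions, not code from beymani.

import java.util.Arrays;

public class KnnOutlierScore {

    // dist[i][j] holds the precomputed pairwise distance between points i
    // and j (the O(n x n) step); k is the user-configured neighborhood
    // size, assumed smaller than the number of points
    static double score(double[][] dist, int point, int k) {
        double[] row = dist[point].clone();
        row[point] = Double.MAX_VALUE;   // exclude the self-distance
        Arrays.sort(row);                // nearest neighbors come first
        double sum = 0;
        for (int i = 0; i < k; ++i) {
            sum += row[i];
        }
        return sum / k;                  // high score => likely outlier
    }
}

Points whose score exceeds a threshold, or the top n scores, are then flagged as outliers.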


Page 12: Outlier and fraud detection using Hadoop

Big Data Ecosystem


Credit: http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/

Page 13: Outlier and fraud detection using Hadoop

When to Use Hadoop

Credit: http://www.aaroncordova.com/2012/01/do-i-need-sql-or-hadoop-flowchart.html

Page 14: Outlier and fraud detection using Hadoop

Map Reduce Data Flow

Credit: Yahoo Developer Network

Page 15: Outlier and fraud detection using Hadoop

Hadoop at 30000 ft

• MapReduce – Parallel processing pattern. Functional programming model. Implemented as a framework, with user supplied map and reduce code.

• HDFS – Replicated and partitioned file system. Sequential access only. Writes are append only.

• Data Locality – Code moves to where the data resides and gets executed there.

• IO Bound – Typically IO bound (disk and network)


Page 16: Outlier and fraud detection using Hadoop

Credit Card Transaction

We have a very simple data model. Each credit card transaction contains the following four attributes:

1. Transaction ID
2. Time of the day
3. Money spent
4. Vendor type

Here are some examples. The last one is an outlier, injected into the data set.

YX66AJ9U 1025 20.47 drug store
98ZCM6B1 1910 55.50 restaurant
XXXX7362 0100 1875.40 jewellery store


Page 17: Outlier and fraud detection using Hadoop

Distance Calculation

• For a numerical attribute (e.g., money amount), the distance is the difference in values

• For an unranked categorical attribute (e.g., vendor type), the distance is 0 if the values are the same and 1 otherwise. The distances could also be set softly between 0 and 1 (e.g., product color).

• If the unranked categorical attributes have a hierarchical relationship, the minimum number of edges to traverse from one node to the other can be used as the distance (e.g., a vendor type hierarchy). A sketch of these attribute-level distances follows.
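A small sketch of the attribute-level distances described above; the method names and the normalization choices are illustrative assumptions.

public class AttributeDistance {

    // numerical attribute: difference in values, scaled by the attribute range
    static double numerical(double a, double b, double range) {
        return Math.abs(a - b) / range;
    }

    // unranked categorical attribute: 0 if same, 1 otherwise
    static double categorical(String a, String b) {
        return a.equals(b) ? 0.0 : 1.0;
    }

    // hierarchical categorical attribute: minimum number of edges between
    // the two nodes, normalized by the longest path in the hierarchy
    static double hierarchical(int numEdges, int maxEdges) {
        return (double) numEdges / maxEdges;
    }
}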


Page 18: Outlier and fraud detection using Hadoop

Distance Aggregation

• We aggregate across all the attributes to find the net distance between two entities

• There are different ways to aggregate: Euclidean, Manhattan. Attributes can be weighted during aggregation, indicating their relative importance (see the sketch below).
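As a sketch, the two aggregation schemes with per-attribute weights might look like this; this is one common convention, not necessarily the project's exact formula.

public class DistanceAggregator {

    // weighted Euclidean: square root of the weighted sum of squared distances
    static double euclidean(double[] dist, double[] weight) {
        double sum = 0;
        for (int i = 0; i < dist.length; ++i) {
            sum += weight[i] * dist[i] * dist[i];
        }
        return Math.sqrt(sum);
    }

    // weighted Manhattan: weighted sum of absolute distances
    static double manhattan(double[] dist, double[] weight) {
        double sum = 0;
        for (int i = 0; i < dist.length; ++i) {
            sum += weight[i] * Math.abs(dist[i]);
        }
        return sum;
    }
}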


Page 19: Outlier and fraud detection using Hadoop

Pair Wise Distance Calculation MR

• It’s an O(n x n) problem. If there are 1 million transactions, we need to perform about 1 trillion distance computations.
• The work is divided among the reducers. If we have a 100-node Hadoop cluster with 10 reducer slots per node, each reducer will perform roughly 1 billion distance calculations.
• How do we divide up the work? Use partitioned hashing. If h1 = hash(id1) and h2 = hash(id2), we use a function of h1 and h2 as the key of the mapper output, for example f(h1, h2) = h1 << 10 | h2 (see the toy sketch below).
• All the transactions whose IDs hash to h1 or h2 will end up in the same reducer.
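A toy illustration of the pair-key construction, not the project code (the next slide shows the actual snippet). With b buckets, b must be at most 1024 for the 10-bit shift to pack both bucket numbers into one int key without collision.

public class PairKey {

    // non-negative bucket index in [0, b)
    static int bucket(String id, int b) {
        return (id.hashCode() & Integer.MAX_VALUE) % b;
    }

    // f(h1, h2) = h1 << 10 | h2, ordered so that (id1, id2) and
    // (id2, id1) always map to the same reducer key
    static int key(String id1, String id2, int b) {
        int h1 = bucket(id1, b);
        int h2 = bucket(id2, b);
        return h1 <= h2 ? (h1 << 10 | h2) : (h2 << 10 | h1);
    }
}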


Page 20: Outlier and fraud detection using Hadoop

Partitioned Hashing

• Code snippet from SameTypeSimilarity.java

String partition = partitonOrdinal >= 0 ? items[partitonOrdinal] : "none";
// map this record's ID hash into a bucket in [0, bucketCount)
hash = (items[idOrdinal].hashCode() % bucketCount + bucketCount) / 2;

// emit the record once per bucket it can be paired with, packing the
// ordered bucket pair into the key; the 0/1 prefix marks which side of
// the pair this record belongs to
for (int i = 0; i < bucketCount; ++i) {
    if (i < hash) {
        hashPair = hash * 1000 + i;
        keyHolder.set(partition, hashPair, 0);
        valueHolder.set("0" + value.toString());
    } else {
        hashPair = i * 1000 + hash;
        keyHolder.set(partition, hashPair, 1);
        valueHolder.set("1" + value.toString());
    }
    context.write(keyHolder, valueHolder);
}


Page 21: Outlier and fraud detection using Hadoop

Output of Distance MR

• The output has 3 fields: the first transaction ID, the second transaction ID, and the distance between them

6JHQ79UA JSXNUV9R 5
6JHQ79UA Y1AWCM5P 89
6JHQ79UA UFS5ZM0K 172


Page 22: Outlier and fraud detection using Hadoop

Nearest Neighbor MR

• Next we need to find the k nearest neighbors of each data point. We essentially need the neighbors of a data point sorted by distance.
• We use a technique called secondary sorting: we tag extra data onto the key, which forces the keys to be sorted by the tagged data as the mapper emits its key and value.
• Going back to the output of the previous MR, this is how the mapper of this MR emits key and value (a mapper sketch follows the example):

key -> (6JHQ79UA, 5)   value -> (JSXNUV9R, 5)
key -> (JSXNUV9R, 5)   value -> (6JHQ79UA, 5)
key -> (6JHQ79UA, 89)  value -> (Y1AWCM5P, 89)
key -> (Y1AWCM5P, 89)  value -> (6JHQ79UA, 89)
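A minimal sketch of such a mapper, reading the "id1 id2 distance" output of the distance MR. The class and field names are assumptions; the distance is zero-padded into a plain Text key here only to keep the sketch short, whereas the real job would use a composite WritableComparable key (see the secondary sorting slide).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NeighborMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outVal = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] items = line.toString().split("\\s+");
        String id1 = items[0];
        String id2 = items[1];
        // zero-pad so lexicographic key order matches numeric distance order
        String dist = String.format("%06d", Integer.parseInt(items[2]));
        // emit symmetrically so each transaction collects all its neighbors
        outKey.set(id1 + "," + dist);
        outVal.set(id2 + "," + dist);
        context.write(outKey, outVal);
        outKey.set(id2 + "," + dist);
        outVal.set(id1 + "," + dist);
        context.write(outKey, outVal);
    }
}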


Page 23: Outlier and fraud detection using Hadoop

Nearest Neighbor MR (contd)

• On the reducer side, when the reducer gets invoked, we get a transaction ID as the key and a list of (neighboring transaction ID, distance) pairs as the value.
• In the reducer, we iterate through the values, take the average distance, and emit the transaction ID and average distance as output. We could also use the median. Sample output (a reducer sketch follows the next bullet):

1IKVOMZE 5
1JI0A0UE 173
1KWBJ4W3 278
...........
XXXX7362 538

• As expected, we find that the outlier we injected into the dataset has a very large average distance to its neighbors.
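A matching reducer sketch, assuming the partitioner and grouping comparator of the next slide route and group on the transaction ID prefix so that each invocation sees one transaction's neighbors sorted nearest-first; K would normally come from the job configuration.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NeighborReducer extends Reducer<Text, Text, Text, Text> {
    private static final int K = 10;   // assumed neighborhood size

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        int count = 0;
        // values arrive sorted by distance, so the first K are the k nearest
        for (Text val : values) {
            sum += Long.parseLong(val.toString().split(",")[1]);
            if (++count == K) {
                break;
            }
        }
        if (count > 0) {
            String id = key.toString().split(",")[0];
            context.write(new Text(id), new Text(String.valueOf(sum / count)));
        }
    }
}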


Page 24: Outlier and fraud detection using Hadoop

Secondary Sorting

• Define the reducer partitioner using the base part of the key (the transaction ID), ensuring that all values for a key are routed to the same reducer

• Define the grouping comparator using the base part of the key, ensuring that all the values for a transaction ID are passed in the same reducer invocation

• Sorting is based on both parts of the key, i.e., the transaction ID and the distance. A sketch of this plumbing follows.
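A minimal sketch of that plumbing, assuming an illustrative composite key class (IdDistance) rather than the project's actual classes.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// composite key: transaction ID (base part) plus distance (sort part)
class IdDistance implements WritableComparable<IdDistance> {
    String id;
    int distance;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeInt(distance);
    }

    public void readFields(DataInput in) throws IOException {
        id = in.readUTF();
        distance = in.readInt();
    }

    // sort on both parts: ID first, then distance ascending, so each
    // reducer sees a transaction's neighbors nearest-first
    public int compareTo(IdDistance other) {
        int cmp = id.compareTo(other.id);
        return cmp != 0 ? cmp : Integer.compare(distance, other.distance);
    }
}

// partition on the base part only, so all (id, *) keys reach one reducer
class IdPartitioner extends Partitioner<IdDistance, Object> {
    public int getPartition(IdDistance key, Object value, int numPartitions) {
        return (key.id.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// group on the base part only, so one reduce() call gets all the neighbors
class IdGroupingComparator extends WritableComparator {
    protected IdGroupingComparator() {
        super(IdDistance.class, true);
    }

    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((IdDistance) a).id.compareTo(((IdDistance) b).id);
    }
}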


Page 25: Outlier and fraud detection using Hadoop

How to Choose k

• Values of k that are too high or too low will cause large error, a.k.a. the bias-variance trade-off

• Small k -> low bias error -> high variance error

• Large k -> low variance error -> high bias error

• Find the optimum k by experimenting with different values


Page 26: Outlier and fraud detection using Hadoop

Segmentation

• In reality, the data might be segmented or clustered first and the outlier detection process then run on the relevant cluster

• What is normal in one segment may be an outlier in another


Page 27: Outlier and fraud detection using Hadoop

Fraud or Emerging Normal Behavior

• We have been able to detect the outlier. But how do we know whether it’s a fraudulent transaction or an emerging buying pattern?
• Your credit card may have been compromised and someone else may be using it. Or you may have fallen in love and decided to shower him or her with expensive, high-priced items.
• We can’t really tell the difference, except that once there are enough data points for this emerging behavior, we will stop getting these false positives from our analysis


Page 28: Outlier and fraud detection using Hadoop

Thank You

Q & A
[email protected]

Big Data Consultant
