outlier detection using k-nearest neighbour graph ville hautamäki, ismo kärkkäinen and pasi...

Post on 28-Dec-2015

217 Views

Category:

Documents

2 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Outlier Detection Using k-Nearest Neighbour Graph

Ville Hautamäki, Ismo Kärkkäinen and Pasi Fränti

Department of Computer ScienceUniversity of Joensuu, Finland

What is an outlier?

Outlier is an observation that deviates too much from other observations so that it arouses suspicions that it was generated by a different mechanism.

Outliers

Motivation

Classical problem: Noise in measurements needs to be removed because of the detrimental effect for statistical inference. For example making clustering more robust.

Data mining: Deviating or surprising measurement is interesting. In this case we want to detect outliers.

Typical applications: Breast cancer detection Intrusion detection in network security systems Detecting credit card fraud

Taxonomy of outlier detection

Distribution-based methods: A classical statistical approach. If distribution is known, then define an discordance test for that distribution.

Clustering-based methods: Define observation as an outlier if it does not fit to the overall clustering pattern.

Distance-based methods: Define an observation as an outlier if p% of samples are dmin distance away from it.

Density-based methods: Observation is defined as an outlier if the local density is low.

K-Nearest Neighbour Graph

Definition: Every vector in the data set forms one node and every node has pointers to its k nearest neighbours.

KDIST: Density-based outlier detection

[S. Ramaswamy et al., SIGMOD, Texas, 2000.]

Define k Nearest Neighbour distance (KDIST) as the distance to the kth nearest vector.

Vectors are sorted by their KDIST distance. The last n vectors in the list are defined as outliers.

Intuitive idea is that when KDIST is large, vector is in sparse area and is likely to be an outlier.

Problem: user has to know in advance how many outliers he has in the data set.

MeanDIST: proposed Improvement

We define MeanDIST as the mean of k nearest distances.User supplied parameters of the algorithms (cutting point k in KDIST, local threshold t in MeanDIST).

MkNN: Mutual k-Nearest Neighbour

[M.R. Brito et al. Statistics & Probability Letters, 35(1):33-42, August, 1997.]

Mutual k-Nearest Neighbour (MkNN) uses a special case of kNN Graph: there exists an undirected edge between two vectors a and b if they belong to each others k-nearest neighborhood.

Connected components are designated as clusters

Isolated vectors as outliers.

ODIN: Outlier Detection using Indegree Number (proposed)

Definition: Given kNN Graph G for dataset S, outlier is a vertex, that has indegree less than or equal to T.

Test data sets

We used Receiver Operating Characteristics (ROC):• False Rejection (FR)• False Acceptance (FA).

Half Total Error Rate:(HTER) = (FR + FA)/ 2.

Summary of results

HTER error rates (k, threshold):

ODIN achieves zero error for HR, NHL1 and NHL2.

MeanDIST is better for larger synthetic set.

For KDD, none of the methods work well.

KDIST vs. MeanDISTShould we use the kNN distance as was done in the original paper or the mean of distances as proposed?

MeanDIST is more robust as a function of k for synthetic data.

Parameter Stability

Value of k is not important as long as threshold is below 0.1.

A clear valley in error surface between threshold values 20-50.

Minimum error

Minimum error

MeanDIST for KDD dataset: ODIN for S1 dataset:

Conclusions

Proposed method (ODIN) compares well with other similar methods.

Proposed improvement to RSS method (MeanDIST) outperforms KDIST in one case. In other cases, improvement is not significant.

In the case of small number of observations, density estimation methods perform poorly.

Future Work

Improve the performance of ODIN by integrating ideas from density estimation schemes.

Test with more real world datasets.

Consider validity of other proximity graphs for outlier detection task, especially need to study applicability of Adaptive Neighbourhood Graph.

Work to improve parameter robustness in proposed algorithms.

top related