A Vertical Outlier Detection Algorithm with Clusters as by-product Dongmei Ren, Imad Rahal, and William Perrizo Computer Science and Operations Research North Dakota State University


Page 1: A Vertical Outlier Detection Algorithm with Clusters as by-product Dongmei Ren, Imad Rahal, and William Perrizo Computer Science and Operations Research

A Vertical Outlier Detection Algorithm with Clusters as by-product

Dongmei Ren, Imad Rahal, and William Perrizo

Computer Science and Operations Research North Dakota State University

Page 2

Outline

Background Related work The Proposed Work

Contributions of this Paper Review of the P-Tree technology Approach

Conclusion

Page 3

Background

An outlier is something that deviates from standard behavior; detecting it can reveal anomalies of interest.

Outlier detection is critically important in information-based areas such as: criminal activities in electronic commerce, intrusion in networks, unusual cases in health monitoring, and pest infestations in agriculture.

Page 4

Background (cont’d): Related Work

Su et al. (2001) proposed the initial work on cluster-based outlier detection: small clusters are considered outliers.

He et al. (2003) introduced two new definitions, the cluster-based local outlier and the outlier factor, and proposed an outlier detection algorithm, CBLOF (Cluster-Based Local Outlier Factor). Outlier detection is tightly coupled with the clustering process (faster than Su’s approach). However, the clustering process takes precedence over the outlier-detection process, so the method is not highly efficient.

Page 5

The Proposed Work

Goal: improve the efficiency and scalability (with respect to data size) of the cluster-based outlier detection process, using density-based clustering.

Our contributions:

Local Connective Factor (LCF): used to measure the membership of data points in clusters.

An LCF-based outlier detection method that can efficiently detect outliers and group data into clusters in a one-time process, and does not require the beforehand clustering process that is the first step in contemporary cluster-based outlier detection methods.

Page 6

The Proposed Work (cont’d)

A vertical data representation, the P-tree. Performance is further improved by using this vertical data representation.

Page 7

Review of P-Trees

Traditionally, data have been represented horizontally and processed vertically (i.e., row by row). This is great for query processing, but not as good when one is interested in collective data properties, and it scales poorly with very large datasets.

Our previous work introduced a vertical data structure, the P-Tree (Ding et al., 2001):

Decompose relational tables into separate vertical bit slices by bit position.

Compress (when possible) each of the vertical slices using a P-tree.

Due to the compression and the scalable logical P-Tree operations, this vertical data structure addresses the problem of non-scalability with respect to data size.
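The decomposition step above can be sketched in Python, using plain integers as uncompressed stand-ins for P-trees (the function name and toy data are ours, for illustration only; a real P-tree would additionally compress each slice):

```python
def bit_slices(values, m):
    """Decompose a list of m-bit integers into m vertical bit slices.

    Slice j is a bitmask over tuples: bit t of slice j is 1 iff
    bit j of values[t] is 1. An uncompressed stand-in for a P-tree.
    """
    slices = [0] * m
    for t, v in enumerate(values):
        for j in range(m):
            if (v >> j) & 1:
                slices[j] |= 1 << t
    return slices

data = [5, 3, 7, 2]          # four 3-bit values
P = bit_slices(data, 3)      # P[0] is the least significant slice
# ANDing slices answers collective questions vertically:
# tuples with both bit 2 and bit 0 set, i.e. the values 5 and 7
both = P[2] & P[0]           # bitmask over tuples 0 and 2
```

The point of the vertical layout is that AND/OR/NOT over whole columns replaces row-by-row scans.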

Page 8

Review of P-Trees (cont’d)

[Figure: construction of P1-trees from a sample dataset and its 3-bit vertical slices; tree nodes are labeled Pure-1, Pure-0, or Mixed. The slide also illustrates the AND, OR, and NOT operations on P-trees.]

Page 9

Review of P-Trees (cont’d)

Value P-tree: for a relation R(A1, A2, …, An), where each attribute Ai is m bits wide, Pi,j represents all tuples having a 1 in bit j of Ai. For example, P(Ai = 101) = P1 & P’2 & P3, complementing the slice wherever the corresponding bit of the value is 0.

Tuple P-tree: P(111, 101, …) = P(A1 = 111) & P(A2 = 101) & …
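A minimal sketch of the value P-tree computation, again with integer bitmasks as uncompressed P-trees (helper names are ours):

```python
def bit_slices(values, m):
    """m vertical bit slices of a column; bit t of slice j mirrors bit j of tuple t."""
    slices = [0] * m
    for t, v in enumerate(values):
        for j in range(m):
            if (v >> j) & 1:
                slices[j] |= 1 << t
    return slices

def value_ptree(slices, v, n):
    """P(A = v): AND each slice, complemented where the matching bit of v is 0."""
    full = (1 << n) - 1               # all-tuples mask, used for complements
    mask = full
    for j, s in enumerate(slices):
        mask &= s if (v >> j) & 1 else (s ^ full)
    return mask

col = [5, 3, 5, 2]
P = bit_slices(col, 3)
eq5 = value_ptree(P, 5, len(col))    # tuples 0 and 2 hold the value 101
```

A tuple P-tree is then just the AND of such value P-trees across the attributes.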

Page 10

Review of P-Trees (cont’d)

The inequality P-tree (Pan, 2003) represents the data points within a data set D satisfying an inequality predicate, such as x ≥ v or x ≤ v on an attribute.

Calculation of Px≥v: let x be an m-bit attribute within a data set D, let Pm-1, …, P0 be the P-trees for the vertical bit slices of x, and let v = bm-1…bi…b0, where bi is the ith binary bit of v. Then Px≥v, the predicate tree for the predicate x ≥ v, is

Px≥v = Pm-1 opm-1 … Pi opi Pi-1 … op1 P0, i = 0, 1, …, m-1,

where opi is AND if bi = 1 and OR otherwise, and the operators are right binding. For example, the inequality tree Px≥101 = P2 AND (P1 OR P0).
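The right-binding fold can be sketched as below (bitmasks stand in for P-trees). One implementation note of ours, not stated on this slide: trailing zero bits of v cannot affect the comparison, so the fold starts at the rightmost 1 bit of v, mirroring the role of k in the Px≤v formula:

```python
def bit_slices(values, m):
    slices = [0] * m
    for t, v in enumerate(values):
        for j in range(m):
            if (v >> j) & 1:
                slices[j] |= 1 << t
    return slices

def p_ge(slices, v, m, n):
    """Bitmask of tuples with attribute value >= v, folded right to left:
    AND where the bit of v is 1, OR where it is 0."""
    full = (1 << n) - 1
    if v == 0:
        return full                  # every value is >= 0
    k = (v & -v).bit_length() - 1    # rightmost set bit of v
    acc = slices[k]
    for i in range(k + 1, m):
        if (v >> i) & 1:
            acc &= slices[i]         # op_i = AND when b_i = 1
        else:
            acc |= slices[i]         # op_i = OR  when b_i = 0
    return acc

col = [2, 5, 7, 1, 4]
P = bit_slices(col, 3)
ge5 = p_ge(P, 5, 3, len(col))        # P2 AND (P1 OR P0): tuples holding 5 and 7
```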

Page 11

Review of P-Trees (cont’d)

Calculation of Px≤v:

Px≤v = P’m-1 opm-1 … P’i opi P’i-1 … opk+1 P’k, k ≤ i ≤ m-1,

where k is the rightmost bit position with value 0 (bk = 0, bj = 1 for all j < k), P’ denotes the complement P-tree, and, dually to the Px≥v case, opi is OR if bi = 1 and AND otherwise, with the operators right binding.
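A sketch of the dual fold, assuming the operator mapping stated above (bitmasks stand in for P-trees; helper names are ours):

```python
def bit_slices(values, m):
    slices = [0] * m
    for t, v in enumerate(values):
        for j in range(m):
            if (v >> j) & 1:
                slices[j] |= 1 << t
    return slices

def p_le(slices, v, m, n):
    """Bitmask of tuples with value <= v, using complemented slices:
    OR where the bit of v is 1, AND where it is 0. The fold starts at k,
    the rightmost 0 bit of v below which all bits of v are 1."""
    full = (1 << n) - 1
    comp = [s ^ full for s in slices]   # the P' slices
    k = 0
    while k < m and (v >> k) & 1:       # skip trailing 1 bits of v
        k += 1
    if k == m:
        return full                     # v is all ones: every value <= v
    acc = comp[k]
    for i in range(k + 1, m):
        if (v >> i) & 1:
            acc |= comp[i]              # op_i = OR  when b_i = 1
        else:
            acc &= comp[i]              # op_i = AND when b_i = 0
    return acc

col = [2, 5, 7, 1, 4]
P = bit_slices(col, 3)
le4 = p_le(P, 4, 3, len(col))           # tuples holding 2, 1, 4
```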

Page 12

Review of P-Trees (cont’d)

High Order Bit (HOBit) Distance Metric: a bitwise distance function that measures distance based on the most significant consecutive matching bit positions, starting from the left.

Assume Ai is an attribute in a data set, and let X and Y be the Ai values of two tuples. The HOBit distance between X and Y is defined by

m(X, Y) = max {i + 1 | xi ⊕ yi = 1},

where xi and yi are the ith bits of X and Y respectively, and ⊕ denotes XOR (exclusive OR).

E.g., for X = 1101 0011 and Y = 1100 1001, m = 5.

Correspondingly, the HOBit similarity is defined by dm(X, Y) = BitNum - max {i + 1 | xi ⊕ yi = 1}.
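Since max {i + 1 | xi ⊕ yi = 1} is exactly the bit length of X XOR Y, the metric is a one-liner (function names are ours):

```python
def hobit_distance(x, y):
    """HOBit distance m(X, Y): 1-based position of the most significant
    differing bit, i.e. the bit length of x XOR y; 0 when x == y."""
    return (x ^ y).bit_length()

def hobit_similarity(x, y, bit_num):
    """HOBit similarity: BitNum minus the HOBit distance."""
    return bit_num - hobit_distance(x, y)

m = hobit_distance(0b11010011, 0b11001001)   # the slide's example: m = 5
```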

Page 13

Definitions

Definition 1: Disk Neighborhood --- DiskNbr(x, r). Given a point x and radius r, the disk neighborhood of x is defined as the set DiskNbr(x, r) = {y ∈ X | d(x, y) ≤ r}, where d(x, y) is the distance between x and y. It covers the direct and indirect neighbors of x within distance r.

Definition 2: Density of DiskNbr(x, r) --- DensDiskNbr(x, r) (Breunig, 2000, density-based):

DensDiskNbr(x, r) = |DiskNbr(x, r)| / r

Definition 3: Density of a cluster --- Denscluster(R), defined as the total number of points in the cluster divided by the radius R of the cluster:

Denscluster(R) = |cluster(R)| / R

[Figure: a point x with its direct disk neighborhood and its indirect neighbors.]

Page 14

Definitions (cont’d)

Definition 4: Local Connective Factor (LCF), with respect to a DiskNbr(x, r) and the closest cluster (with radius R) to x. The LCF of the point x, denoted LCF(x, r), is the ratio of DensDiskNbr(x, r) over Denscluster(R):

LCF(x, r) = DensDiskNbr(x, r) / Denscluster(R)

LCF indicates to what degree point x is connected with the closest cluster to x.
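Definitions 2-4 translate directly into code (a sketch; the counts are passed in rather than computed from data):

```python
def dens_disk_nbr(nbr_count, r):
    """Definition 2: |DiskNbr(x, r)| / r."""
    return nbr_count / r

def dens_cluster(cluster_size, R):
    """Definition 3: |cluster(R)| / R."""
    return cluster_size / R

def lcf(nbr_count, r, cluster_size, R):
    """Definition 4: LCF(x, r) = Dens_DiskNbr(x, r) / Dens_cluster(R)."""
    return dens_disk_nbr(nbr_count, r) / dens_cluster(cluster_size, R)

# A point whose local density is half the closest cluster's density:
v = lcf(10, 2, 100, 10)   # (10/2) / (100/10) = 0.5
```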

Page 15

The Outlier Detection Method

Given a dataset X, the proposed outlier detection method proceeds in two phases:

Neighborhood Merging: groups non-outliers (points with consistent density) into clusters.

LCF-based Outlier Detection: finds outliers over the remaining subset of the data, which consists of points on cluster boundaries and real outliers.

[Figure: starting from a point x, the neighborhood is expanded over radii r, 2r, 4r, 8r during neighborhood merging before outliers are found.]

Page 16

Neighborhood Merging

The user chooses an r. The process picks an arbitrary point x, calculates DensDiskNbr(x, r), increases the radius from r to 2r, calculates DensDiskNbr(x, 2r), and observes the ratio between the two.

If the ratio is in the range [1/(1+ε), (1+ε)] (Breunig, 2000), the expansion and the merging continue by increasing the radius to 4r, 8r, ...

If the ratio falls outside the range, the expansion stops, and point x and its k·r-neighbors are merged into one cluster.

All points in the 2r-neighborhood are grouped together. The process then calls the “LCF-based outlier detection” procedure and mines outliers over the set of points in {DiskNbr(x, 4r) - DiskNbr(x, 2r)}.

[Figure: merging non-outlier points around a start point x over radii r, 2r, 4r.]
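The doubling loop above can be sketched as follows; `count_in_disk` is an assumed callback returning |DiskNbr(x, radius)|, and the 1-D toy data are illustrative:

```python
def expand_radius(count_in_disk, x, r, eps):
    """Keep doubling the radius while the density ratio between the
    doubled and the current neighborhood stays in [1/(1+eps), 1+eps];
    return the last radius whose neighborhood was merged."""
    radius = r
    dens = count_in_disk(x, radius) / radius
    while True:
        new_dens = count_in_disk(x, 2 * radius) / (2 * radius)
        ratio = new_dens / dens
        if not (1 / (1 + eps) <= ratio <= 1 + eps):
            return radius            # density changed too much: stop here
        radius, dens = 2 * radius, new_dens

# 1-D toy data: 101 evenly spaced points on [0, 100].
points = list(range(101))
count = lambda x, rad: sum(abs(p - x) <= rad for p in points)
merged_up_to = expand_radius(count, 50, 5, 0.2)
# the density stays flat until the disk outgrows the data, so the
# expansion proceeds 5 -> 10 -> 20 -> 40 and stops there
```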

Page 17

Neighborhood Merging using P-Trees

The ξ-neighbors of x are the neighbors within a HOBit distance of ξ bits from x, e.g., ξ = 0, 1, 2, ..., 8 if x is an 8-bit value.

For a point x, let X = (x0, x1, ..., xn-1) with xi = (xi,m-1, ..., xi,0), where xi,j is the jth bit value of the ith attribute. The ξ-neighbors of xi are given by the value P-tree Pxi built on the most significant m-ξ bits of xi; the ξ-neighbors of X are given by the tuple P-tree obtained by ANDing these Pxi over all attributes xi.

The neighbors are merged into one cluster by PC = PC OR PXξ, where PC is a P-tree representing the currently processed cluster, and PXξ is the P-tree representing the ξ-neighbors of x.
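With tuple sets held as bitmasks, the merge is a single OR and the cluster size is a population count (toy masks, illustrative only):

```python
def root_count(mask):
    """RootCount: number of 1 bits, i.e. the number of tuples in the set."""
    return bin(mask).count("1")

pc = 0b0011        # current cluster PC: tuples 0 and 1
px_xi = 0b0110     # PX_xi, the xi-neighbors of x: tuples 1 and 2
pc |= px_xi        # PC = PC OR PX_xi: the cluster now holds tuples 0, 1, 2
```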

Page 18

“LCF-based Outlier Detection”

Search for all the direct neighbors of x. For each direct neighbor, get its direct neighbors (these are the indirect neighbors of x). Calculate the LCF with respect to the current cluster for the resulting neighborhood.

For the points in {DiskNbr(x, 4r) - DiskNbr(x, 2r)}, the “LCF-based outlier detection” process finds the outlier points and starts a new cluster if necessary.

[Figure: starting from a point x, neighborhoods at r, 2r, 4r, 8r are either merged into the current cluster, started as a new cluster, or flagged as outliers.]

Page 19

“LCF-based Outlier Detection” (cont’d)

If 1/(1+ε) ≤ LCF ≤ (1+ε): x and its neighbors are merged into the current cluster.

If LCF > (1+ε): point x is in a new cluster with higher density; start a new cluster and call the neighborhood-merging procedure.

If LCF < 1/(1+ε): point x and its neighbors can either be in a cluster with low density or be outliers. Get the indirect neighbors of x recursively. If the number of all neighbors is larger than some threshold t (Papadimitriou, 2003; t = 20), call neighborhood merging for another cluster. If fewer than t neighbors are found, we identify that small number of points as outliers.
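The three-way decision can be sketched directly (labels are ours):

```python
def lcf_decision(lcf_value, eps):
    """Three-way test on LCF against the band [1/(1+eps), 1+eps]."""
    lo, hi = 1 / (1 + eps), 1 + eps
    if lo <= lcf_value <= hi:
        return "merge into current cluster"
    if lcf_value > hi:
        return "start new cluster"
    return "low density: outlier candidate"

# With eps = 0.8 the band is roughly [0.56, 1.8]:
d = lcf_decision(1.0, 0.8)
```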

Page 20

“LCF-based Outlier Detection” using P-Trees

The direct neighbors of x are represented by a P-tree, DT-Pxr. Let X = (x0, x1, ..., xn-1) with xi = (xi,m-1, ..., xi,0), where xi,j is the jth bit value of the ith attribute.

For attribute xi: Pxir = P(xi value > xi - r) & P(xi value < xi + r).

The r-neighbors of x are then given by the P-tree DT-Pxr = AND of Pxir over all attributes xi, and

|DiskNbr(x, r)| = RootCount(DT-Pxr).
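DT-Pxr is the AND, over attributes, of a per-attribute range mask. The uncompressed sketch below builds each range mask directly rather than from two inequality P-trees, and uses a closed interval for simplicity (names and toy data are ours):

```python
def range_mask(column, lo, hi):
    """Stand-in for the per-attribute range P-tree: tuples with lo <= value <= hi."""
    mask = 0
    for t, v in enumerate(column):
        if lo <= v <= hi:
            mask |= 1 << t
    return mask

def dt_px(columns, x, r):
    """DT-Px^r: AND the per-attribute range masks; the root count of the
    result is |DiskNbr(x, r)| under this box-shaped neighborhood."""
    n = len(columns[0])
    mask = (1 << n) - 1
    for col, xi in zip(columns, x):
        mask &= range_mask(col, xi - r, xi + r)
    return mask

cols = [[1, 2, 8, 3],     # attribute 0 over 4 tuples
        [1, 2, 8, 2]]     # attribute 1
nbrs = dt_px(cols, (2, 2), 1)        # tuples 0, 1, 3 fall in the box
count = bin(nbrs).count("1")         # |DiskNbr(x, r)| = 3
```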

Page 21

“LCF-based Outlier Detection” using P-Trees (cont’d)

The indirect neighbors of x are represented by a P-tree, IN-Pxr:

IN-Pxr = (OR over q ∈ DiskNbr(x, r) of DT-Pqr) AND (NOT DT-Pxr),

where NOT is the complement operation of the P-Tree.

Outliers are inserted into the outlier set by an OR operation: when LCF < 1/(1+ε) and fewer than t = 20 neighbors are found,

POls = POls OR DT-Pxr OR IN-Pxr,

where POls is the outlier set represented by a P-Tree.
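Both formulas are plain bitmask algebra in the uncompressed sketch (the toy masks and names are ours):

```python
def indirect_neighbors(direct_of, x_mask, nbr_indices, n):
    """IN-Px^r = (OR of DT-Pq^r over q in DiskNbr(x, r)) AND NOT DT-Px^r."""
    full = (1 << n) - 1
    acc = 0
    for q in nbr_indices:
        acc |= direct_of[q]
    return acc & (x_mask ^ full)     # NOT via XOR against the all-tuples mask

# Toy masks over 4 tuples: x's direct neighbors are tuples 0 and 1;
# tuple 1's own direct neighborhood also reaches tuple 2.
direct_of = {0: 0b0011, 1: 0b0111}
in_px = indirect_neighbors(direct_of, 0b0011, [0, 1], 4)   # tuple 2 only

p_ols = 0
p_ols |= 0b0011 | in_px   # POls = POls OR DT-Px^r OR IN-Px^r
```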

Page 22

Preliminary Experimental Study

We compare against:

MST (2001): Su et al.’s two-phase clustering-based outlier detection algorithm, denoted MST; the first approach to perform cluster-based outlier detection.

CBLOF (2003): He et al.’s CBLOF (cluster-based local outlier factor) method, which is faster.

Dataset: NHL data set (1996). We compare run time and scalability.

Page 23

Preliminary Experimental Study (Cont’d)

Run time comparison (seconds) by data size:

Data size    256     1024    4096    16384    65536
MST          5.89    10.9    98.03   652.92   2501.43
CBLOF        0.13    1.1     15.33   87.34    385.39
LCF          0.55    2.12    7.98    28.63    71.91

Our LCF method shows the best scalability among the three algorithms, and the outlier sets found were largely the same.

Page 24

Conclusion

An outlier detection method with clusters as by-product: it efficiently detects outliers and groups data into clusters in a one-time process, and does not require the beforehand clustering process, which is the first step in current cluster-based outlier detection methods; eliminating the pre-clustering makes the outlier-detection process faster.

A vertical data representation, the P-tree: the performance of our method is further improved by using this vertical data representation.

Parameter tuning is really important: ε, r, t.

Future directions: study more parameter-tuning effects; better quality of clusters (e.g., a gamma measure); boundary-point testing.

Page 25

References

1. V. Barnett and T. Lewis, "Outliers in Statistical Data", John Wiley & Sons.
2. E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation", Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining, 1997, pp. 219-222.
3. E. M. Knorr and R. T. Ng, "Algorithms for Mining Distance-Based Outliers in Large Datasets", Proc. VLDB, 1998, pp. 24-27.
4. E. M. Knorr and R. T. Ng, "Finding Intensional Knowledge of Distance-Based Outliers", Proc. VLDB, 1999, pp. 211-222.
5. S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets", Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, ISSN 0163-5808.
6. M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying Density-Based Local Outliers", Proc. ACM SIGMOD Int. Conf. on Management of Data, Dallas, TX, 2000.
7. S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, "LOCI: Fast Outlier Detection Using the Local Correlation Integral", Proc. 19th Int. Conf. on Data Engineering, March 5-8, 2003, Bangalore, India.
8. M. F. Jiang, S. S. Tseng, and C. M. Su, "Two-Phase Clustering Process for Outliers Detection", Pattern Recognition Letters, Vol. 22, No. 6-7, pp. 691-700.
9. Z. He, X. Xu, and S. Deng, "Discovering Cluster-Based Local Outliers", Pattern Recognition Letters, Vol. 24, Issues 9-10, June 2003, pp. 1641-1650.
10. Z. He, X. Xu, and S. Deng, "Squeezer: An Efficient Algorithm for Clustering Categorical Data", Journal of Computer Science and Technology, 2002.
11. A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, 31(3):264-323, 1999.
12. A. Arning, R. Agrawal, and P. Raghavan, "A Linear Method for Deviation Detection in Large Databases", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 164-169.
13. S. Sarawagi, R. Agrawal, and N. Megiddo, "Discovery-Driven Exploration of OLAP Data Cubes", EDBT 1998.
14. Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree Algebra", Proc. ACM Symposium on Applied Computing (SAC), 2002.
15. W. Perrizo, "Peano Count Tree Technology", Technical Report NDSU-CSOR-TR-01-1, 2001.
16. M. Khan, Q. Ding, and W. Perrizo, "k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees", Proc. PAKDD 2002, Springer-Verlag LNAI 2776, 2002.
17. B. Wang, F. Pan, Y. Cui, and W. Perrizo, "Efficient Quantitative Frequent Pattern Mining Using Predicate Trees", CAINE 2003.
18. F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, and W. Perrizo, "Efficient Density Clustering for Spatial Data", PKDD 2003.
19. J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers.

Page 26

Thank you!

Page 27

Determination of Parameters

Determination of r: Breunig et al. show that choosing MinPts = 10-30 works well in general [6] (the MinPts-neighborhood). Choosing MinPts = 20, we get the average radius of the 20-neighborhood, r_average. In our algorithm, r = r_average = 0.5.

Determination of ε: the selection of ε is a tradeoff between accuracy and speed. The larger ε is, the faster the algorithm works; the smaller ε is, the more accurate the results are. We chose ε = 0.8 experimentally and obtained the same result (same outliers) as Breunig's, but much faster. The results shown in the experimental section are based on ε = 0.8.

We coded all the methods. For the running environment, see the paper.