class outlier mining
DESCRIPTION
Class Outlier Mining: Distance Based Approach
TRANSCRIPT
![Page 1: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/1.jpg)
Class Outlier Mining: Distance‐Based Approach
By Nabil M. Hewahi and Motaz K. Saad
Presented by Motaz K. Saad
2008
International Journal of Intelligent Technology, Vol. 2, No. 1, pp 55-68, 2007
![Page 2: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/2.jpg)
Abstract
• In large datasets, identifying exceptional or rare cases (unusual patterns) with respect to a group of similar cases is a significant problem.
• The traditional problem (outlier mining) is to find exceptional or rare cases in a dataset irrespective of their class labels; such cases are considered rare events with respect to the whole dataset.
2
![Page 3: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/3.jpg)
Abstract (Cont.)
• Present an overview of Class Outlier.
• Introduce a novel definition of a class outlier and propose COF factor.
• Propose a new algorithm for class outlier mining.
• Present experimental results.
• Perform a comparison study.
3
![Page 4: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/4.jpg)
Outlier Definition
• An Outlier is a data object that does not comply with the general behavior of the data (unusual pattern)
• It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis.
4
![Page 5: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/5.jpg)
Outlier Mining
• It is the problem of detecting rare events, deviant objects, and exceptions.
• It is an important data mining issue in knowledge discovery and has attracted increasing interest in recent years.
5
![Page 6: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/6.jpg)
Outliers
[Figure: example data with the outliers labelled]
6
![Page 7: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/7.jpg)
Outlier Mining: Business Applications
• Medical
• Education
• Fraud detection
• Credit approval
• Stock market analysis
• Identifying computer network intrusions
• Data cleaning
• Surveillance and auditing
• Health monitoring systems
• Insurance, banking, money laundering, telecommunication, etc.
7
![Page 8: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/8.jpg)
Outlier Detection Methods
• Statistical based (Distribution based)
• Clustering
• Distance‐based
– K Nearest Neighbors (KNN)
– Density‐Based
• Model‐Based (Neural Network): Replicator Neural Network RNN
8
![Page 9: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/9.jpg)
Statistical (Distribution) based Outlier Detection Method
9
![Page 10: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/10.jpg)
Variation of Distance‐Based Approach for detecting Outlier
10
![Page 11: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/11.jpg)
NN Model for Outlier Detection
11
A schematic view of a fully connected Replicator Neural Network.
![Page 12: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/12.jpg)
What is Class Outlier?
• None of the previous definitions of outliers considers the class labels of the data set.
• This means that all the previous outlier mining methods operate on the overall data set without looking closely at each class label separately.
12
![Page 13: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/13.jpg)
Class Outlier vs. Outlier
13
[Figure: class outliers contrasted with a conventional outlier across classes; labels: Class Outliers, Outlier, Class]
![Page 14: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/14.jpg)
Class Outlier Example in heart‐statlog dataset
14
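| Inst. # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69 | 47 | 1 | 3 | 108 | 243 | 0 | 0 | 152 | 0 | 0 | 1 | 0 | 3 | present |
| 62 | 44 | 1 | 3 | 120 | 226 | 0 | 0 | 169 | 0 | 0 | 1 | 0 | 3 | absent |
| 150 | 41 | 1 | 3 | 112 | 250 | 0 | 0 | 179 | 0 | 0 | 1 | 0 | 3 | absent |
| 179 | 50 | 1 | 3 | 129 | 196 | 0 | 0 | 163 | 0 | 0 | 1 | 0 | 3 | absent |
| 38 | 42 | 1 | 3 | 130 | 180 | 0 | 0 | 150 | 0 | 0 | 1 | 0 | 3 | absent |
| 253 | 51 | 1 | 3 | 110 | 175 | 0 | 0 | 123 | 0 | .6 | 1 | 0 | 3 | absent |
| 23 | 47 | 1 | 3 | 112 | 204 | 0 | 0 | 143 | 0 | .1 | 1 | 0 | 3 | absent |

(Columns 1 to 13 are the attribute numbers.)
Class outlier mining takes the class labels of the dataset into consideration.
Class outlier: instance 69, the only "present" case among near-identical "absent" cases.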
![Page 15: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/15.jpg)
Class Outlier Example in house‐vote‐84 dataset
15
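| Inst. # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 407 | n | n | n | y | y | y | n | n | n | n | y | y | y | y | n | n | democrat |
| 306 | n | n | n | y | y | y | n | n | n | n | n | y | y | y | n | n | republican |
| 83 | n | n | n | y | y | y | n | n | n | n | n | y | y | y | n | n | republican |
| 87 | n | n | n | y | y | y | n | n | n | n | n | y | y | y | n | n | republican |
| 303 | n | n | n | y | y | y | n | n | n | n | n | y | y | y | n | n | republican |
| 119 | n | n | n | y | y | y | n | n | n | n | n | y | y | y | n | n | republican |
| 339 | y | n | n | y | y | y | n | n | n | n | y | y | y | y | n | n | republican |

(Columns 1 to 16 are the attribute numbers.)
Class outlier mining takes the class labels of the dataset into consideration.
Class outlier: instance 407, the only democrat among near-identical republican voting records.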
![Page 16: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/16.jpg)
Party Behavior of house‐vote‐84 dataset
16
[Figure: bar chart of vote counts (y, n, no vote) by party (Democrat, Republican) for Issues 3, 4, 5, 8 and 12; x-axis: Issue #, y-axis: Votes (0 to 300).]
![Page 17: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/17.jpg)
Class Outlier Definitions
• Three definitions address the problem: “given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels”.
– Semantic Outlier [He, et al. WAIM’02, 2002]
– Cross Outlier [Papadimitriou and Faloutsos, SSTD’03, 2003]
– Generalization of def. 1 & 2 [He, et al. ESWA'04, 2004]
17
![Page 18: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/18.jpg)
Semantic Outlier
• Semantic Outlier: a data point that behaves differently from other data points in the same class, while looking normal with respect to data points in another class [He, et al. 2002]
18
![Page 19: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/19.jpg)
Cross Outlier
• Cross‐outlier: Given two sets (or classes) of objects, find those which deviate with respect to the other set [Papadimitriou and Faloutsos 2003]
19
![Page 20: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/20.jpg)
Definition (3)
• Generalization of Definition 1 & 2: The generalization does not consider only outliers that deviate with respect to their own class, but also outliers that deviate with respect to other classes [He, et al. 2004].
20
![Page 21: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/21.jpg)
The proposed definitions
• Distance (Similarity) Function
• K Nearest Neighbors
• PCL(T, K)
• Deviation
• K‐Distance
• Class Outlier
• Class Outlier Factor (COF)
21
![Page 22: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/22.jpg)
Distance (Similarity) Function
• Given a data set D = {t1, t2, t3, ..., tn} of tuples, where each tuple ti = <ti1, ti2, ti3, ..., tim, Ci> contains m attributes and the class label Ci, the similarity function based on the Euclidean distance between two data tuples X = <x1, x2, x3, ..., xm> and Y = <y1, y2, y3, ..., ym> (excluding the class labels) is
22
$$d(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
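As a minimal sketch of the Euclidean distance defined on this slide, assuming purely numeric attributes (the function name `euclidean_distance` is illustrative and not taken from the authors' implementation):

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two attribute tuples (class labels excluded)."""
    if len(x) != len(y):
        raise ValueError("tuples must have the same number of attributes")
    # Sum the squared attribute-wise differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```

Nominal attributes would need an encoding or a different per-attribute difference; that choice is not covered by this sketch.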
![Page 23: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/23.jpg)
K Nearest Neighbours
• For any positive integer K, the K-Nearest Neighbours of a tuple ti are the K closest tuples in the data set.
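A brute-force sketch of the K nearest neighbours lookup, using `math.dist` (Python 3.8+) for the Euclidean distance. It assumes that the query tuple counts as its own first neighbour at distance 0, which is consistent with PCL ranging over [1/K, 1] later in the slides; the function name and the tie-breaking rule are illustrative choices.

```python
import math


def k_nearest_neighbours(data, index, k):
    """Return the indices of the k tuples closest to data[index].

    Assumption: the tuple itself (distance 0) is included in the result.
    Ties are broken in favour of the tuple itself, then by lower index.
    """
    target = data[index]
    order = sorted(
        range(len(data)),
        key=lambda j: (math.dist(target, data[j]), j != index, j),
    )
    return order[:k]


points = [(0, 0), (1, 0), (5, 5), (0, 2)]
print(k_nearest_neighbours(points, 0, 3))  # [0, 1, 3]
```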
23
![Page 24: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/24.jpg)
PCL (T, K)
• The probability of the class label of the instance T with respect to the class labels of its K nearest neighbours.
• The instance T has the class label y, so the PCL of the instance T is 2/7.
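A small sketch of PCL as the fraction of the K nearest neighbours that share the class label of T. It assumes the neighbour list already includes T itself, so the smallest possible value is 1/K; the function and argument names are hypothetical.

```python
def pcl(labels, neighbour_indices, index):
    """Fraction of the given neighbours whose label equals that of instance `index`."""
    same = sum(1 for j in neighbour_indices if labels[j] == labels[index])
    return same / len(neighbour_indices)


# Seven neighbours (T itself at position 0); two of them carry the label 'y'.
labels = ["y", "n", "n", "y", "n", "n", "n"]
print(pcl(labels, list(range(7)), 0) == 2 / 7)  # True
```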
24
![Page 25: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/25.jpg)
Deviation (T)
• Let DCL = {t1, t2, t3, ..., th} be a subset of a data set D = {t1, t2, t3, ..., tn}, where h is the number of instances in DCL and n is the number of instances in D.
• Given the instance T, DCL contains all the instances that have the same class label as T.
• The Deviation of T is how much the instance T deviates from the DCL subset.
25
![Page 26: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/26.jpg)
Deviation (T) (Cont.)
• The Deviation is computed by summing the distance between the instance T and every instance in DCL.
26
$$\mathrm{Deviation}(T) = \sum_{i=1}^{h} d(T, t_i), \quad \text{where } t_i \in DCL.$$
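The Deviation sum can be sketched as follows, assuming numeric attributes and Euclidean distance via `math.dist` (Python 3.8+). The instance T belongs to its own class subset, but it contributes a distance of 0, so including it does not change the result.

```python
import math


def deviation(data, labels, index):
    """Sum of distances from data[index] to every instance with the same class label."""
    target = data[index]
    return sum(
        math.dist(target, data[j])
        for j in range(len(data))
        if labels[j] == labels[index]
    )


data = [(0, 0), (3, 4), (6, 8), (1, 1)]
labels = ["a", "a", "a", "b"]
print(deviation(data, labels, 0))  # 15.0
```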
![Page 27: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/27.jpg)
Deviation (T) (Cont.)
27
![Page 28: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/28.jpg)
K‐Distance (The Density Factor)
• K‐Distance is the distance between the instance T and its K nearest neighbours, i.e. how close the K nearest neighbours are to the instance T.
28
$$\mathrm{KDist}(T) = \sum_{i=1}^{K} d(T, t_i)$$
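A sketch of the K-Distance sum under the same assumptions as before: Euclidean distance via `math.dist` (Python 3.8+) and the instance itself counted among its K nearest neighbours at distance 0. The function name is illustrative.

```python
import math


def k_distance(data, index, k):
    """Sum of the distances from data[index] to its k nearest neighbours."""
    target = data[index]
    # Sorting all distances is O(n log n); fine for a sketch, not for large data.
    distances = sorted(math.dist(target, row) for row in data)
    return sum(distances[:k])


points = [(0, 0), (1, 0), (0, 2), (5, 5)]
print(k_distance(points, 0, 3))  # 3.0
```

A small K-Distance means the neighbourhood around T is dense, which is why the slides also call it the density factor.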
![Page 29: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/29.jpg)
Class Outlier: The proposed Definition
• Class Outliers are the top N instances that satisfy the following:
– The K‐Distance to its K nearest neighbours is the least.
– Its Deviation is the greatest.
– It has a different class label from that of its K nearest neighbours.
29
![Page 30: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/30.jpg)
Class Outlier Factor (COF)
• COF: The Class Outlier Factor of the instance T is the degree to which T is a Class Outlier. It is defined as:
$$\mathrm{COF}(T) = K \cdot \mathrm{PCL}(T) + \alpha \cdot \frac{1}{\mathrm{Deviation}(T)} + \beta \cdot \mathrm{KDist}(T)$$
• PCL(T) is scaled from [1/K, 1] to [1, K] by multiplying it by K. The α and β factors control the importance and the effects of Deviation and K-Distance, where 0 ≤ α ≤ M and 0 ≤ β ≤ 1. M is a changeable value based on the application domain and the initial experimental results.
30
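To make the formula concrete, here is a short sketch that evaluates COF from its three ingredients. The function name `cof` is illustrative; the input values are the ones reported in these slides for instance 407 of the house-vote-84 dataset (K = 7, PCL = 1/7, Deviation = 896.24, K-Distance = 6.0, α = 100, β = 0.1).

```python
def cof(k, pcl, deviation, kdist, alpha, beta):
    """Class Outlier Factor: K * PCL + alpha / Deviation + beta * KDist."""
    if deviation <= 0:
        raise ValueError("Deviation must be positive")
    return k * pcl + alpha / deviation + beta * kdist


value = cof(k=7, pcl=1 / 7, deviation=896.24, kdist=6.0, alpha=100, beta=0.1)
print(round(value, 5))  # 1.71158
```

The result agrees with the COF listed for instance 407 in the experimental results, which suggests this reading of the formula is the intended one.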
![Page 31: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/31.jpg)
Guidelines for choosing K, α and β
• If the Deviation is in the hundreds, for example, the best value for α is 100; if the Deviation is in the tens, the best value for α is 10; and so on.
• The optimal value of K is determined by trial and error.
– Many factors affect the optimal value; for example, dataset size and number of classes are very important factors in choosing the value of K.
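One way to read the α guideline is "pick the power of ten that matches the magnitude of the Deviation values". The helper below is only a sketch of that reading; `suggest_alpha` and the idea of passing a single typical Deviation value are assumptions, not part of the original method.

```python
import math


def suggest_alpha(typical_deviation):
    """Return the power of ten matching the order of magnitude of the Deviation."""
    if typical_deviation <= 0:
        raise ValueError("Deviation must be positive")
    return 10 ** math.floor(math.log10(typical_deviation))


print(suggest_alpha(896.24))  # 100
print(suggest_alpha(45.0))    # 10
```

K still has to be tuned experimentally, as the slide notes.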
31
![Page 32: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/32.jpg)
Optimal value of K
• High value of K might result in wrong estimation of PCL.
• Low value of K means KNN is not well utilized.
• Odd values of K would make more sense for PCL value.
32
![Page 33: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/33.jpg)
CODB Algorithm Basic Steps
33
• Rank each instance in the dataset D.
– This is done by calling the Rank procedure after providing CODB with all the necessary data, such as the values of α, β and K.
– The Rank procedure computes the rank of each instance using the COF formula (slide 30) and returns the rank to CODB.
• CODB maintains a list of only the top N class outliers.
– The lower the COF value of an instance, the higher its priority to be a class outlier.
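The basic steps above can be sketched as a brute-force ranking loop. This is not the authors' Weka implementation: it assumes numeric attributes, Euclidean distance via `math.dist` (Python 3.8+), each instance counted among its own K nearest neighbours, and a zero Deviation treated as infinitely high COF. The name `codb` and its signature are illustrative.

```python
import heapq
import math


def codb(data, labels, k, alpha, beta, top_n):
    """Return the top_n (COF, index) pairs with the smallest COF values."""
    n = len(data)
    scored = []
    for i in range(n):
        # Distances to all instances; the instance itself sorts first on ties.
        dists = sorted((math.dist(data[i], data[j]), j != i, j) for j in range(n))
        nearest = dists[:k]
        same = sum(1 for _, _, j in nearest if labels[j] == labels[i])
        kdist = sum(d for d, _, _ in nearest)
        dev = sum(
            math.dist(data[i], data[j]) for j in range(n) if labels[j] == labels[i]
        )
        # Assumption: a class with a single member gets the lowest priority.
        dev_term = alpha / dev if dev > 0 else math.inf
        # K * PCL equals the count of same-class neighbours.
        scored.append((same + dev_term + beta * kdist, i))
    return heapq.nsmallest(top_n, scored)


data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
        (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
top = codb(data, labels, k=3, alpha=10, beta=0.1, top_n=1)
print(top[0][1])  # 4
```

Instance 4 carries label "b" but sits in the middle of the "a" cluster, so it has the lowest COF. The loop is quadratic in the number of instances; a spatial index would be needed for large datasets.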
![Page 34: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/34.jpg)
CODB features
• Direct method: no need for clustering.
• Handles numeric (continuous), nominal and mixed datasets.
• Works on datasets with more than two classes.
• More specific in data object ranking than other related methods.
34
![Page 35: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/35.jpg)
Experimental results
• The CODB algorithm has been applied to five different real-world datasets.
• All the datasets are publicly available at the UCI machine learning repository.
• The datasets are chosen from various domains; they have single or mixed data types and two or more class labels. This variety is used to show the capabilities of the proposed algorithm.
35
![Page 36: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/36.jpg)
We performed experiments on the following datasets:
• Votes dataset: Nominal, 2 class labels.
• Hepatitis dataset: Mixed, 2 class labels.
• Heart‐statlog dataset: Mixed, 2 class labels.
• Credit approval (credit‐a) dataset: good mix of attributes (continuous, nominal with small numbers of values, and nominal with larger numbers of values), 2 class labels.
• Vehicle dataset: Continuous, 4 class labels.
36
![Page 37: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/37.jpg)
Votes dataset experimental results
• 1984 United States Congressional Voting Records Database.
• Includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes.
• 16 Boolean attributes + class name = 17
• 435 instances, 2 classes (61.38% Democrats, 38.62% Republicans).
37
![Page 38: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/38.jpg)
THE TOP 20 CLASS OUTLIERS OF HOUSE‐VOTE‐84 DATASET
• K = 7
• Top N COF = 20
• Distance type: Euclidean Distance
• α = 100
• β = 0.1
• Remove Instance With Missing Values: false
• Replace Missing Values: false
38
![Page 39: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/39.jpg)
| # | Inst. # | PCL × K | Dev | KDist | COF |
|---|---|---|---|---|---|
| 1 | 407 | 1 | 896.24 | 6.0 | 1.71158 |
| 2 | 375 | 1 | 881.78 | 8.07 | 1.92051 |
| 3 | 388 | 1 | 857.35 | 8.07 | 1.92375 |
| 4 | 161 | 1 | 819.35 | 8.49 | 1.97058 |
| 5 | 267 | 1 | 523.66 | 8.49 | 2.03949 |
| 6 | 71 | 1 | 535.52 | 9.44 | 2.13061 |
| 7 | 77 | 1 | 799.64 | 10.39 | 2.16429 |
| 8 | 325 | 2 | 846.64 | 8.49 | 2.96664 |
| 9 | 160 | 2 | 829.17 | 8.49 | 2.96913 |
| 10 | 382 | 2 | 851.76 | 9.02 | 3.01986 |
| 11 | 176 | 2 | 519.94 | 8.49 | 3.04086 |
| 12 | 384 | 2 | 832.95 | 9.34 | 3.0543 |
| 13 | 365 | 2 | 836.65 | 9.44 | 3.0634 |
| 14 | 6 | 2 | 849.33 | 9.76 | 3.0934 |
| 15 | 355 | 2 | 524.28 | 9.12 | 3.10283 |
| 16 | 164 | 2 | 845.19 | 10.07 | 3.12576 |
| 17 | 402 | 2 | 480.79 | 10.93 | 3.30081 |
| 18 | 151 | 2 | 879.84 | 12.0 | 3.31366 |
| 19 | 173 | 3 | 839.0 | 8.07 | 3.9263 |
| 20 | 75 | 3 | 841.28 | 9.44 | 4.06275 |

39
![Page 40: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/40.jpg)
Comparison Study
• We performed a comparison study with He’s method (2002, 2004).
• Semantic Outlier Factor vs. Class Outlier Factor (SOF vs. COF).
40
![Page 41: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/41.jpg)
SOF vs. COF
41
![Page 42: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/42.jpg)
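THE TOP 20 SEMANTIC OUTLIERS FOR HOUSE‐VOTE‐84 DATASET
42

| # | Inst. # | SOF |
|---|---|---|
| 1 | 176 | 0.3036 |
| 2 | 71 | 0.3394 |
| 3 | 355 | 0.3645 |
| 4 | 267 | 0.3659 |
| 5 | 183 | 0.8726 |
| 6 | 97 | 0.9892 |
| 7 | 88 | 1.0724 |
| 8 | 402 | 1.1690 |
| 9 | 407 | 1.3309 |
| 10 | 248 | 1.3487 |
| 11 | 375 | 1.4520 |
| 12 | 151 | 1.4927 |
| 13 | 372 | 1.4950 |
| 14 | 388 | 1.6365 |
| 15 | 2 | 1.6489 |
| 16 | 382 | 1.6727 |
| 17 | 215 | 1.7010 |
| 18 | 164 | 1.7168 |
| 19 | 6 | 1.7236 |
| 20 | 325 | 1.7259 |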
![Page 43: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/43.jpg)
Comparison Study (Cont.)
• Instance # 176 of votes dataset
– Rank 1 (the top) using SOF.
– Rank 11 using COF.
– PCL(176) = 2/7 → there is another instance of the same class within the seven nearest neighbours.
43
![Page 44: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/44.jpg)
Comparison Study (Cont.)
• Instance # 407 of votes dataset
– Rank 9 using SOF.
– Rank 1 (the top) using COF.
– PCL(407) = 1/7.
– The Deviation is the greatest, which implies a sort of uniqueness of the instance (object) behaviour.
– The K-Distance of the instance is very small (high density of other class type).
– SOF(407) = rank 9 → indicates that SOF fails to recognize such important cases.
44
![Page 45: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/45.jpg)
Conclusions
• In this research we proposed and introduced:
– A novel approach for mining Class Outliers based on the K nearest neighbours, using a distance‐based similarity function to determine the nearest neighbours.
– The motivation for Class Outliers and their significance as exceptional cases.
– A ranking score, the Class Outlier Factor (COF), to measure the degree to which an object is a Class Outlier.
45
![Page 46: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/46.jpg)
Conclusions (Cont.)
– An efficient algorithm for mining and detecting Class Outliers.
– An implementation has been developed using the Weka framework.
• We presented:
– Experimental results of the algorithm applied to datasets from various domains (medical, business, and other domains) and of different types (continuous, nominal with small numbers of values, nominal with larger numbers of values, and mixed).
46
![Page 47: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/47.jpg)
Conclusions (Cont.)
– A comparison study with other methods, whose results show that our proposed algorithm gives more plausible and reasonable results than the others. In addition, it handles mixed data types and more than two class labels.
47
![Page 48: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/48.jpg)
Future work
• Proposing a Class Outlier Detection Model.
• Using the output of this work to devise a scheme for inducing Censored Production Rules (CPRs) from large datasets.
• Developing a weighted distance similarity function, where feature weight determination might be based on information gain.
48
![Page 49: Class Outlier Mining](https://reader035.vdocuments.net/reader035/viewer/2022081512/555cc948d8b42a64718b56fd/html5/thumbnails/49.jpg)
Thank You !... Q&A
49