Data mining

A Linear Method for Deviation Detection in Large Databases

Presented by: Ali Triki

Date: 09/30/1999



Content

What are deviations?
Approach
Exact exception problem
Sequential exception problem
Algorithm
Dissimilarity function
Experimental results
Conclusion


What are Deviations?

Deviations are errors or noise in data.

Several approaches exist for detecting deviations (or exceptions) in the areas of databases and machine learning:

Statistical approach (Hoaglin 1983)

Extending learning algorithms to cope with small amounts of noise (Aha 1991)

Impact of erroneous examples on the learning results (Quinlan 1986)


Approach

Use the implicit redundancy in the data to detect deviations.

Cluster the data into two clusters: deviations and non-deviations.

Do not discard deviations as noise, but try to isolate small minorities.


Exact Exception Problem

Problem description:

Set of items: $I = \{1, 4, 4, 4\}$

Cardinality function: $C(I)$

Dissimilarity function: the variance of the numbers in the set, $D(I) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Smoothing factor: $SF(I_j) = C(I - I_j) \cdot (D(I) - D(I - I_j))$

Computing the smoothing factor for each candidate exception set $I_j$ gives the following results:


Example

The candidate set $I_j = \{1\}$ is the exception because it has the largest smoothing factor SF.
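The example can be checked by brute force. The sketch below is only an illustration: it assumes the cardinality function C simply counts elements, treats I as a multiset, and enumerates every non-empty proper subset as a candidate, which is exponential and only workable for tiny inputs. The function names are hypothetical.

```python
from itertools import combinations


def variance(values):
    """Population variance (1/n) * sum((x - mean)^2); 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def smoothing_factors(items):
    """Map each candidate exception set I_j (as a sorted tuple) to SF(I_j).

    Assumption: C(I - I_j) is the number of elements left after removing I_j.
    """
    d_all = variance(items)
    result = {}
    # Candidates are the non-empty proper sub-multisets of `items`.
    for size in range(1, len(items)):
        for idx in combinations(range(len(items)), size):
            candidate = tuple(sorted(items[i] for i in idx))
            rest = [x for i, x in enumerate(items) if i not in idx]
            result[candidate] = len(rest) * (d_all - variance(rest))
    return result


sf = smoothing_factors([1, 4, 4, 4])
best = max(sf, key=sf.get)
print(best, sf[best])  # (1,) 5.0625
```

Removing {1} leaves {4, 4, 4} with zero variance, so the full variance of 1.6875 is credited to three remaining elements, giving 5.0625. Removing a single 4 instead raises the variance of the remainder, which makes that candidate's smoothing factor negative.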


Sequential Exception Problem

After seeing a series of similar data, an element that disturbs the series is considered an exception.

Given:

A set of items $I$

A sequence $S$ of subsets $I_j$ with $I_j \subseteq I$ and $I_{j-1} \subset I_j$

A cardinality function $C$

Smoothing factor: $SF(I_j) = C(I_j - I_{j-1}) \cdot (D(I_j) - D(I_{j-1}))$

The smoothing factor considers the difference from the preceding set instead of the complementary set.


Algorithm

1- Get the first element $i_1$ of the item set $I$, making up the one-element subset $I_1 \subseteq I$, and compute $D_s(I_1)$.

2- For each following element $i_j$ in $S$, create the subset $I_j = I_{j-1} \cup \{i_j\}$ and compute the difference in dissimilarity values $d_j = D_s(I_j) - D_s(I_{j-1})$.

3- Consider the element $i_j$ with the maximal value of $d_j > 0$ to be the answer for this iteration. If $d_j \le 0$ for all $I_j$ in $S$, there is no exception.


Algorithm

If an exception $i_j$ is found:

For each element $i_k$ where $k > j$, compute

$d_{k0} = D_s(I_{j-1} \cup \{i_k\}) - D_s(I_{j-1})$

$d_{k1} = D_s(I_j \cup \{i_k\}) - D_s(I_j)$

Add to $I_x$ those $i_k$ for which $d_{k0} - d_{k1} \ge d_j$.

After $m$ iterations we have $m$ competing exception sets $I_x$; select the one with the largest difference in dissimilarity $d_j$, scaled with the cardinality function $C$.
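The steps above can be sketched in Python as follows. This is a rough illustration under several assumptions that the slides do not state: each of the m iterations works on a randomly shuffled ordering of the items, the final scaling multiplies d_j by the number of items left outside the exception set, and the dissimilarity function is the variance from the exact-problem example. It also recomputes the dissimilarity from scratch at every step, so it is quadratic; the linear running time depends on a dissimilarity function that can be updated incrementally. All function names are hypothetical.

```python
import random


def variance(values):
    """Population variance, used here as a stand-in dissimilarity function."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def one_pass(sequence, dissimilarity):
    """Run steps 1-3 and the follow-up check on one ordering of the items.

    Returns (exception_items, d_j), or ([], 0.0) when no d_j is positive.
    """
    sequence = list(sequence)
    best_j, best_d = None, 0.0
    # Step 2: d_j = D(I_j) - D(I_{j-1}) for every prefix of the sequence.
    for j in range(1, len(sequence)):
        d = dissimilarity(sequence[:j + 1]) - dissimilarity(sequence[:j])
        if d > best_d:  # Step 3: keep the maximal positive d_j.
            best_j, best_d = j, d
    if best_j is None:
        return [], 0.0

    prefix = sequence[:best_j]        # I_{j-1}
    with_exc = sequence[:best_j + 1]  # I_j
    exceptions = [sequence[best_j]]
    # Follow-up: later elements that behave like the exception join I_x.
    for ik in sequence[best_j + 1:]:
        dk0 = dissimilarity(prefix + [ik]) - dissimilarity(prefix)
        dk1 = dissimilarity(with_exc + [ik]) - dissimilarity(with_exc)
        if dk0 - dk1 >= best_d:
            exceptions.append(ik)
    return exceptions, best_d


def find_exceptions(items, dissimilarity, m=10, seed=0):
    """Repeat one_pass on m shuffled orderings and keep the best-scoring set."""
    rng = random.Random(seed)
    best_set, best_score = [], 0.0
    for _ in range(m):
        order = list(items)
        rng.shuffle(order)  # assumption: random ordering per iteration
        exc, d = one_pass(order, dissimilarity)
        score = d * (len(items) - len(exc))  # assumed cardinality scaling
        if score > best_score:
            best_set, best_score = exc, score
    return best_set, best_score


print(one_pass([4, 4, 4, 1, 4], variance))  # ([1], 1.6875)
```

The result of a single pass depends on the ordering: if the odd element happens to come first, the pass flags the elements that follow it instead, which is one reason to compare several orderings and to scale the score by cardinality.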


Dissimilarity function

Handles the comparison of character strings: it maintains a pattern, in the form of a regular expression, that matches all the character strings seen so far.

Starting with the pattern of the first string, we introduce wildcard characters as more strings need to be covered.

$D_s(I_j) = D_s(I_{j-1}) + j \cdot \frac{M_s(I_j) - M_s(I_{j-1})}{M_s(I_j)}$

Auxiliary function: $M_s(I_j) = \frac{1}{3c - w + 2}$, with $c$ being the total number of characters and $w$ the number of needed wildcards.
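The update rule can be traced numerically. The sketch below takes the pattern statistics (c, w) after each prefix as given and applies the two formulas; it assumes $D_s(I_1) = 0$ as the starting value, which the slides do not specify, and the (c, w) values in the example are made up for illustration.

```python
def m_s(c, w):
    """Auxiliary function M_s = 1 / (3c - w + 2)."""
    return 1.0 / (3 * c - w + 2)


def dissimilarity_trace(pattern_stats):
    """Return D_s(I_1), ..., D_s(I_n) for a list of (c, w) pairs.

    pattern_stats[j - 1] holds (c, w) of the pattern covering I_j.
    Assumption: D_s(I_1) = 0.
    """
    values = [0.0]
    for j in range(2, len(pattern_stats) + 1):
        m_prev = m_s(*pattern_stats[j - 2])
        m_curr = m_s(*pattern_stats[j - 1])
        values.append(values[-1] + j * (m_curr - m_prev) / m_curr)
    return values


# Hypothetical 5-character pattern that needs its first wildcard at the third string.
trace = dissimilarity_trace([(5, 0), (5, 0), (5, 1), (5, 1)])
print(trace)
```

In this trace the dissimilarity jumps by 3/17 at the third string, where the wildcard is introduced, and stays flat afterwards. The factor j weights a pattern change more heavily the later it occurs in the sequence.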


Experimental Results 1


Experimental Results 2


Experimental Results 3


A Failure example


Why did it fail?

The dissimilarity function used couldn’t catch the exception.

Once the two values ‘..,n,..’ and ‘..,y,..’ have been seen, the pattern takes the form ‘..,*,..’. From then on, the pattern does not change when ‘?’ appears in the same column, because the wildcard already covers it.

Need a more powerful dissimilarity function.
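The failure can be reproduced with a toy version of the pattern update. This is a simplified sketch, not the pattern language used in the experiments: it assumes equal-length strings and a position-wise merge that replaces any mismatching character with a `*` wildcard. The sample strings are invented.

```python
def generalize(pattern, string, wildcard="*"):
    """Position-wise merge: keep matching characters, wildcard the rest."""
    if len(pattern) != len(string):
        raise ValueError("this sketch only handles equal-length strings")
    return "".join(p if p == s else wildcard for p, s in zip(pattern, string))


pattern = "a,n,b"
pattern = generalize(pattern, "a,y,b")
print(pattern)  # a,*,b

after_question_mark = generalize(pattern, "a,?,b")
print(after_question_mark == pattern)  # True
```

Because the pattern is unchanged by the string containing `?`, the values of c and w stay the same, the dissimilarity difference for that element is zero, and the element is never flagged.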


Conclusion

We presented a linear algorithm for the sequential exception problem.

Experimental evaluation shows that the effectiveness of the algorithm depends on the dissimilarity function used.

It seems helpful to have predefined dissimilarity functions that work well for particular datasets.


References:

A. Arning, R. Agrawal, P. Raghavan: "A Linear Method for Deviation Detection in Large Databases", Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August, 1996

S. Sarawagi, R. Agrawal, N. Megiddo: "Discovery-driven exploration of OLAP data cubes", Proc. of the Sixth Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, March 1998

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the VLDB Conference, 1994