Knowledge discovery & data mining: Classification & fraud detection

5/24/00

Table of Contents

Knowledge discovery & data mining: Classification & fraud detection

Module outline

The classification task

Classification systems and inductive learning

Train & test

Train & test

Training step

Test step

Prediction

Machine learning terminology

Comparing classifiers

Classical example: play tennis?

Module outline

Bayesian classification

Estimating a-posteriori probabilities

Naïve Bayesian Classification

Play-tennis example: estimating P(xi|C)

Author: Dino Pedreschi

Knowledge discovery & data mining Classification & fraud detection

file:///C|/My Documents/2DM_class/index.htm (1 of 3) [5/24/2000 10:49:42 PM]

Play-tennis example: classifying X

The independence hypothesis…

Module outline

Decision trees

Classical example: play tennis?

Decision tree obtained with ID3 (Quinlan 86)

From decision trees to classification rules

Decision tree induction

Generate_DT(samples, attribute_list) =

Criteria for finding the best split

Information gain (ID3 – C4.5)

Information gain (ID3 – C4.5)

Information gain - play tennis example

Gini index (CART)

Gini index - play tennis example

Entropy vs. Gini (on continuous attributes)

Other criteria in decision tree construction

The overfitting problem

Stopping vs. pruning

If dataset is large

If data set is not so large

Categorical vs. continuous attributes

Summarizing …

Scalability to large databases

Module outline

Backpropagation

Backpropagation

Prediction and (statistical) regression

Other methods (not covered)

Classification with decision trees

What have we achieved?

References - classification

References - classification

Slide 1 of 50

Knowledge discovery & data mining

Slide 2 of 50

Module outline

Slide 3 of 50

The classification task

Slide 4 of 50

Classification systems and inductive learning

Slide 5 of 50

Train & test

Slide 6 of 50

Train & test

Slide 7 of 50

Training step

Slide 8 of 50

Test step

Slide 9 of 50

Prediction

Slide 10 of 50

Machine learning terminology

Slide 11 of 50

Comparing classifiers

Slide 12 of 50

Classical example: play tennis?

Slide 13 of 50

Module outline

Slide 14 of 50

Bayesian classification

Slide 15 of 50

Estimating a-posteriori probabilities

Slide 16 of 50

Naïve Bayesian Classification

Slide 17 of 50

Play-tennis example: estimating P(xi|C)

Slide 18 of 50

Play-tennis example: classifying X

Slide 19 of 50

The independence hypothesis…

Slide 20 of 50

Module outline

Slide 21 of 50

Decision trees

Slide 22 of 50

Classical example: play tennis?

Slide 23 of 50

Decision tree obtained with ID3 (Quinlan 86)

Slide 24 of 50

From decision trees to classification rules

Slide 25 of 50

Decision tree induction

Slide 26 of 50

Generate_DT(samples, attribute_list) =

Slide 27 of 50

Criteria for finding the best split

Slide 28 of 50

Information gain (ID3 – C4.5)

Slide 29 of 50

Information gain (ID3 – C4.5)

Slide 30 of 50

Information gain - play tennis example

Slide 31 of 50

Gini index (CART)

Slide 32 of 50

Gini index - play tennis example

Slide 33 of 50

Entropy vs. Gini (on continuous attributes)

Slide 34 of 50

Other criteria in decision tree construction

Slide 35 of 50

The overfitting problem

Slide 36 of 50

Stopping vs. pruning

Slide 37 of 50

If dataset is large

Slide 38 of 50

If data set is not so large

Slide 39 of 50

Categorical vs. continuous attributes

Slide 40 of 50

Summarizing …

Slide 41 of 50

Scalability to large databases

Slide 42 of 50

Module outline

Slide 43 of 50

Backpropagation

Slide 44 of 50

Backpropagation

Slide 45 of 50

Prediction and (statistical) regression

Slide 46 of 50

Other methods (not covered)

Slide 47 of 50

Classification with decision trees

Slide 48 of 50

What have we achieved?

Slide 49 of 50

References - classification

Slide 50 of 50

References - classification
