
Prediction of Diabetes by Employing a Meta-Heuristic Which Can Optimize the Performance of Existing Data Mining Approaches

by Huy Nguyen Anh Pham and Evangelos Triantaphyllou
ICIS’2008 – Portland, Oregon, May 14 - 16, 2008
Department of Computer Science, Louisiana State University
Baton Rouge, LA 70803
Emails: [email protected] and [email protected]

These slides and the source codes are available at: www.csc.lsu.edu/~huypham


Outline

Diabetes and the Pima Indian Diabetes (PID) dataset

Selected current work

Motivation

The Homogeneity Based Algorithm (HBA)

Rationale for the HBA

Some computational results

Conclusions


Diabetes and the PID dataset

Diabetes: If the body does not produce or properly use insulin, the excess sugar is excreted through urination. This phenomenon (or disease) is called diabetes.

20.8 million children and adults in the United States (approximately 7% of the population) were diagnosed with diabetes (American Diabetes Association, 11/2007).

The Pima Indian Diabetes (PID) dataset: 768 records describing female patients of Pima Indian heritage, at least 21 years old, living near Phoenix, Arizona, USA (UCI Machine Learning Repository, 2007).


Diabetes and the PID dataset – Cont’d

The eight attributes for each record in the PID:

1. Number of times pregnant
2. Plasma glucose concentration in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-hour serum insulin (μU/ml)
6. Body mass index (kg/m²)
7. Diabetes pedigree function
8. Age (years)


Selected Current Work

76.0% diagnosis accuracy by Smith et al (1988) when using an early neural network.

77.6% diagnosis accuracy by Jankowski and Kadirkamanathan (1997) when using IncNet.

77.6% diagnosis accuracy by Au and Chan (2001) using a fuzzy approach.

78.6% diagnosis accuracy by Rutkowski and Cpalka (2003) when using a flexible neural-fuzzy inference system (FLEXNFIS).

81.8% diagnosis accuracy by Davis (2006) when using a fuzzy neural network.

Less than 78% diagnosis accuracy by the Statlog project (1994) when using different classification algorithms.


Motivation

In medical diagnosis there are three different types of possible errors:

The false-negative type, in which a patient who in reality has the disease is diagnosed as disease free.

The false-positive type in which a patient, who in reality does not have that disease, is diagnosed as having that disease.

The unclassifiable type, in which the diagnostic system cannot diagnose a given case. This happens when insufficient knowledge can be extracted from the historical data.


Motivation – Cont’d

Current medical data mining approaches often assign equal penalty costs to the false-positive and the false-negative types:

Diagnosing a new patient as a false positive may:
Make the patient worry unnecessarily.
Lead to unnecessary treatments and expenses.
But it is not life-threatening.

Diagnosing a new patient as a false negative may mean:
No treatment on time, or none at all.
Conditions may deteriorate and the patient’s life may be at risk.

=> The two penalty costs for the false-positive and the false-negative types may be significantly different.


Motivation – Cont’d

Current medical data mining approaches ignore the penalty cost for the unclassifiable type:

Because of insufficient knowledge extracted from the historical data, a given patient should sometimes be predicted as belonging to the unclassifiable type.

However, in reality current approaches have often predicted the patient as either having diabetes or being disease free.

Such misdiagnosis may lead to either unnecessary treatments or no treatment when one is needed.

=> Consideration for the unclassifiable type is required.


Outline

Diabetes and the PID dataset

Selected current work

Motivation

The Homogeneity Based Algorithm (HBA)

Rationale for the HBA

Some computational results

Conclusions


The Homogeneity Based Algorithm - HBA

Developed by the authors of this presentation (Pham and Triantaphyllou, 2007 and 2008).

Define the total misclassification cost TC as an optimization problem in terms of the false-positive, the false-negative, and the unclassifiable costs:

minimize TC = CFP × RateFP + CFN × RateFN + CUC × RateUC    (1)

where:
- RateFP, RateFN, and RateUC are the false-positive, the false-negative, and the unclassifiable rates, respectively.
- CFP, CFN, and CUC are the penalty costs for the false-positive, the false-negative, and the unclassifiable cases, respectively. Usually, CFN is much higher than CFP and CUC.

The goal is to minimize the total misclassification cost.
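Equation (1) is straightforward to compute. The sketch below is a minimal Python illustration (not the authors' code); the function name and default weights are chosen here for the example, using the Case 3 weights (1, 20, 3) from the results slides:

```python
# Minimal sketch of Eq. (1): TC = CFP*RateFP + CFN*RateFN + CUC*RateUC.
# Function name and defaults are illustrative; defaults use the
# Case 3 weights (CFP = 1, CFN = 20, CUC = 3).

def total_cost(rate_fp, rate_fn, rate_uc, c_fp=1.0, c_fn=20.0, c_uc=3.0):
    """Total misclassification cost for given error counts and penalties."""
    return c_fp * rate_fp + c_fn * rate_fn + c_uc * rate_uc

# ANN baseline counts from the results slides:
print(total_cost(22, 39, 118))  # → 1156.0, the ANN entry in Case 3
```

With the ANN baseline counts (RateFP = 22, RateFN = 39, RateUC = 118), this reproduces the TC value of 1,156 reported for Case 3.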


The HBA - Some key observations

[Figure: patterns A and B with sample points P and Q]

Please notice that:

• Pattern A covers a region that is not adequately populated by training points.

• Pattern B does not have such sparsely covered regions.

• The assumption that point P is a diabetes case may not be that accurate.

• However, the assumption that point Q is a diabetes case may be more accurate.

• The accuracy of the inferred systems can be increased if the derived patterns correspond to homogenous sets.

• A homogenous set describes a steady or uniform distribution of a set of distinct points.


The HBA - Some key observations – cont’d

[Figure: pattern A broken into sub-patterns A1 and A2, shown with pattern B and sample points P, Q, and S]

Break pattern A into A1 and A2. Suppose that all of the patterns A1, A2, and B correspond to homogenous sets.

The number of points in B is higher than that in A1.

Thus, the assumption that point Q is a diabetes case may be more accurate than the assumption that point S is a diabetes case.

The accuracy of the inferred systems may also be affected by the density of a pattern, which is therefore used as the Homogeneity Degree (HD).
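The slides tie the Homogeneity Degree to density but do not spell out its formula here, so the sketch below uses a plain points-per-volume density of a hypersphere as a stand-in; `hypersphere_volume` and `density_hd` are hypothetical names for this illustration only:

```python
# Hypothetical sketch: a density-style Homogeneity Degree, computed as
# points per unit volume of a d-dimensional hypersphere. The actual HBA
# formula for HD is not given on these slides.
import math

def hypersphere_volume(radius, dim):
    """Volume of a d-dimensional ball: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return (math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)) * radius ** dim

def density_hd(num_points, radius, dim):
    """Density-style homogeneity degree: points per unit volume."""
    return num_points / hypersphere_volume(radius, dim)

# A pattern like B, holding more points in the same volume than A1,
# gets a higher degree, so claims about its interior points (like Q)
# are trusted more:
print(density_hd(50, 1.0, 2) > density_hd(10, 1.0, 2))  # → True
```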


The HBA – The Main Algorithm

Phase #1: Assume that a training dataset T is given. Divide T into two sub-datasets:
T1, whose size is, say, equal to 90% of T’s size.
T2, whose size is, say, equal to 10% of T’s size.
The training points in T1 are randomly selected from T.

Phase #2: Apply a classification approach (such as a DT, ANN, or SVM) on the training dataset T1 to infer the classification systems. Suppose that each classification system consists of a set of patterns. Break the inferred patterns into hyperspheres.

Phase #3: Determine whether or not the hyperspheres are homogenous sets. If so, compute their Homogeneity Degrees and go to Phase #4. Otherwise, break them into smaller hyperspheres and repeat Phase #3 until all the hyperspheres are homogenous sets.
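Phases #1 and #3 can be sketched as follows. This is an illustrative skeleton, not the authors' implementation: `is_homogeneous` and `break_up` are hypothetical callbacks standing in for the HBA's homogeneity test and hypersphere-splitting step, and the toy demo treats a "hypersphere" as a plain list of points:

```python
# Sketch of Phase #1 (random 90/10 split) and Phase #3 (break
# hyperspheres until every one is a homogenous set).
import random

def split_training_set(T, frac=0.9, seed=0):
    """Phase #1: randomly split T into T1 (~90%) and T2 (~10%)."""
    rng = random.Random(seed)
    shuffled = T[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]

def refine_to_homogeneous(spheres, is_homogeneous, break_up):
    """Phase #3: keep breaking hyperspheres until all are homogenous."""
    result, stack = [], list(spheres)
    while stack:
        s = stack.pop()
        if is_homogeneous(s):
            result.append(s)
        else:
            stack.extend(break_up(s))
    return result

T1, T2 = split_training_set(list(range(10)))
print(len(T1), len(T2))  # → 9 1

# Toy demo: call a "sphere" homogeneous once it holds at most 2 points.
out = refine_to_homogeneous(
    [[1, 2, 3, 4], [5, 6]],
    is_homogeneous=lambda s: len(s) <= 2,
    break_up=lambda s: [s[: len(s) // 2], s[len(s) // 2:]],
)
print(sorted(out))  # → [[1, 2], [3, 4], [5, 6]]
```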


The HBA – The Main Algorithm – cont’d

Phase #4: Sort the Homogeneity Degrees in decreasing order. For each homogenous set:
If its Homogeneity Degree is greater than a certain threshold value, expand it. Otherwise, break it into smaller homogenous sets, which may then be expanded. The phase stops when all of the homogenous sets have been processed.

Phase #5: Apply a genetic algorithm (GA) over Phases #2 to #4 to find optimal threshold values, using the total misclassification cost as the objective function and the dataset T2 as a calibration dataset.
After obtaining the optimal threshold values, the training points in T2 can be divided into two sub-datasets:
T2,1, which consists of the classifiable points.
T2,2, which includes the unclassifiable points.
Let S1 denote the classification system inferred after the GA approach is completed. The dataset T2,2 is then run through Phases #2 to #4 with the optimal threshold values obtained from the GA approach to infer an additional classification system S2. The final classification system is the union of S1 and S2.
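As a rough illustration of the Phase #5 calibration loop, the sketch below swaps the GA for plain random search over the four thresholds (α+, α-, β+, β-); `evaluate_tc` is a hypothetical callback that would rerun Phases #2 to #4 with the candidate thresholds and score TC on the calibration set T2. A toy quadratic objective stands in for that run here:

```python
# Sketch of Phase #5's threshold calibration. The actual HBA uses a
# genetic algorithm; plain random search is substituted here for brevity.
import random

def search_thresholds(evaluate_tc, n_trials=200, seed=1):
    """Randomly sample threshold vectors, keep the one with lowest TC."""
    rng = random.Random(seed)
    best_tc, best_params = float("inf"), None
    for _ in range(n_trials):
        params = {k: rng.uniform(0.0, 1.0)
                  for k in ("alpha_pos", "alpha_neg", "beta_pos", "beta_neg")}
        tc = evaluate_tc(params)
        if tc < best_tc:
            best_tc, best_params = tc, params
    return best_params, best_tc

# Toy objective standing in for the real calibration run on T2:
toy = lambda p: (p["alpha_pos"] - 0.7) ** 2 + (p["beta_neg"] - 0.3) ** 2
params, tc = search_thresholds(toy)
print(sorted(params))  # → ['alpha_neg', 'alpha_pos', 'beta_neg', 'beta_pos']
```

With 200 samples the search lands close to the toy optimum; the GA in the actual HBA plays the same role but explores the threshold space more efficiently.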


Rationale for the HBA

Consider the problem as an optimization formulation in terms of the false-positive, the false-negative, and the unclassifiable costs.

The HBA optimally adjusts the inferred classification systems.

We use the Homogeneity Degree in the control conditions for both expansion (to control generalization) and breaking (to control fitting).

Homogenous sets are expanded in decreasing order of their Homogeneity Degrees.


Some computational results

Parameters: The four parameters needed in the HBA are:
Two expansion threshold values, α+ and α-, used for expanding the positive and negative homogenous sets, respectively.
Two breaking threshold values, β+ and β-, used for breaking the positive and negative homogenous sets, respectively.

Experimental methodology:

Step 1: The original algorithm was first trained on the training dataset T, and the value of TC was then derived using the testing dataset.

Step 2: The HBA was trained on the training dataset T1, and the value of TC was then derived, also using the testing dataset.

Step 3: The two values of TC returned in Steps 1 and 2 were compared with each other.
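Step 3's comparison reduces to a percent improvement; a minimal sketch (illustrative function name, not the authors' code), checked against the Case 1 ANN numbers from the results slides:

```python
# Percent improvement of the HBA's TC over the original algorithm's TC.
def improvement(tc_original, tc_hba):
    return 100.0 * (tc_original - tc_hba) / tc_original

# Case 1 ANN: TC drops from 61 to 10 with the HBA.
print(round(improvement(61, 10), 2))  # → 83.61, as reported in Case 1
```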


Some computational results – Cont’d

Case 1: minimize TC = 1 × RateFP + 1 × RateFN + 0 × RateUC
(i.e., only the false-positive and the false-negative costs are considered, and they are weighted equally).

The HBA, on average, decreased the total misclassification cost by about 81.57%.

Algorithm | RateFP | RateFN | RateUC | TC | Avg. % of improvement
SVM       |      0 |     74 |    109 | 74 |
DT        |     27 |     36 |    118 | 63 |
ANN       |     22 |     39 |    118 | 61 |
SVM-HBA   |      0 |     10 |    143 | 10 | 86.49%
DT-HBA    |      0 |     16 |    113 | 16 | 74.60%
ANN-HBA   |      0 |     10 |    143 | 10 | 83.61%

Algorithm                   | % Accuracy | Avg. % of improvement
Smith et al                 | 76.0%      |
Jankowski & Kadirkamanathan | 77.6%      |
Au & Chan                   | 77.6%      |
Rutkowski & Cpalka          | 78.6%      |
Davis                       | 81.8%      |
Statlog                     | 77.7%      |
SVM-HBA                     | 94.79%     | 16.53%
ANN-HBA                     | 94.79%     | 16.53%
DT-HBA                      | 91.67%     | 13.45%


Some computational results – Cont’d

Case 2: minimize TC = 3 × RateFP + 3 × RateFN + 3 × RateUC
(i.e., all three costs are assumed to be equal).

The HBA, on average, decreased the total misclassification cost by about 50.48%.

Algorithm | RateFP | RateFN | RateUC | TC  | % of improvement
SVM       |      0 |     74 |    109 | 549 |
DT        |     27 |     36 |    118 | 543 |
ANN       |     22 |     39 |    118 | 537 |
SVM-HBA   |      2 |     40 |     54 | 288 | 47.54%
DT-HBA    |      1 |     61 |     24 | 258 | 52.49%
ANN-HBA   |      1 |     57 |     29 | 261 | 51.40%

Case 3: minimize TC = 1 × RateFP + 20 × RateFN + 3 × RateUC
(i.e., the false-negative cost is assumed to be much higher than the other two costs).

The HBA, on average, decreased the total misclassification cost by about 51.59%.

Algorithm | RateFP | RateFN | RateUC | TC    | % of improvement
SVM       |      0 |     74 |    109 | 1,807 |
DT        |     27 |     36 |    118 | 1,101 |
ANN       |     22 |     39 |    118 | 1,156 |
SVM-HBA   |      0 |     16 |    105 |   635 | 64.86%
DT-HBA    |      5 |     10 |    136 |   613 | 44.32%
ANN-HBA   |      0 |     10 |    143 |   629 | 45.59%


Some computational results – Cont’d

Case 4: minimize TC = 1 × RateFP + 100 × RateFN + 3 × RateUC
(i.e., the false-negative cost is assumed to be significantly higher than the other two costs).

The HBA, on average, decreased the total misclassification cost by about 76.00%.

Algorithm | RateFP | RateFN | RateUC | TC    | % of improvement
SVM       |      0 |     74 |    109 |   401 |
DT        |     27 |     36 |    118 | 3,090 |
ANN       |     22 |     39 |    118 | 2,593 |
SVM-HBA   |     61 |      1 |     24 |   233 | 41.90%
DT-HBA    |     56 |      1 |     32 |   252 | 91.84%
ANN-HBA   |     59 |      0 |     30 |   149 | 94.25%

The higher the penalty cost set for the false-negative type, the fewer false-negative cases occur.


Conclusions

Millions of people in the United States and in the world have diabetes.

The ability to predict diabetes early plays an important role in the patient’s treatment.

The correct prediction percentage of current algorithms may oftentimes be coincidental.

This study identified the need for different penalty costs for the false-positive, the false-negative, and the unclassifiable types of errors in medical data mining.

This study applied a meta-heuristic approach, called the Homogeneity-Based Algorithm (HBA), to enhance diabetes prediction.

The HBA first defines the desired goal as an optimization problem in terms of the false-positive, the false-negative, and the unclassifiable costs.

The HBA is then used in conjunction with traditional classification algorithms (such as SVMs, DTs, and ANNs) to enhance the prediction accuracy.

The Pima Indian diabetes dataset has been used for evaluating the performance of the HBA.

The obtained results appear to be very important both for accurately predicting diabetes and also for the medical data mining community in general.


References

Asuncion A. and D. J. Newman, “UCI Machine Learning Repository,” University of California, Irvine, California, USA, School of Information and Computer Sciences, 2007.

Smith J. W., J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. S. Johannes, “Using the ADAP learning algorithm to forecast the onset of diabetes mellitus,” Proceedings of the 12th Symposium on Computer Applications and Medical Care, Los Angeles, California, USA, 1988, pp. 261 - 265.

Jankowski N. and V. Kadirkamanathan, “Statistical control of RBF-like networks for classification,” Proceedings of the 7th International Conference on Artificial Neural Networks (ICANN), Lausanne, Switzerland, 1997, pp. 385 - 390.

Au W. H. and K. C. C. Chan, “Classification with degree of membership: A fuzzy approach,” Proceedings of the 1st IEEE Int'l Conference on Data Mining, San Jose, California, USA, 2001, pp. 35 - 42.

Rutkowski L. and K. Cpalka, “Flexible neuro-fuzzy systems,” IEEE Transactions on Neural Networks, Vol. 14, 2003, pp. 554 - 574.

Davis W. L. IV, “Enhancing Pattern Classification with Relational Fuzzy Neural Networks and Square BK-Products,” PhD Dissertation in Computer Science, 2006, pp. 71 - 74.

Michie D., D. J. Spiegelhalter, and C. C. Taylor, “Machine Learning, Neural and Statistical Classification,” Englewood Cliffs in Series Artificial Intelligence, Prentice Hall, Chapter 9, 1994, pp. 157 - 160.

Pham H. N. A. and E. Triantaphyllou, “The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining,” in Soft Computing for Knowledge Discovery and Data Mining, (O. Maimon and L. Rokach, Editors), Springer, New-York, USA, 2007, Part 4, Chapter 5, pp. 391 - 431.

Pham H. N. A. and E. Triantaphyllou, "An Optimization Approach for Improving Accuracy by Balancing Overfitting and Overgeneralization in Data Mining," submitted for publication, January 2008.

Thank you. Any questions?